04 May, 2013

1 commit


30 Apr, 2013

1 commit


29 Apr, 2013

1 commit


28 Mar, 2013

1 commit

  • In the case where an inode has a very stale transaction id (tid) in
    i_datasync_tid or i_sync_tid, it's possible that after a very large
    (2**31) number of transactions the tid number space might wrap, causing
    tid_geq()'s calculations to fail.

    Commit d9b0193 "jbd: fix fsync() tid wraparound bug" attempted to fix
    this problem, but it only avoided kjournald spinning forever by fixing
    the logic in jbd_log_start_commit().

    Signed-off-by: Jan Kara

    Jan Kara
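
    A minimal userspace sketch of the failure mode: tid_geq() below mirrors jbd's
    signed-difference comparison, while the surrounding demo is illustrative
    rather than kernel code.

    #include <stdio.h>

    typedef unsigned int tid_t;

    /* jbd compares tids via a signed difference, so the comparison is only
     * meaningful while the two values are within 2**31 of each other. */
    static int tid_geq(tid_t x, tid_t y)
    {
        int difference = (int)(x - y);
        return difference >= 0;
    }

    int main(void)
    {
        tid_t stale = 10;                     /* tid cached in an inode long ago */
        tid_t committed = stale + (1u << 31); /* journal tid 2**31 commits later */

        /* Prints 0: the long-since-committed stale tid now compares as being
         * "in the future", so code waiting on it can spin or block forever. */
        printf("tid_geq(committed, stale) = %d\n", tid_geq(committed, stale));
        return 0;
    }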
     

15 Jan, 2013

1 commit

  • Don't send an extra wakeup to kjournald in the case where we already have
    the proper target in j_commit_request, i.e. that transaction has already
    been requested for commit.

    Commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed the logic
    leading to a wakeup, but it caused some extra wakeups which were found to
    lead to a measurable performance regression.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Jan Kara

    Eric Sandeen
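
    A self-contained sketch of the idea (names and structure are illustrative,
    not the literal patch): the wakeup is issued only when the request actually
    advances j_commit_request.

    #include <stdio.h>

    typedef unsigned int tid_t;

    struct journal_demo {
        tid_t commit_request;  /* highest tid for which a commit was requested */
        int wakeups;           /* stands in for wake_up(&journal->j_wait_commit) */
    };

    static void request_commit(struct journal_demo *j, tid_t target)
    {
        if ((int)(target - j->commit_request) <= 0)
            return;            /* already requested: no extra wakeup */
        j->commit_request = target;
        j->wakeups++;          /* kjournald gets poked only here */
    }

    int main(void)
    {
        struct journal_demo j = { .commit_request = 5, .wakeups = 0 };

        request_commit(&j, 6);               /* new target: one wakeup       */
        request_commit(&j, 6);               /* repeat request: no extra one */
        printf("wakeups: %d\n", j.wakeups);  /* prints 1 */
        return 0;
    }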
     

15 Aug, 2012

1 commit

  • This sequence:

    results in an IO error when unmounting the RO filesystem. The bug was
    introduced by:

    commit 9754e39c7bc51328f145e933bfb0df47cd67b6e9
    Author: Jan Kara
    Date: Sat Apr 7 12:33:03 2012 +0200

    jbd: Split updating of journal superblock and marking journal empty

    which lost some of the magic in journal_update_superblock() that used to
    test for a journal with no outstanding transactions.

    This is a port of a jbd2 fix by Eric Sandeen.

    CC: # 3.4.x
    Signed-off-by: Jan Kara

    Jan Kara
     

04 Aug, 2012

1 commit


16 May, 2012

3 commits

  • If the journal superblock is written only to the disk's caches and another
    transaction starts reusing the space of the transaction cleaned from the log,
    blocks of the new transaction can reach the disk before the journal
    superblock. If a power failure happens in that case, subsequent journal
    replay would still try to replay the old transaction, but some of its blocks
    may already be overwritten by the new transaction. For this reason we must
    use WRITE_FUA when updating the log tail, and we must first write the new
    log tail to disk and update the in-memory information only after that.

    Signed-off-by: Jan Kara

    Jan Kara
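
    A sketch of the ordering rule described above (types and the helper are
    hypothetical stand-ins; WRITE_FUA itself is the real block-layer flag meant
    here).

    #include <stdio.h>

    struct journal_demo {
        unsigned long tail_block;
        unsigned int  tail_sequence;
    };

    /* Stand-in for writing the journal superblock with WRITE_FUA and waiting
     * for completion, i.e. the new tail is on stable storage, not just in the
     * disk's write cache. */
    static void write_superblock_fua(unsigned int seq, unsigned long block)
    {
        printf("tail seq %u at block %lu is now on stable storage\n", seq, block);
    }

    static void update_log_tail(struct journal_demo *journal,
                                unsigned int new_seq, unsigned long new_block)
    {
        write_superblock_fua(new_seq, new_block); /* 1. durable on disk first   */
        journal->tail_sequence = new_seq;         /* 2. only then let the freed */
        journal->tail_block    = new_block;       /*    log space be reused     */
    }

    int main(void)
    {
        struct journal_demo j = { .tail_block = 100, .tail_sequence = 7 };
        update_log_tail(&j, 8, 164);
        return 0;
    }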
     
  • There are some log tail updates that are not protected by j_checkpoint_mutex.
    Some of these are harmless because they happen during startup or shutdown, but
    the updates in journal_commit_transaction() and journal_flush() can really
    race with other log tail updates (e.g. someone doing journal_flush() while
    someone else runs cleanup_journal_tail()). So protect all log tail updates
    with j_checkpoint_mutex.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • There are three cases of updating the journal superblock. In the first case
    we want to mark the journal as empty (setting s_sequence to 0), in the second
    case we want to update the log tail, and in the third case we want to update
    s_errno. Split these cases into separate functions. It makes the code slightly
    more straightforward, and later patches will make the distinction even more
    important.

    Signed-off-by: Jan Kara

    Jan Kara
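
    In outline, the split gives each caller a single-purpose helper along these
    lines (signatures paraphrased from the description above; the exact names in
    the patch may differ).

    typedef struct journal_s journal_t;   /* opaque here; defined by jbd */

    /* Case 1: mark the journal empty on disk (the s_sequence = 0 case). */
    void journal_mark_empty(journal_t *journal);

    /* Case 2: write out a new log tail (tail block and tail sequence). */
    void journal_update_sb_log_tail(journal_t *journal);

    /* Case 3: record s_errno so a journal abort is visible on the next mount. */
    void journal_update_sb_errno(journal_t *journal);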
     

11 Apr, 2012

1 commit

  • Currently we write out all journal buffers in WRITE_SYNC mode. This improves
    performance for fsync-heavy workloads but hinders performance when writes are
    mostly asynchronous: most noticeably, it slows down readers, and users
    complain about slow desktop response and the like.

    So submit writes as asynchronous in the normal case, and only submit writes as
    WRITE_SYNC if we detect that someone is waiting for the current transaction
    commit.

    I've gathered some numbers to back this change. The first is the read latency
    test. It measures the time to read 1 MB after several seconds of sleeping in
    the presence of streaming writes.

    Top 10 times (out of 90) in us:
         Before      After
        2131586     697473
        1709932     557487
        1564598     535642
        1480462     347573
        1478579     323153
        1408496     222181
        1388960     181273
        1329565     181070
        1252486     172832
        1223265     172278

    Average:
         619377      82180

    So the improvement in both maximum and average latency is massive.

    I've measured fsync throughput by:
    fs_mark -n 100 -t 1 -s 16384 -d /mnt/fsync/ -S 1 -L 4

    in the presence of a streaming reader. The numbers (fsyncs/s) are:
        Before    After
           9.9      6.3
           6.8      6.0
           6.3      6.2
           5.8      6.1

    So fsync performance seems unharmed by this change.

    Signed-off-by: Jan Kara

    Jan Kara
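
    The decision being described reduces to something like the following (a
    sketch only; the real code inspects whether anyone is waiting on the
    transaction being committed).

    #include <stdbool.h>

    enum submit_mode { SUBMIT_ASYNC, SUBMIT_WRITE_SYNC };

    /* Sketch: journal buffers go out asynchronously unless somebody is blocked
     * waiting for the current transaction commit (fsync and friends), in which
     * case low latency matters more than read-friendliness. */
    static enum submit_mode journal_submit_mode(bool commit_has_waiters)
    {
        return commit_has_waiters ? SUBMIT_WRITE_SYNC : SUBMIT_ASYNC;
    }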
     

22 Mar, 2012

1 commit

  • Pull power management updates for 3.4 from Rafael Wysocki:
    "Assorted extensions and fixes including:

    * Introduction of early/late suspend/hibernation device callbacks.
    * Generic PM domains extensions and fixes.
    * devfreq updates from Axel Lin and MyungJoo Ham.
    * Device PM QoS updates.
    * Fixes of concurrency problems with wakeup sources.
    * System suspend and hibernation fixes."

    * tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
    PM / Domains: Check domain status during hibernation restore of devices
    PM / devfreq: add relation of recommended frequency.
    PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
    PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
    PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
    PM / Domains: Introduce "always on" device flag
    PM / Domains: Fix hibernation restore of devices, v2
    PM / Domains: Fix handling of wakeup devices during system resume
    sh_mmcif / PM: Use PM QoS latency constraint
    tmio_mmc / PM: Use PM QoS latency constraint
    PM / QoS: Make it possible to expose PM QoS latency constraints
    PM / Sleep: JBD and JBD2 missing set_freezable()
    PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
    PM / Freezer: Remove references to TIF_FREEZE in comments
    PM / Sleep: Add more wakeup source initialization routines
    PM / Hibernate: Enable usermodehelpers in hibernate() error path
    PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
    PM / Sleep: Fix race conditions related to wakeup source timer function
    PM / Sleep: Fix possible infinite loop during wakeup source destruction
    PM / Hibernate: print physical addresses consistently with other parts of kernel
    ...

    Linus Torvalds
     

20 Mar, 2012

1 commit


14 Mar, 2012

1 commit

  • With the latest and greatest changes to the freezer, I started seeing
    panics that were caused by jbd2 running post-process freezing and
    hitting the canary BUG_ON for non-TuxOnIce I/O submission. I've traced
    this back to a lack of set_freezable calls in both jbd and jbd2. Since
    they're clearly meant to be frozen (there are tests for freezing()), I
    submit the following patch to add the missing calls.

    Signed-off-by: Nigel Cunningham
    Acked-by: Jan Kara
    Signed-off-by: Rafael J. Wysocki

    Nigel Cunningham
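
    The missing piece is the usual freezable-kthread pattern; a simplified shape
    of the loop is sketched below (set_freezable() and try_to_freeze() are the
    real freezer APIs, while the loop itself and journal_should_stop() are
    condensed, hypothetical stand-ins).

    /* Without set_freezable(), kernel threads are ignored by the freezer, so
     * the freezing()/try_to_freeze() checks in the loop never trigger. */
    static int kjournald_shape(void *arg)
    {
        set_freezable();                    /* opt this kthread into freezing  */

        while (!journal_should_stop(arg)) { /* hypothetical stop condition     */
            /* ... commit work, then sleep on j_wait_commit ... */
            try_to_freeze();                /* park here during suspend/resume */
        }
        return 0;
    }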
     

10 Jan, 2012

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext2/3/4: delete unneeded includes of module.h
    ext{3,4}: Fix potential race when setversion ioctl updates inode
    udf: Mark LVID buffer as uptodate before marking it dirty
    ext3: Don't warn from writepage when readonly inode is spotted after error
    jbd: Remove j_barrier mutex
    reiserfs: Force inode evictions before umount to avoid crash
    reiserfs: Fix quota mount option parsing
    udf: Treat symlink component of type 2 as /
    udf: Fix deadlock when converting file from in-ICB one to normal one
    udf: Cleanup calling convention of inode_getblk()
    ext2: Fix error handling on inode bitmap corruption
    ext3: Fix error handling on inode bitmap corruption
    ext3: replace ll_rw_block with other functions
    ext3: NULL dereference in ext3_evict_inode()
    jbd: clear revoked flag on buffers before a new transaction started
    ext3: call ext3_mark_recovery_complete() when recovery is really needed

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • The j_barrier mutex is used for serializing different journal lock
    operations. The problem with it is that e.g. the FIFREEZE ioctl results in a
    process leaving the kernel with the j_barrier mutex held, which makes lockdep
    freak out. Also, the hibernation code wants to freeze the filesystem, but then
    it cannot hibernate the system because that mutex is still held.

    So we remove the j_barrier mutex and use a direct wait on j_barrier_count
    instead. Since locking the journal is a rare operation, we don't have to care
    about fairness or such things.

    CC: Andrew Morton
    Acked-by: Joel Becker
    Signed-off-by: Jan Kara

    Jan Kara
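
    A paraphrase of the replacement scheme (reconstructed from the description,
    not copied from the diff; journal fields as in jbd, locking details
    simplified).

    /* Lockers no longer take a j_barrier mutex that stays held while the
     * filesystem is frozen.  Instead they wait until nobody else holds the
     * barrier and then bump j_barrier_count; unlocking decrements the count
     * and wakes the waiters. */
    static void journal_lock_updates_sketch(journal_t *journal)
    {
    repeat:
        wait_event(journal->j_wait_transaction_locked,
                   journal->j_barrier_count == 0);
        spin_lock(&journal->j_state_lock);
        if (journal->j_barrier_count > 0) {      /* lost the race, wait again */
            spin_unlock(&journal->j_state_lock);
            goto repeat;
        }
        ++journal->j_barrier_count;
        spin_unlock(&journal->j_state_lock);
        /* ... then wait for the running transaction's updates to drain ... */
    }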
     

22 Nov, 2011

1 commit

  • There is no reason to export two functions for entering the
    refrigerator. Calling refrigerator() instead of try_to_freeze()
    doesn't save anything noticeable or remove any race condition.

    * Rename refrigerator() to __refrigerator() and make it return bool
    indicating whether it scheduled out for freezing.

    * Update try_to_freeze() to return bool and relay the return value of
    __refrigerator() if freezing().

    * Convert all refrigerator() users to try_to_freeze().

    * Update documentation accordingly.

    * While at it, add might_sleep() to try_to_freeze().

    Signed-off-by: Tejun Heo
    Cc: Samuel Ortiz
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Steven Whitehouse
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: Christoph Hellwig

    Tejun Heo
     

02 Nov, 2011

1 commit

  • I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when
    mounting an fsfuzzed ext3 image. It turns out that the corrupted ext3
    image has s_first = 0 in the journal superblock, and the 0 is passed to
    journal->j_head in journal_reset(), then to blocknr in
    cleanup_journal_tail(), where the J_ASSERT finally fails.

    So validate s_first after reading the journal superblock from disk in
    journal_get_superblock() to ensure s_first is valid.

    The following script could reproduce it:

    fstype=ext3
    blocksize=1024
    img=$fstype.img
    offset=0
    found=0
    magic="c0 3b 39 98"

    dd if=/dev/zero of=$img bs=1M count=8
    mkfs -t $fstype -b $blocksize -F $img
    filesize=`stat -c %s $img`
    while [ $offset -lt $filesize ]
    do
        if od -j $offset -N 4 -t x1 $img | grep -i "$magic"; then
            echo "Found journal: $offset"
            found=1
            break
        fi
        offset=`echo "$offset+$blocksize" | bc`
    done

    if [ $found -ne 1 ]; then
        echo "Magic \"$magic\" not found"
        exit 1
    fi

    dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1

    mkdir -p ./mnt
    mount -o loop $img ./mnt

    Cc: Jan Kara
    Signed-off-by: Eryu Guan
    Signed-off-by: "Theodore Ts'o"

    Eryu Guan
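
    The added validation amounts to a range check on s_first when the journal
    superblock is read; roughly the following (reconstructed from the
    description, so the exact wording of the warning is illustrative).

    /* In journal_get_superblock(): a journal whose first usable block is 0
     * (or past the end of the journal) is corrupted and must be rejected,
     * otherwise j_head inherits the bogus value and cleanup_journal_tail()
     * later trips J_ASSERT(blocknr != 0). */
    if (be32_to_cpu(sb->s_first) == 0 ||
        be32_to_cpu(sb->s_first) >= journal->j_maxlen) {
        printk(KERN_WARNING "JBD: invalid start block of journal: %u\n",
               be32_to_cpu(sb->s_first));
        goto out;   /* fail journal load instead of mounting */
    }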
     

27 Jun, 2011

1 commit

  • journal_remove_journal_head() can oops when trying to access journal_head
    returned by bh2jh(). This is caused for example by the following race:

    TASK1                                      TASK2
    journal_commit_transaction()
      ...
      processing t_forget list
        __journal_refile_buffer(jh);
        if (!jh->b_transaction) {
          jbd_unlock_bh_state(bh);
                                               journal_try_to_free_buffers()
                                                 journal_grab_journal_head(bh)
                                                 jbd_lock_bh_state(bh)
                                                 __journal_try_to_free_buffer()
                                                 journal_put_journal_head(jh)
          journal_remove_journal_head(bh);

    journal_put_journal_head() in TASK2 sees that b_jcount == 0 and that the
    buffer is not part of any transaction, and thus frees the journal_head before
    TASK1 gets to doing so. Note that even the buffer_head can be released by
    try_to_free_buffers() after journal_put_journal_head(), which opens an even
    larger window for an oops (but I didn't see this happen in reality).

    Fix the problem by making transactions hold their own journal_head reference
    (in b_jcount). That way we don't have to remove the journal_head explicitly
    via journal_remove_journal_head() and instead just remove it when b_jcount
    drops to zero. The result of this is that [__]journal_refile_buffer(),
    [__]journal_unfile_buffer(), and __journal_remove_checkpoint() can free the
    journal_head, which needs modification of a few callers. Also we have to be
    careful because once the journal_head is removed, the buffer_head might be
    freed as well. So we have to get our own buffer_head reference where it
    matters.

    Signed-off-by: Jan Kara

    Jan Kara
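
    A condensed view of the new lifetime rule (paraphrased from the description;
    the real refcounting lives in journal_grab_journal_head() and
    journal_put_journal_head()).

    /* Every user of a journal_head, including a transaction that has the
     * buffer on one of its lists, holds a b_jcount reference.  Freeing is
     * centralized: the journal_head goes away only when the last reference
     * is dropped, so no path can yank it from under another. */
    void journal_put_journal_head(struct journal_head *jh)
    {
        struct buffer_head *bh = jh2bh(jh);

        jbd_lock_bh_journal_head(bh);
        J_ASSERT_JH(jh, jh->b_jcount > 0);
        --jh->b_jcount;
        if (!jh->b_jcount) {
            __journal_remove_journal_head(bh);  /* last ref: detach and free */
            jbd_unlock_bh_journal_head(bh);
            __brelse(bh);                       /* drop our buffer_head ref  */
        } else {
            jbd_unlock_bh_journal_head(bh);
        }
    }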
     

25 Jun, 2011

1 commit

  • This commit adds fixed tracepoints for jbd. They are based on the fixed
    tracepoints for jbd2, but the ones for collecting statistics are missing,
    since adding them would require a more intrusive patch and so should get its
    own commit if someone decides it is needed. There are also new tracepoints
    in __journal_drop_transaction() and journal_update_superblock().

    The list of jbd tracepoints:

    jbd_checkpoint
    jbd_start_commit
    jbd_commit_locking
    jbd_commit_flushing
    jbd_commit_logging
    jbd_drop_transaction
    jbd_end_commit
    jbd_do_submit_data
    jbd_cleanup_journal_tail
    jbd_update_superblock_end

    Signed-off-by: Lukas Czerner
    Cc: Jan Kara
    Signed-off-by: Jan Kara

    Lukas Czerner
     

17 May, 2011

1 commit

  • If an application program does not make any changes to the indirect
    blocks or extent tree, i_datasync_tid will not get updated. If there
    are enough commits (i.e., 2**31) such that tid_geq()'s calculations
    wrap, and there isn't a currently active transaction at the time of
    the fdatasync() call, this can end up triggering a BUG_ON in
    fs/jbd/commit.c:

    J_ASSERT(journal->j_running_transaction != NULL);

    It's pretty rare that this can happen, since it requires the use of
    fdatasync() plus *very* frequent and excessive use of fsync(). But
    with the right workload, it can.

    We fix this by replacing the use of tid_geq() with an equality test,
    since there's only one valid transaction id that is valid for us to
    start: namely, the currently running transaction (if it exists).

    CC: stable@kernel.org
    Reported-by: Martin_Zielinski@McAfee.com
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    Ted Ts'o
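
    The shape of the fix, as described above, is to accept only the currently
    running transaction's tid instead of relying on tid_geq(); a paraphrase of
    the changed check (not the full patch) follows.

    /* In __log_start_commit(): the only tid we can meaningfully start a
     * commit for is the running transaction's, so test for equality rather
     * than ordering, which a wrapped tid space can get wrong. */
    if (journal->j_running_transaction &&
        journal->j_running_transaction->t_tid == target) {
        journal->j_commit_request = target;
        wake_up(&journal->j_wait_commit);
        return 1;
    }
    return 0;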
     

31 Mar, 2011

1 commit


01 Mar, 2011

1 commit


28 Oct, 2010

4 commits


18 Aug, 2010

1 commit

  • These flags (the SWRITE* variants) aren't real I/O types, but tell
    ll_rw_block to always lock the buffer instead of giving up on a failed
    trylock.

    Instead, add a new write_dirty_buffer helper that implements this semantic
    and use it from the existing SWRITE* callers. Note that the ll_rw_block
    code had a bug where it didn't promote writes to WRITE_SYNC_PLUG properly,
    which this patch fixes.

    In the ufs code, clean up the helper that used to call ll_rw_block to
    mirror sync_dirty_buffer, which is the function it implements for
    compound buffers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
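
    The new helper is essentially sync_dirty_buffer without the wait; roughly
    the following (reconstructed from memory of fs/buffer.c, so treat the
    details as approximate).

    /* Lock the buffer unconditionally (the old SWRITE semantic), clear the
     * dirty bit, and submit the write; callers that need completion still
     * wait on the buffer themselves. */
    void write_dirty_buffer(struct buffer_head *bh, int rw)
    {
        lock_buffer(bh);
        if (!test_clear_buffer_dirty(bh)) {
            unlock_buffer(bh);          /* nothing to write */
            return;
        }
        bh->b_end_io = end_buffer_write_sync;
        get_bh(bh);
        submit_bh(rw, bh);
    }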
     

21 Jul, 2010

1 commit


22 May, 2010

2 commits


23 Dec, 2009

1 commit


12 Nov, 2009

1 commit


11 Nov, 2009

1 commit


16 Sep, 2009

1 commit


21 Jul, 2009

1 commit

  • The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
    too early; this could potentially allow another thread to call
    get_write_access on the buffer head, modify the data, and dirty it, allowing
    the wrong data to be written into the journal. Fortunately, if we lose this
    race, the only time this will actually cause filesystem corruption is if
    there is a system crash or other unclean shutdown before the next commit can
    take place.

    Signed-off-by: dingdinghua
    Acked-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    dingdinghua
     

16 Jul, 2009

1 commit


03 Apr, 2009

1 commit


12 Feb, 2009

1 commit

  • journal_start_commit() returns 1 if either a transaction is committing or
    the function has queued a transaction commit. But it returns 0 if we
    raced with somebody queueing the transaction commit as well. This
    resulted in ext3_sync_fs() not functioning correctly (description from
    Arthur Jones): in the case of a data=ordered umount with pending long
    symlinks which are delayed due to a long list of other I/O on the backing
    block device, this causes the buffer associated with the long symlinks to
    not be moved to the inode dirty list in the second phase of fsync_super.
    Then, before they can be dirtied again, kjournald exits upon seeing the
    UMOUNT flag, and the dirty pages are never written to the backing block
    device, causing long symlink corruption and exposing new or previously
    freed block data to userspace.

    This can be reproduced with a script created by Eric Sandeen:

    #!/bin/bash

    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    rm -f /mnt/test2/*
    dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
    touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
        /mnt/test2/link
    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    ls /mnt/test2/

    This patch fixes journal_start_commit() to always return 1 when there's
    a transaction committing or queued for commit.

    Cc: Eric Sandeen
    Cc: Mike Snitzer
    Cc:
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
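
    After the fix, journal_start_commit() reports success for both the running
    and the already-committing cases; a sketch of that logic (field names per
    jbd, structure paraphrased rather than copied).

    /* Sketch: report success for both the running and committing cases so
     * callers such as ext3_sync_fs() always learn a tid worth waiting for. */
    int journal_start_commit_sketch(journal_t *journal, tid_t *ptid)
    {
        int ret = 0;

        spin_lock(&journal->j_state_lock);
        if (journal->j_running_transaction) {
            tid_t tid = journal->j_running_transaction->t_tid;

            __log_start_commit(journal, tid);   /* make sure it is queued */
            if (ptid)
                *ptid = tid;
            ret = 1;
        } else if (journal->j_committing_transaction) {
            if (ptid)
                *ptid = journal->j_committing_transaction->t_tid;
            ret = 1;                            /* already being committed */
        }
        spin_unlock(&journal->j_state_lock);
        return ret;
    }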
     

23 Oct, 2008

1 commit

  • When a checkpointing IO fails, the current JBD code doesn't check the error
    and continues journaling. This means the latest metadata can be lost from
    both the journal and the filesystem.

    This patch leaves the failed metadata blocks in the journal space and
    aborts journaling in the case of log_do_checkpoint(). To achieve this, we
    need to do:

    1. don't remove the failed buffer from the checkpoint list in the case of
       __try_to_free_cp_buf(), because it may be released or overwritten by a
       later transaction
    2. log_do_checkpoint() is the last chance: remove the failed buffer from
       the checkpoint list and abort the journal
    3. when checkpointing fails, don't update the journal super block to
       prevent the journaled contents from being cleaned; for safety, don't
       update j_tail and j_tail_sequence either
    4. when checkpointing fails, notify this error to the ext3 layer so that
       ext3 doesn't clear the needs_recovery flag, otherwise the journaled
       contents are ignored and cleaned in the recovery phase
    5. if the recovery fails, keep the needs_recovery flag
    6. prevent cleanup_journal_tail() from being called between
       __journal_drop_transaction() and journal_abort() (a race issue between
       journal_flush() and __log_wait_for_space())

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
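
    The error handling described in points 2-4 boils down to propagating an I/O
    error out of the checkpoint path and aborting instead of advancing the tail;
    a condensed sketch (structure paraphrased, not the literal patch).

    /* In log_do_checkpoint(): buffers whose writeback failed are dropped from
     * the checkpoint list here (the last chance), and the journal is aborted
     * so the superblock/tail are not updated and ext3 keeps needs_recovery. */
    if (unlikely(buffer_write_io_error(bh)))
        result = -EIO;
    /* ... */
    if (result < 0)
        journal_abort(journal, result);   /* keep journaled data for recovery */
    else
        cleanup_journal_tail(journal);    /* only advance the tail on success */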