13 Dec, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Lots of bug fixes, including Zheng and Jan's extent status shrinker
    fixes, which should improve CPU utilization and potential soft lockups
    under heavy memory pressure, and Eric Whitney's bigalloc fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
    ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
    ext4: fix suboptimal seek_{data,hole} extents traversial
    ext4: ext4_inline_data_fiemap should respect callers argument
    ext4: prevent fsreentrance deadlock for inline_data
    ext4: forbid journal_async_commit in data=ordered mode
    jbd2: remove unnecessary NULL check before iput()
    ext4: Remove an unnecessary check for NULL before iput()
    ext4: remove unneeded code in ext4_unlink
    ext4: don't count external journal blocks as overhead
    ext4: remove never taken branch from ext4_ext_shift_path_extents()
    ext4: create nojournal_checksum mount option
    ext4: update comments regarding ext4_delete_inode()
    ext4: cleanup GFP flags inside resize path
    ext4: introduce aging to extent status tree
    ext4: cleanup flag definitions for extent status tree
    ext4: limit number of scanned extents in status tree shrinker
    ext4: move handling of list of shrinkable inodes into extent status code
    ext4: change LRU to round-robin in extent status tree shrinker
    ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
    ext4: fix block reservation for bigalloc filesystems
    ...

    Linus Torvalds
     

02 Dec, 2014

1 commit

  • When we're enabling journal features, we cannot use the predicate
    jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
    feature flag fields! Moreover, we just finished loading the shash
    driver, so the test is unnecessary; calculate the seed always.

    Without this patch, we fail to initialize the checksum seed the first
    time we turn on journal_checksum, which means that all journal blocks
    written during that first mount are corrupt. Transactions written
    after the second mount will be fine, since the feature flag will be
    set in the journal superblock. xfstests generic/{034,321,322} are the
    regression tests.

    (This is important for 3.18.)

    Signed-off-by: Darrick J. Wong
    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     

26 Nov, 2014

1 commit


30 Oct, 2014

1 commit


18 Sep, 2014

2 commits

  • __jbd2_journal_clean_checkpoint_list() returns the number of buffers it
    freed, but no one was using the value, so just stop doing that. This
    also allows for simplifying the calling convention for
    journal_clean_one_cp_list().

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Yuanhan has reported that when he runs an fsync(2)-heavy workload
    creating new files on a ramdisk, a significant amount of time is spent in
    __jbd2_journal_clean_checkpoint_list() trying to clean old transactions
    (but they cannot be cleaned up because the flusher hasn't yet checkpointed
    those buffers). The workload can be generated by:
    fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

    Reduce the amount of scanning by stopping the scan of the transaction
    list once we find a transaction that cannot be checkpointed. Note that
    this way of cleaning is still enough to keep freeing space in the journal
    after fully checkpointed transactions.

    Reported-and-tested-by: Yuanhan Liu
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
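
    The early-exit scan described in the commit above can be illustrated
    with a small userspace sketch. Everything here is hypothetical:
    struct toy_txn and toy_clean_checkpoint_list() are stand-ins invented
    for illustration and are not the jbd2 data structures or functions.
    The list is assumed to be ordered oldest-first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction on the checkpoint list. */
struct toy_txn {
    int pending_buffers;  /* buffers the flusher has not written out yet */
    bool freed;           /* set once the transaction has been released */
};

/*
 * Walk the list oldest-first and release every fully checkpointed
 * transaction. Stop at the first transaction that still has pending
 * buffers instead of scanning the rest of the list.
 * Returns the number of transactions examined.
 */
size_t toy_clean_checkpoint_list(struct toy_txn *list, size_t n)
{
    size_t scanned = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        scanned++;
        if (list[i].pending_buffers > 0)
            break;            /* cannot be cleaned yet: give up early */
        list[i].freed = true; /* fully checkpointed: release it */
    }
    return scanned;
}
```

    In this toy model a clean transaction sitting behind a busy one is
    left alone until a later pass, which trades some promptness for a
    much shorter scan.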
     

17 Sep, 2014

2 commits

  • If EIO happens after we have dropped j_state_lock, we won't notice
    that the journal has been aborted. So it is reasonable to move this
    check after we have grabbed the j_checkpoint_mutex and re-grabbed the
    j_state_lock. This patch helps to prevent a false-positive complaint
    after EIO.

    #DMESG:
    __jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
    __jbd2_log_wait_for_space: no way to get more journal space in ram1-8
    ------------[ cut here ]------------
    WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
    Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139
    Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
    00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
    0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
    ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
    Call Trace:
    [] dump_stack+0x51/0x6d
    [] warn_slowpath_common+0x8c/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] __jbd2_log_wait_for_space+0x188/0x200
    [] start_this_handle+0x4da/0x7b0
    [] ? local_clock+0x25/0x30
    [] ? lockdep_init_map+0xe7/0x180
    [] jbd2__journal_start+0xdc/0x1d0
    [] ? __ext4_new_inode+0x7f4/0x1330
    [] __ext4_journal_start_sb+0xf8/0x110
    [] __ext4_new_inode+0x7f4/0x1330
    [] ? lock_release_holdtime+0x29/0x190
    [] ext4_create+0x8b/0x150
    [] vfs_create+0x7b/0xb0
    [] do_last+0x7db/0xcf0
    [] ? inode_permission+0x4d/0x50
    [] path_openat+0x242/0x590
    [] ? __alloc_fd+0x36/0x140
    [] do_filp_open+0x4a/0xb0
    [] ? __alloc_fd+0x121/0x140
    [] do_sys_open+0x170/0x220
    [] SyS_open+0x1e/0x20
    [] SyS_creat+0x16/0x20
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace cd71c831f82059db ]---

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o

    Dmitry Monakhov
     
  • Free the buffer head if the journal descriptor block fails checksum
    verification.

    This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum
    verify error in do_one_pass".

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Sandeen
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     

11 Sep, 2014

1 commit


05 Sep, 2014

3 commits

  • Since the jbd/jbd2 superblock is not released until the file system is
    unmounted, allocate the buffer cache from the non-moveable area to
    allow page migration and CMA allocations to more easily succeed.

    Signed-off-by: Gioh Kim
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Gioh Kim
     
  • When we discover a written-out buffer in a transaction's checkpoint
    list, we don't have to recheck the validity of the transaction. Either
    this is the last buffer in the transaction, and then we are done, or it
    isn't, and then we can just take another buffer from the checkpoint list
    without dropping j_list_lock.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • The __jbd2_journal_remove_checkpoint() doesn't require an elevated
    b_count; indeed, until the jh structure gets released by the call to
    jbd2_journal_put_journal_head(), the bh's b_count is elevated by
    virtue of the existence of the jh structure.

    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

02 Sep, 2014

2 commits

  • __wait_cp_io() is only called by jbd2_log_do_checkpoint(). Fold it in
    to make it a bit easier to understand.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • __process_buffer() is only called by jbd2_log_do_checkpoint(), and it
    had a very complex locking protocol where it would be called with the
    j_list_lock, and sometimes exit with the lock held (if the return code
    was 0), or release the lock.

    This was confusing both to humans and to smatch (which erroneously
    complained that the lock was taken twice).

    Folding __process_buffer() into the caller allows us to simplify the
    control flow, making the resulting function easier to read and reason
    about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150
    bytes (over 4% of the text size).

    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Theodore Ts'o
     

29 Aug, 2014

2 commits

  • It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to a 16-byte block tag size was found to increase journal size
    overhead by a maximum of 0.1% when converting a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     
  • When recovering the journal, don't fall into an infinite loop if we
    encounter a corrupt journal block. Instead, just skip the block and
    return an error, which fails the mount and thus forces the user to run
    a full filesystem fsck.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{,_lock} does not return any specific error code in the
    event of a signal, so the caller must check for non-zero and
    substitute their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

06 Jul, 2014

1 commit


18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Mar, 2014

1 commit


09 Mar, 2014

7 commits


18 Feb, 2014

2 commits

  • Mark functions as static in jbd2/journal.c because they are not used
    outside this file.

    This eliminates the following warning in jbd2/journal.c:
    fs/jbd2/journal.c:125:5: warning: no previous prototype for ‘jbd2_verify_csum_type’ [-Wmissing-prototypes]
    fs/jbd2/journal.c:146:5: warning: no previous prototype for ‘jbd2_superblock_csum_verify’ [-Wmissing-prototypes]
    fs/jbd2/journal.c:154:6: warning: no previous prototype for ‘jbd2_superblock_csum_set’ [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Josh Triplett
    Reviewed-by: Darrick J. Wong

    Rashika Kheria
     
  • If start_this_handle() fails then it leads to a use after free of
    "handle".

    Signed-off-by: Dan Carpenter
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Dan Carpenter
     

09 Dec, 2013

3 commits

  • Rename performed via: perl -pi -e 's/JBD:/JBD2:/g' fs/jbd2/*.c

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Carlos Maiolino

    Dmitry Monakhov
     
  • Some of the KERN_EMERG printk messages do not really deserve this log
    level and the one in log_wait_commit() is even rather useless (the
    journal has been previously aborted and *that* is where we should have
    been complaining). So make some messages just KERN_ERR and remove the
    useless message.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • If a handle runs out of space, we currently stop the kernel with a BUG
    in jbd2_journal_dirty_metadata(). This makes it hard to figure out
    what might be going on. So return an error of ENOSPC, so we can let
    the file system layer figure out what is going on (to make it more
    likely we can get useful debugging information). This should make it
    easier to debug problems such as the one which was reported by:

    https://bugzilla.kernel.org/show_bug.cgi?id=44731

    The only two callers of this function are ext4_handle_dirty_metadata()
    and ocfs2_journal_dirty(). The ocfs2 function will trigger a
    BUG_ON(), which means there will be no change in behavior. The ext4
    function will call ext4_error_inode() which will print the useful
    debugging information and then handle the situation using ext4's error
    handling mechanisms (i.e., which might mean halting the kernel or
    remounting the file system read-only).

    Also, since both file systems already call WARN_ON(), drop the WARN_ON
    from jbd2_journal_dirty_metadata() to avoid two stack traces from
    being displayed.

    Signed-off-by: "Theodore Ts'o"
    Cc: ocfs2-devel@oss.oracle.com
    Acked-by: Joel Becker

    Theodore Ts'o
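
    The change from a BUG to a returned error can be pictured with a
    userspace sketch. The names below (struct toy_handle,
    toy_dirty_metadata(), toy_fs_dirty_metadata()) are hypothetical and
    only model the idea; they are not the jbd2 or ext4 functions, and the
    credit accounting is deliberately simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handle that tracks how many buffers it may still dirty. */
struct toy_handle {
    int buffer_credits;
};

/*
 * Journal-layer stand-in: report exhaustion to the caller instead of
 * stopping the program. Returns 0 on success, -ENOSPC when out of credits.
 */
int toy_dirty_metadata(struct toy_handle *h)
{
    assert(h != NULL);
    if (h->buffer_credits <= 0)
        return -ENOSPC;
    h->buffer_credits--;
    return 0;
}

/*
 * File-system-layer stand-in: the caller decides how to react. Here it
 * only records that an error was seen; a real file system could log
 * details and apply its own error policy.
 */
int toy_fs_dirty_metadata(struct toy_handle *h, int *fs_error_seen)
{
    int err = toy_dirty_metadata(h);

    if (err)
        *fs_error_seen = 1;
    return err;
}
```

    Because the decision moves to the caller, each file system can keep
    whatever policy it had before, as the commit message notes for ocfs2.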
     

29 Aug, 2013

1 commit


01 Jul, 2013

3 commits

  • If jbd2_journal_restart() fails the handle will have been disconnected
    from the current transaction. In this situation, the handle must not
    be used for any jbd2 function other than jbd2_journal_stop().
    Enforce this by treating a handle which has a NULL transaction
    pointer as an aborted handle, and issue a kernel warning if
    jbd2_journal_extend(), jbd2_journal_get_write_access(),
    jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.

    This commit also fixes a bug where jbd2_journal_stop() would trip over
    a kernel jbd2 assertion check when trying to free an invalid handle.

    Also move the responsibility of setting current->journal_info to
    start_this_handle(), simplifying the three users of this function.

    Signed-off-by: "Theodore Ts'o"
    Reported-by: Younger Liu
    Cc: Jan Kara

    Theodore Ts'o
     
  • Once we decrement transaction->t_updates, if this is the last handle
    holding the transaction from closing, and once we release the
    t_handle_lock spinlock, it's possible for the transaction to commit
    and be released. In practice with normal kernels, this probably won't
    happen, since the commit happens in a separate kernel thread and it's
    unlikely this could all happen within the space of a few CPU cycles.

    On the other hand, with a real-time kernel, this could potentially
    happen, so save the tid found in transaction->t_tid before we release
    t_handle_lock. It would require an insane configuration, such as one
    where the jbd2 thread was set to a very high real-time priority,
    perhaps because a high priority real-time thread is trying to read or
    write to a file system. But some people who use real-time kernels
    have been known to do insane things, including controlling
    laser-wielding industrial robots. :-)

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
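
    The pattern in the commit above (copy the field you need while the
    lock still pins the object, then use only the copy) can be sketched
    in userspace as follows. The struct and the flag-based "lock" are
    hypothetical stand-ins chosen so the example stays single-threaded
    and self-checking; they do not reproduce the jbd2 locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transaction; the fields mimic the roles named in the text. */
struct toy_txn {
    bool locked;       /* stand-in for t_handle_lock */
    int updates;       /* stand-in for t_updates */
    unsigned int tid;  /* stand-in for t_tid */
};

static void toy_lock(struct toy_txn *t)
{
    assert(!t->locked);
    t->locked = true;
}

static void toy_unlock(struct toy_txn *t)
{
    assert(t->locked);
    t->locked = false;
}

/*
 * Drop this handle's reference and return the transaction ID.
 * The ID is copied before the lock is released, because afterwards
 * another thread could commit and free the transaction.
 */
unsigned int toy_stop_handle(struct toy_txn *t)
{
    unsigned int tid;

    toy_lock(t);
    t->updates--;
    tid = t->tid;   /* read while the transaction is known to be alive */
    toy_unlock(t);

    /* From here on, use tid and never dereference t again. */
    return tid;
}
```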
     
  • Some of the functions which modify the jbd2 superblock were not
    updating the checksum before calling jbd2_write_superblock(). Move
    the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
    that the checksum is calculated consistently.

    Signed-off-by: "Theodore Ts'o"
    Cc: Darrick J. Wong
    Cc: stable@vger.kernel.org

    Theodore Ts'o
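
    Centralizing the checksum update in the write path, as the commit
    above does, can be sketched like this. The structure layout, the
    placeholder toy_csum() (which is not the checksum jbd2 uses), and the
    in-memory "disk" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced superblock. */
struct toy_sb {
    uint32_t sequence;
    uint32_t start;
    uint32_t checksum;
};

/* Placeholder checksum over the non-checksum fields. */
static uint32_t toy_csum(const struct toy_sb *sb)
{
    return (sb->sequence * 31u) ^ sb->start;
}

/*
 * Single write path: the checksum is refreshed here, so no caller can
 * forget to do it before the superblock goes out.
 */
void toy_write_superblock(struct toy_sb *sb, struct toy_sb *disk)
{
    assert(sb != NULL && disk != NULL);
    sb->checksum = toy_csum(sb);
    memcpy(disk, sb, sizeof(*disk));
}

/* A modifier only changes fields and then calls the write path. */
void toy_update_sequence(struct toy_sb *sb, struct toy_sb *disk, uint32_t seq)
{
    sb->sequence = seq;
    toy_write_superblock(sb, disk);
}
```

    With this shape, adding a new modifier function cannot reintroduce a
    stale checksum, since the only way to write goes through the update.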
     

13 Jun, 2013

4 commits

  • Commit b6e96d0067d8 ("jbd2: use module parameters instead of debugfs
    for jbd_debug") removed any need for a dependency on DEBUG_FS. It
    also moved the /sys variables out from underneath the typical debugfs
    mount point. Delete the dependency and update the /sys path to where
    the debug settings are currently.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • Since the jbd_debug() is implemented with two separate printk()
    calls, it can lead to corrupted and misleading debug output like
    the following (see lines marked with "*"):

    [ 290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
    [ 290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
    [ 290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
    [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
    [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
    [* 290.339382] JBD2: starting commit of transaction 42104
    [ 290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
    [ 290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

    i.e. the debug output from log_wait_commit and journal_commit_transaction
    have become interleaved. The output should have been:

    (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
    (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

    It is expected that this is not easy to replicate -- I was only able
    to cause it on preempt-rt kernels, and even then only under heavy
    I/O load.

    Reported-by: Paul Gortmaker
    Suggested-by: "Theodore Ts'o"
    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
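
    The interleaving described above comes from emitting the location
    prefix and the message body in two separate calls. A userspace
    analogue of the remedy is to build the whole line first and emit it
    once. The helper toy_debug_format() and the macro toy_debug() are
    hypothetical names; this sketch does not reproduce how the kernel
    patch itself was written.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format "(file, line): func: " plus the caller's message into one
 * buffer. Returns the length written, or -1 if the buffer is too small.
 */
int toy_debug_format(char *out, size_t outsz, const char *file, int line,
                     const char *func, const char *fmt, ...)
{
    va_list ap;
    int n, m;

    assert(out != NULL && outsz > 0);
    n = snprintf(out, outsz, "(%s, %d): %s: ", file, line, func);
    if (n < 0 || (size_t)n >= outsz)
        return -1;

    va_start(ap, fmt);
    m = vsnprintf(out + n, outsz - (size_t)n, fmt, ap);
    va_end(ap);
    if (m < 0 || (size_t)m >= outsz - (size_t)n)
        return -1;

    return n + m;
}

/* Emit the fully formatted line with a single output call. */
#define toy_debug(...)                                              \
    do {                                                            \
        char toy_buf[256];                                          \
        if (toy_debug_format(toy_buf, sizeof(toy_buf), __FILE__,    \
                             __LINE__, __func__, __VA_ARGS__) >= 0) \
            fputs(toy_buf, stderr);                                 \
    } while (0)
```

    A call such as toy_debug("want %u\n", 42104u) then produces the
    prefix and the message together, so output from another caller
    cannot land between them.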
     
  • Currently we see this output:

    $git grep phase fs/jbd2
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 1\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 3\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 4\n");
    [...]

    There is clearly a duplicate label for phase 2, and they are
    both active (i.e. not in #if ... #else block). Rename them to
    be "2a" and "2b" so the debug output is unambiguous.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • While trying to debug an issue under extreme I/O loading
    on preempt-rt kernels, the following backtrace was observed
    via SysRQ output:

    rm D ffff8802203afbc0 4600 4878 4748 0x00000000
    ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
    ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
    ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
    Call Trace:
    [] schedule+0x24/0x70
    [] jbd2_log_wait_commit+0xbd/0x140
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2_log_do_checkpoint+0xf5/0x520
    [] __jbd2_log_wait_for_space+0xa9/0x1f0
    [] start_this_handle.isra.10+0x2e0/0x530
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2__journal_start+0xc3/0x110
    [] ? ext4_rmdir+0x6e/0x230
    [] jbd2_journal_start+0xe/0x10
    [] ext4_journal_start_sb+0x5b/0x160
    [] ext4_rmdir+0x6e/0x230
    [] vfs_rmdir+0xd5/0x140
    [] do_rmdir+0xdf/0x120
    [] ? task_work_run+0x44/0x80
    [] ? do_notify_resume+0x89/0x100
    [] ? int_signal+0x12/0x17
    [] sys_unlinkat+0x25/0x40
    [] system_call_fastpath+0x16/0x1b

    What is interesting here, is that we call log_wait_commit, from
    within wait_for_space, but we are still holding the checkpoint_mutex
    as it surrounds mostly the whole of wait_for_space. And then, as we
    are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
    bit is set, then we will also try to take the same checkpoint_mutex.

    It seems that we need to drop the checkpoint_mutex while sitting in
    jbd2_log_wait_commit, if we want to guarantee that progress can be made
    by jbd2_journal_commit_transaction(). There does not seem to be
    anything preempt-rt specific about this, other than perhaps increasing
    the odds of it happening.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker