29 Aug, 2014

2 commits

  • It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to a 16-byte tag size was found to increase journal size
    overhead by a maximum of 0.1% when converting a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
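
As a rough illustration of the idea in the commit above, the tag size can be chosen purely from feature flags rather than inferred from a structure size. This is a minimal userspace sketch, not the jbd2 code: the flag values, struct names, and field layouts are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch; not the real jbd2 flag values. */
#define SKETCH_FEATURE_64BIT   0x1u
#define SKETCH_FEATURE_CSUM_V3 0x2u

/* Assumed older tag layout: 16-bit checksum, high block bits last. */
struct sketch_tag {
	uint32_t t_blocknr;
	uint16_t t_checksum;
	uint16_t t_flags;
	uint32_t t_blocknr_high;
};

/* Assumed v3-style layout: room for a full 32-bit checksum, fixed size. */
struct sketch_tag3 {
	uint32_t t_blocknr;
	uint32_t t_flags;
	uint32_t t_blocknr_high;
	uint32_t t_checksum;
};

/* 64bitness comes from the feature flags, never from a tag size. */
int sketch_is_64bit(uint32_t features)
{
	return (features & SKETCH_FEATURE_64BIT) != 0;
}

/* Return the on-disk tag size implied by the feature flags. */
size_t sketch_tag_bytes(uint32_t features)
{
	if (features & SKETCH_FEATURE_CSUM_V3)
		return sizeof(struct sketch_tag3);
	if (sketch_is_64bit(features))
		return sizeof(struct sketch_tag);
	/* 32-bit journals do not store the high block number. */
	return sizeof(struct sketch_tag) - sizeof(uint32_t);
}
```

With these assumed layouts, a v3 tag is always 16 bytes, while the older tag is 8 or 12 bytes depending on the 64bit flag.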
     
  • When recovering the journal, don't fall into an infinite loop if we
    encounter a corrupt journal block. Instead, just skip the block and
    return an error, which fails the mount and thus forces the user to run
    a full filesystem fsck.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
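
The skip-and-report behaviour described above can be sketched as a loop that always advances past a bad block and remembers the failure. This is a simplified userspace model, not the jbd2 recovery code: `sketch_block`, `sketch_replay`, and the choice of `-EIO` as the error are assumptions for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-block state: non-zero means the checksum verified. */
struct sketch_block {
	int csum_ok;
};

/*
 * Visit every block exactly once. A corrupt block is skipped and the
 * failure is remembered, so the loop cannot spin on the same block.
 * Returns 0 on success or -EIO if any block was corrupt; *replayed
 * receives the number of good blocks.
 */
int sketch_replay(const struct sketch_block *blocks, size_t nblocks,
		  size_t *replayed)
{
	int err = 0;
	size_t i;

	*replayed = 0;
	for (i = 0; i < nblocks; i++) {
		if (!blocks[i].csum_ok) {
			err = -EIO;	/* remember the failure, keep going */
			continue;
		}
		(*replayed)++;
	}
	return err;
}
```

A caller that sees a non-zero return would fail the mount, which is the point at which the user has to run fsck.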
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{,_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    substitute their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all; something higher in the stack will be
    shown instead. So the distinction will still be visible, only with
    different function names (gfs2_glock_wait versus gfs2_glock_dq_wait
    in the gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
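
The calling convention described above (a mode argument decides whether a signal ends the wait, and the caller supplies its own error code on a non-zero return) can be modelled roughly as follows. This is a userspace mock with no real sleeping: `sketch_wait_on_bit`, the `sketch_mode` values, the `signal_pending` parameter, and the choice of `-EINTR` are all assumptions for the example, not the kernel API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's task states. */
enum sketch_mode { SKETCH_UNINTERRUPTIBLE, SKETCH_INTERRUPTIBLE };

/*
 * Mock of a wait_on_bit()-style helper. Returns 0 once the bit is
 * clear, or a non-specific non-zero value if the wait was cut short
 * by a signal in an interruptible mode.
 */
int sketch_wait_on_bit(unsigned long *word, int bit, enum sketch_mode mode,
		       int signal_pending)
{
	unsigned long mask = 1UL << bit;

	if (!(*word & mask))
		return 0;
	if (mode == SKETCH_INTERRUPTIBLE && signal_pending)
		return 1;	/* caller chooses the real error code */
	*word &= ~mask;		/* pretend the owner cleared the bit */
	return 0;
}

/* Caller pattern: test for non-zero and substitute an error code. */
int sketch_caller(unsigned long *word, enum sketch_mode mode,
		  int signal_pending)
{
	if (sketch_wait_on_bit(word, 0, mode, signal_pending))
		return -EINTR;
	return 0;
}
```

The same pending signal is ignored when the mode is uninterruptible, which is the decision the commit moves out of the action function and into the mode argument.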
     

06 Jul, 2014

1 commit


18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Mar, 2014

1 commit


09 Mar, 2014

7 commits


18 Feb, 2014

2 commits

  • Mark functions as static in jbd2/journal.c because they are not used
    outside this file.

    This eliminates the following warnings in jbd2/journal.c:
    fs/jbd2/journal.c:125:5: warning: no previous prototype for ‘jbd2_verify_csum_type’ [-Wmissing-prototypes]
    fs/jbd2/journal.c:146:5: warning: no previous prototype for ‘jbd2_superblock_csum_verify’ [-Wmissing-prototypes]
    fs/jbd2/journal.c:154:6: warning: no previous prototype for ‘jbd2_superblock_csum_set’ [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Josh Triplett
    Reviewed-by: Darrick J. Wong

    Rashika Kheria
     
  • If start_this_handle() fails then it leads to a use after free of
    "handle".

    Signed-off-by: Dan Carpenter
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Dan Carpenter
     

09 Dec, 2013

3 commits

  • Rename performed via: perl -pi -e 's/JBD:/JBD2:/g' fs/jbd2/*.c

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Carlos Maiolino

    Dmitry Monakhov
     
  • Some of the KERN_EMERG printk messages do not really deserve this log
    level and the one in log_wait_commit() is even rather useless (the
    journal has been previously aborted and *that* is where we should have
    been complaining). So make some messages just KERN_ERR and remove the
    useless message.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • If a handle runs out of space, we currently stop the kernel with a BUG
    in jbd2_journal_dirty_metadata(). This makes it hard to figure out
    what might be going on. So return an error of ENOSPC instead and let
    the file system layer figure out what is going on, which makes it more
    likely that we can get useful debugging information. This should make
    it easier to debug problems such as the one which was reported by:

    https://bugzilla.kernel.org/show_bug.cgi?id=44731

    The only two callers of this function are ext4_handle_dirty_metadata()
    and ocfs2_journal_dirty(). The ocfs2 function will trigger a
    BUG_ON(), which means there will be no change in behavior. The ext4
    function will call ext4_error_inode() which will print the useful
    debugging information and then handle the situation using ext4's error
    handling mechanisms (i.e., which might mean halting the kernel or
    remounting the file system read-only).

    Also, since both file systems already call WARN_ON(), drop the WARN_ON
    from jbd2_journal_dirty_metadata() to avoid two stack traces from
    being displayed.

    Signed-off-by: "Theodore Ts'o"
    Cc: ocfs2-devel@oss.oracle.com
    Acked-by: Joel Becker

    Theodore Ts'o
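
The change from a hard stop to an error return can be sketched like this. It is a minimal userspace model under stated assumptions: `sketch_handle` and its single `buffer_credits` field are hypothetical, and the real function does far more than count credits.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical handle with a simple credit counter. */
struct sketch_handle {
	int buffer_credits;
};

/*
 * Consume one credit per dirtied buffer. When the handle has run out,
 * report -ENOSPC to the caller instead of stopping the program, so the
 * file system layer can log context and apply its own error policy.
 */
int sketch_dirty_metadata(struct sketch_handle *handle)
{
	if (handle->buffer_credits <= 0)
		return -ENOSPC;
	handle->buffer_credits--;
	return 0;
}
```

A caller that still wants the old behaviour can assert on the return value, while a caller with richer error handling can print diagnostics first.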
     

29 Aug, 2013

1 commit


01 Jul, 2013

3 commits

  • If jbd2_journal_restart() fails the handle will have been disconnected
    from the current transaction. In this situation, the handle must not
    be used for any jbd2 function other than jbd2_journal_stop().
    Enforce this by treating a handle which has a NULL transaction
    pointer as an aborted handle, and issue a kernel warning if
    jbd2_journal_extend(), jbd2_journal_get_write_access(),
    jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.

    This commit also fixes a bug where jbd2_journal_stop() would trip over
    a kernel jbd2 assertion check when trying to free an invalid handle.

    Also move the responsibility of setting current->journal_info to
    start_this_handle(), simplifying the three users of this function.

    Signed-off-by: "Theodore Ts'o"
    Reported-by: Younger Liu
    Cc: Jan Kara

    Theodore Ts'o
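
The rule that a handle with a NULL transaction pointer counts as aborted can be sketched as a single predicate that every entry point checks first. The struct and function names below are hypothetical, and returning `-EROFS` for an aborted handle is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced transaction and handle types. */
struct sketch_transaction {
	int t_tid;
};

struct sketch_handle {
	struct sketch_transaction *h_transaction;
	int h_aborted;
};

/* A disconnected handle (NULL transaction) is treated as aborted. */
int sketch_handle_is_aborted(const struct sketch_handle *handle)
{
	return handle->h_aborted || handle->h_transaction == NULL;
}

/* Example entry point: refuse to work with an invalid handle. */
int sketch_get_write_access(const struct sketch_handle *handle)
{
	if (sketch_handle_is_aborted(handle))
		return -EROFS;
	return 0;
}
```

Centralising the check means a failed restart cannot lead to a NULL dereference in any of the functions that use it.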
     
  • Once we decrement transaction->t_updates, if this is the last handle
    holding the transaction from closing, and once we release the
    t_handle_lock spinlock, it's possible for the transaction to commit
    and be released. In practice with normal kernels, this probably won't
    happen, since the commit happens in a separate kernel thread and it's
    unlikely this could all happen within the space of a few CPU cycles.

    On the other hand, with a real-time kernel, this could potentially
    happen, so save the tid found in transaction->t_tid before we release
    t_handle_lock. It would require an insane configuration, such as one
    where the jbd2 thread was set to a very high real-time priority,
    perhaps because a high priority real-time thread is trying to read or
    write to a file system. But some people who use real-time kernels
    have been known to do insane things, including controlling
    laser-wielding industrial robots. :-)

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • Some of the functions which modify the jbd2 superblock were not
    updating the checksum before calling jbd2_write_superblock(). Move
    the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
    that the checksum is calculated consistently.

    Signed-off-by: "Theodore Ts'o"
    Cc: Darrick J. Wong
    Cc: stable@vger.kernel.org

    Theodore Ts'o
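
Moving the checksum update into the one write path means callers cannot forget it. The sketch below illustrates that structure only: the superblock fields are reduced to three, the "disk" is just another struct in memory, and the toy checksum is an assumption that stands in for the real algorithm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced superblock. */
struct sketch_sb {
	uint32_t s_sequence;
	uint32_t s_start;
	uint32_t s_checksum;
};

/* Toy checksum over the data fields; NOT the algorithm jbd2 uses. */
uint32_t sketch_sb_csum(const struct sketch_sb *sb)
{
	return (sb->s_sequence * 31u) ^ sb->s_start;
}

/* The only write path: the checksum is always refreshed here. */
void sketch_write_sb(struct sketch_sb *sb, struct sketch_sb *disk)
{
	sb->s_checksum = sketch_sb_csum(sb);
	memcpy(disk, sb, sizeof(*disk));
}

/* Returns non-zero when the stored checksum matches the contents. */
int sketch_sb_verify(const struct sketch_sb *disk)
{
	return disk->s_checksum == sketch_sb_csum(disk);
}
```

Any function that modifies a field and then calls the write helper gets a consistent checksum without extra code.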
     

13 Jun, 2013

6 commits

  • Commit b6e96d0067d8 ("jbd2: use module parameters instead of debugfs
    for jbd_debug") removed any need for a dependency on DEBUG_FS. It
    also moved the /sys variables out from underneath the typical debugfs
    mount point. Delete the dependency and update the /sys path to where
    the debug settings are currently.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • Since the jbd_debug() is implemented with two separate printk()
    calls, it can lead to corrupted and misleading debug output like
    the following (see lines marked with "*"):

    [ 290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
    [ 290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
    [ 290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
    [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
    [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
    [* 290.339382] JBD2: starting commit of transaction 42104
    [ 290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
    [ 290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

    i.e. the debug output from log_wait_commit and journal_commit_transaction
    have become interleaved. The output should have been:

    (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
    (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

    It is expected that this is not easy to replicate -- I was only able
    to cause it on preempt-rt kernels, and even then only under heavy
    I/O load.

    Reported-by: Paul Gortmaker
    Suggested-by: "Theodore Ts'o"
    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
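
One way to avoid interleaving is to build the prefix and the message into a single buffer and emit it with one output call. The following is a userspace sketch of that idea using standard C formatting functions; it is not the kernel fix, and `sketch_format_debug` and `sketch_emit` are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "(file, line): func: " followed by the caller's message into
 * one buffer. Returns the length of the resulting string, or the
 * snprintf() result if the prefix alone did not fit.
 */
int sketch_format_debug(char *buf, size_t len, const char *file, int line,
			const char *func, const char *fmt, ...)
{
	va_list ap;
	int n = snprintf(buf, len, "(%s, %d): %s: ", file, line, func);

	if (n < 0 || (size_t)n >= len)
		return n;
	va_start(ap, fmt);
	vsnprintf(buf + n, len - (size_t)n, fmt, ap);
	va_end(ap);
	return (int)strlen(buf);
}

/* Emit the whole line with a single call so it cannot be split. */
void sketch_emit(const char *line)
{
	fputs(line, stderr);
}
```

Formatting the expected line from the commit message with this helper and passing the result to `sketch_emit` produces the prefix and message as one unit.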
     
  • Currently we see this output:

    $ git grep phase fs/jbd2
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 1\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 3\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 4\n");
    [...]

    There is clearly a duplicate label for phase 2, and they are
    both active (i.e. not in #if ... #else block). Rename them to
    be "2a" and "2b" so the debug output is unambiguous.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • While trying to debug an issue under extreme I/O loading
    on preempt-rt kernels, the following backtrace was observed
    via SysRQ output:

    rm D ffff8802203afbc0 4600 4878 4748 0x00000000
    ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
    ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
    ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
    Call Trace:
    [] schedule+0x24/0x70
    [] jbd2_log_wait_commit+0xbd/0x140
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2_log_do_checkpoint+0xf5/0x520
    [] __jbd2_log_wait_for_space+0xa9/0x1f0
    [] start_this_handle.isra.10+0x2e0/0x530
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2__journal_start+0xc3/0x110
    [] ? ext4_rmdir+0x6e/0x230
    [] jbd2_journal_start+0xe/0x10
    [] ext4_journal_start_sb+0x5b/0x160
    [] ext4_rmdir+0x6e/0x230
    [] vfs_rmdir+0xd5/0x140
    [] do_rmdir+0xdf/0x120
    [] ? task_work_run+0x44/0x80
    [] ? do_notify_resume+0x89/0x100
    [] ? int_signal+0x12/0x17
    [] sys_unlinkat+0x25/0x40
    [] system_call_fastpath+0x16/0x1b

    What is interesting here, is that we call log_wait_commit, from
    within wait_for_space, but we are still holding the checkpoint_mutex
    as it surrounds mostly the whole of wait_for_space. And then, as we
    are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
    bit is set, then we will also try to take the same checkpoint_mutex.

    It seems that we need to drop the checkpoint_mutex while sitting in
    jbd2_log_wait_commit, if we want to guarantee that progress can be made
    by jbd2_journal_commit_transaction(). There does not seem to be
    anything preempt-rt specific about this, other than perhaps increasing
    the odds of it happening.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The state lock is taken after we are doing an assert on the state
    value, not before. So we might in fact be doing an assert on a
    transient value. Ensure the state check is within the scope of
    the state lock being taken.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The current implementation of jbd2_journal_force_commit() is suboptimal
    because it results in empty and useless commits, whereas callers just want
    to force and wait for any unfinished commits. We already have
    jbd2_journal_force_commit_nested(), which does exactly what we want, except
    that here we are guaranteed not to hold a journal transaction open.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

05 Jun, 2013

8 commits

  • In some cases we cannot start a transaction because of locking
    constraints and passing started transaction into those places is not
    handy either because we could block transaction commit for too long.
    Transaction reservation is designed to solve these issues. It
    reserves a handle with given number of credits in the journal and the
    handle can be later attached to the running transaction without
    blocking on commit or checkpointing. Reserved handles do not block
    transaction commit in any way, they only reduce maximum size of the
    running transaction (because we have to always be prepared to
    accommodate a request for attaching a reserved handle).

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • j_wait_logspace and j_wait_checkpoint are unused. Remove them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • jbd2_journal_extend() first checked whether transaction can accept
    extending handle with more credits and then added credits to
    t_outstanding_credits. This can race with start_this_handle() adding
    another handle to a transaction and thus overbooking a transaction.
    Make jbd2_journal_extend() use atomic_add_return() to close the race.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
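
The race fix described above replaces "check, then add" with a single atomic add whose result is checked afterwards. Here is a minimal sketch using C11 atomics in place of the kernel's atomic_add_return(); the function name, the `max_credits` parameter, and the return convention (0 on success, 1 if the extension does not fit) are assumptions of the example.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Try to reserve nblocks more credits. The add and the read of the new
 * total happen in one atomic step, so two concurrent callers cannot
 * both pass a stale check. On overflow the reservation is rolled back.
 */
int sketch_extend(atomic_int *outstanding, int max_credits, int nblocks)
{
	int wanted = atomic_fetch_add(outstanding, nblocks) + nblocks;

	if (wanted > max_credits) {
		atomic_fetch_sub(outstanding, nblocks);
		return 1;	/* does not fit; caller must restart */
	}
	return 0;
}
```

The counter may briefly exceed the limit between the add and the rollback, but no caller is ever told it succeeded when the total would overbook the transaction.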
     
  • __jbd2_log_space_left() and jbd_space_needed() were kind of odd.
    jbd_space_needed() accounted also credits needed for currently
    committing transaction while it didn't account for credits needed for
    control blocks. __jbd2_log_space_left() then accounted for control
    blocks as a fraction of free space. Since results of these two
    functions are always only compared against each other, this works
    correctly but is somewhat strange. Move the estimates so that
    jbd_space_needed() returns number of blocks needed for a transaction
    including control blocks and __jbd2_log_space_left() returns free
    space in the journal (with the committing transaction already
    subtracted). Rename functions to jbd2_log_space_left() and
    jbd2_space_needed() while we are changing them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • The comment about credit estimates isn't true anymore. We do what the
    comment describes now.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently when we add a buffer to a transaction, we wait until the
    buffer is removed from BJ_Shadow list (so that we prevent any changes
    to the buffer that is just written to the journal). This can take
    unnecessarily long as a lot happens between the time the buffer is
    submitted to the journal and the time when we remove the buffer from
    BJ_Shadow list. (e.g. We wait for all data buffers in the
    transaction, we issue a cache flush, etc.) Also this creates a
    dependency of do_get_write_access() on transaction commit (namely
    waiting for data IO to complete) which we want to avoid when
    implementing transaction reservation.

    So we modify commit code to set new BH_Shadow flag when temporary
    shadowing buffer is created and we clear that flag once IO on that
    buffer is complete. This allows do_get_write_access() to wait only
    for BH_Shadow bit and thus removes the dependency on data IO
    completion.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • As with metadata buffers, log descriptor buffers don't really need
    the journal head either. So strip it and remove the BJ_LogCtl list.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • When writing metadata to the journal, we create temporary buffer heads
    for that task. We also attach journal heads to these buffer heads but
    the only purpose of the journal heads is to keep buffers linked in
    transaction's BJ_IO list. We remove the need for journal heads by
    reusing buffer_head's b_assoc_buffers list for that purpose. Also
    since BJ_IO list is just a temporary list for transaction commit, we
    use a private list in jbd2_journal_commit_transaction() for that thus
    removing BJ_IO list from transaction completely.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

28 May, 2013

2 commits


22 May, 2013

1 commit

  • invalidatepage now accepts a range to invalidate, and two file
    systems using jbd2 also implement the punch hole feature, which can
    benefit from this. We need to implement the same thing in the jbd2 layer
    to allow those file systems to take advantage of this functionality.

    This commit adds length argument to the jbd2_journal_invalidatepage()
    and updates all instances in ext4 and ocfs2.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jan Kara

    Lukas Czerner
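
A range-based invalidation has to decide, per buffer, whether the buffer lies entirely inside the requested byte range. The sketch below models only that decision in userspace: a page is treated as a run of equally sized buffers, and `sketch_invalidate_range` with its parameters is a hypothetical helper, not the jbd2 function.

```c
#include <assert.h>

/*
 * Mark the buffers of a page that lie fully inside
 * [offset, offset + length) and return how many were marked.
 * invalidated[] must have page_size / buf_size entries.
 */
unsigned int sketch_invalidate_range(unsigned int page_size,
				     unsigned int buf_size,
				     unsigned int offset,
				     unsigned int length,
				     unsigned char *invalidated)
{
	unsigned int stop = offset + length;
	unsigned int curr, idx = 0, count = 0;

	for (curr = 0; curr + buf_size <= page_size; curr += buf_size, idx++) {
		unsigned int next = curr + buf_size;

		if (next > stop)
			break;		/* buffer ends past the range */
		if (offset <= curr) {	/* buffer starts inside the range */
			invalidated[idx] = 1;
			count++;
		}
	}
	return count;
}
```

Buffers that only partially overlap the range are left alone, which is what lets a hole punch in the middle of a page keep the data on either side.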
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds