18 Oct, 2015

1 commit

  • There is a use-after-free possibility in __ext4_journal_stop() in the
    case that we free the handle in the first jbd2_journal_stop() because
    we're referencing handle->h_err afterwards. This was introduced in
    9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
    storing the handle->h_err value beforehand and avoid referencing
    potentially freed handle.

    Fixes: 9705acd63b125dee8b15c705216d7186daea4625
    Signed-off-by: Lukas Czerner
    Reviewed-by: Andreas Dilger
    Cc: stable@vger.kernel.org

    Lukas Czerner
     

15 May, 2015

1 commit

  • Currently when journal restart fails, we'll have the h_transaction of
    the handle set to NULL to indicate that the handle has been effectively
    aborted. We handle this situation quietly in the jbd2_journal_stop() and just
    free the handle and exit because everything else has been done before we
    attempted (and failed) to restart the journal.

    Unfortunately there are a number of problems with that approach
    introduced with commit

    41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
    fails"

    First of all in ext4 jbd2_journal_stop() will be called through
    __ext4_journal_stop() where we would try to get a hold of the superblock
    by dereferencing h_transaction which in this case would lead to NULL
    pointer dereference and crash.

    In addition we're going to free the handle regardless of the refcount
    which is bad as well, because others up the call chain will still
    reference the handle so we might potentially reference already freed
    memory.

    Moreover it's expected that we'll get aborted handle as well as detached
    handle in some of the journalling function as the error propagates up
    the stack, so it's unnecessary to call WARN_ON every time we get
    detached handle.

    And finally we might leak some memory by forgetting to free reserved
    handle in jbd2_journal_stop() in the case where handle was detached from
    the transaction (h_transaction is NULL).

    Fix the NULL pointer dereference in __ext4_journal_stop() by just
    calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
    the potential memory leak in jbd2_journal_stop() and use proper
    handle refcounting before we attempt to free it to avoid use-after-free
    issues.

    And finally remove all WARN_ON(!transaction) from the code so that we do
    not get random traces when something goes wrong because when journal
    restart fails we will get to some of those functions.

    Cc: stable@vger.kernel.org
    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Lukas Czerner
     

02 Oct, 2014

1 commit


12 May, 2014

1 commit


13 Mar, 2014

1 commit


02 Dec, 2013

1 commit

  • While it's true that errors can only happen if there is a bug in
    jbd2_journal_dirty_metadata(), if a bug does happen, we need to halt
    the kernel or remount the file system read-only in order to avoid
    further data loss. The ext4_journal_abort_handle() function doesn't
    do any of this, and while it's likely that this call (since it doesn't
    adjust refcounts) will likely result in the file system eventually
    deadlocking since the current transaction will never be able to close,
    it's much cleaner to call let ext4's error handling system deal with
    this situation.

    There's a separate bug here which is that if certain jbd2 errors
    errors occur and file system is mounted errors=continue, the file
    system will probably eventually end grind to a halt as described
    above. But things have been this way in a long time, and usually when
    we have these sorts of errors it's pretty much a disaster --- and
    that's why the jbd2 layer aggressively retries memory allocations,
    which is the most likely cause of these jbd2 errors.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

12 Aug, 2013

1 commit

  • When jbd2_journal_dirty_metadata() returns error,
    __ext4_handle_dirty_metadata() stops the handle. However callers of this
    function do not count with that fact and still happily used now freed
    handle. This use after free can result in various issues but very likely
    we oops soon.

    The motivation of adding __ext4_journal_stop() into
    __ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
    improve error reporting. So replace __ext4_journal_stop() with
    ext4_journal_abort_handle() which was there before that commit and add
    WARN_ON_ONCE() to dump stack to provide useful information.

    Reported-by: Sage Weil
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org # 3.2+

    Jan Kara
     

05 Jun, 2013

2 commits

  • Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • In some cases we cannot start a transaction because of locking
    constraints and passing started transaction into those places is not
    handy either because we could block transaction commit for too long.
    Transaction reservation is designed to solve these issues. It
    reserves a handle with given number of credits in the journal and the
    handle can be later attached to the running transaction without
    blocking on commit or checkpointing. Reserved handles do not block
    transaction commit in any way, they only reduce maximum size of the
    running transaction (because we have to always be prepared to
    accomodate request for attaching reserved handle).

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

22 Apr, 2013

1 commit


04 Apr, 2013

1 commit


09 Feb, 2013

2 commits

  • So we can better understand what bits of ext4 are responsible for
    long-running jbd2 handles, use jbd2__journal_start() so we can pass
    context information for logging purposes.

    The recommended way for finding the longer-running handles is:

    T=/sys/kernel/debug/tracing
    EVENT=$T/events/jbd2/jbd2_handle_stats
    echo "interval > 5" > $EVENT/filter
    echo 1 > $EVENT/enable

    ./run-my-fs-benchmark

    cat $T/trace > /tmp/problem-handles

    This will list handles that were active for longer than 20ms. Having
    longer-running handles is bad, because a commit started at the wrong
    time could stall for those 20+ milliseconds, which could delay an
    fsync() or an O_SYNC operation. Here is an example line from the
    trace file describing a handle which lived on for 311 jiffies, or over
    1.2 seconds:

    postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
    tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
    dirtied_blocks 0

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Move the jbd2 wrapper functions which start and stop handles out of
    super.c, where they don't really logically belong, and into
    ext4_jbd2.c.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

10 Oct, 2012

1 commit

  • The function ext4_handle_dirty_super() was calculating the superblock
    on the wrong block data. As a result, when the superblock is modified
    while it is mounted (most commonly, when inodes are added or removed
    from the orphan list), the superblock checksum would be wrong. We
    didn't notice because the superblock *was* being correctly calculated
    in ext4_commit_super(), and this would get called when the file system
    was unmounted. So the problem only became obvious if the system
    crashed while the file system was mounted.

    Fix this by removing the poorly designed function signature for
    ext4_superblock_csum_set(); if it only took a single argument, the
    pointer to a struct superblock, the ambiguity which caused this
    mistake would have been impossible.

    Reported-by: George Spelvin
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

23 Jul, 2012

2 commits

  • The '__ext4_handle_dirty_metadata()' does not need the 'now' argument
    anymore and we can kill it.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Artem Bityutskiy
     
  • This patch changes the 'ext4_handle_dirty_super()' function which
    submits the superblock for I/O in the following cases:

    1. When creating the first large file on a file system without
    EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature.
    2. When re-sizing the file-system.
    3. When creating an xattr on a file-system without the
    EXT4_FEATURE_COMPAT_EXT_ATTR feature.

    If the file-system has journal enabled, the superblock is written via
    the journal. We do not modify this path.

    If the file-system has no journal, this function, falls back to just
    marking the superblock as dirty using the 's_dirt' superblock
    flag. This means that it delays the actual superblock I/O submission
    by 5 seconds (default setting). Namely, the 'sync_supers()' kernel
    thread will call 'ext4_write_super()' later and will actually submit
    the superblock for I/O.

    And this is the behavior this patch modifies: we stop using 's_dirt'
    and just mark the superblock buffer as dirty right away. Indeed, all 3
    cases above are extremely rare and it does not add any value to delay
    the I/O submission for them.

    Note: 'ext4_handle_dirty_super()' executes
    '__ext4_handle_dirty_super()' with 'now = 0'. This patch basically
    makes the 'now' argument unneeded and it will be deleted in one of the
    next patches.

    This patch also removes 's_dirt' condition on the unmount path because
    we never set it anymore, so we should not test it.

    Tested using xfstests for both journalled and non-journalled ext4.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Artem Bityutskiy
     

30 Apr, 2012

1 commit

  • Calculate and verify the superblock checksum. Since the UUID and
    block group number are embedded in each copy of the superblock, we
    need only checksum the entire block. Refactor some of the code to
    eliminate open-coding of the checksum update call.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: "Theodore Ts'o"

    Darrick J. Wong
     

04 Sep, 2011

1 commit

  • Add debugging information in case jbd2_journal_dirty_metadata() is
    called with a buffer_head which didn't have
    jbd2_journal_get_write_access() called on it, or if the journal_head
    has the wrong transaction in it. In addition, return an error code.
    This won't change anything for ocfs2, which will BUG_ON() the non-zero
    exit code.

    For ext4, the caller of this function is ext4_handle_dirty_metadata(),
    and on seeing a non-zero return code, will call __ext4_journal_stop(),
    which will print the function and line number of the (buggy) calling
    function and abort the journal. This will allow us to recover instead
    of bug halting, which is better from a robustness and reliability
    point of view.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

09 May, 2011

1 commit

  • The block allocation code used to use jbd2_journal_get_undo_access as
    a way to make changes that wouldn't show up until the commit took
    place. The new multi-block allocation code has a its own way of
    preventing newly freed blocks from getting reused until the commit
    takes place (it avoids updating the buddy bitmaps until the commit is
    done), so we don't need to use jbd2_journal_get_undo_access(), which
    has extra overhead compared to jbd2_journal_get_write_access().

    There was one last vestigal use of ext4_journal_get_undo_access() in
    ext4_add_groupblocks(); change it to use ext4_journal_get_write_access()
    and then remove the ext4_journal_get_undo_access() support.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

27 Jul, 2010

2 commits


30 Jun, 2010

1 commit


29 Jun, 2010

1 commit


12 Jun, 2010

1 commit

  • We don't need to set s_dirt in most of the ext4 code when journaling
    is enabled. In ext3/4 some of the summary statistics for # of free
    inodes, blocks, and directories are calculated from the per-block
    group statistics when the file system is mounted or unmounted. As a
    result the superblock doesn't have to be updated, either via the
    journal or by setting s_dirt. There are a few exceptions, most
    notably when resizing the file system, where the superblock needs to
    be modified --- and in that case it should be done as a journalled
    operation if possible, and s_dirt set only in no-journal mode.

    This patch will optimize out some unneeded disk writes when using ext4
    with a journal.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

17 Feb, 2010

1 commit

  • Calls to ext4_handle_dirty_metadata should only pass in an inode
    pointer for inode-specific metadata, and not for shared metadata
    blocks such as inode table blocks, block group descriptors, the
    superblock, etc.

    The BUG_ON can get tripped when updating a special device (such as a
    block device) that is opened (so that i_mapping is set in
    fs/block_dev.c) and the file system is mounted in no journal mode.

    Addresses-Google-Bug: #2404870

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     

16 Feb, 2010

1 commit


25 Nov, 2009

1 commit


23 Nov, 2009

2 commits

  • Convert the last two callers of ext4_journal_forget() to use
    ext4_forget() instead, and then fold ext4_journal_forget() into
    ext4_forget(). This reduces are code complexity and shortens our call
    stack.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The ext4_forget() function better belongs in ext4_jbd2.c. This will
    allow us to do some cleanup of the ext4_journal_revoke() and
    ext4_journal_forget() functions, as well as giving us better error
    reporting since we can report the caller of ext4_forget() when things
    go wrong.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

13 Sep, 2009

1 commit


10 Sep, 2009

1 commit

  • When ext4 is using a journal, a metadata block which is deallocated
    must be passed into the journal layer so it can be dropped from the
    current transaction and/or revoked. This is done by calling the
    functions ext4_journal_forget() and ext4_journal_revoke(), which call
    jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.

    Since the jbd2_journal_forget() and jbd2_journal_revoke() call
    bforget(), if ext4 is not using a journal, ext4_journal_forget() and
    ext4_journal_revoke() must call bforget() to avoid a dirty metadata
    block overwriting a block after it has been reallocated and reused for
    another inode's data block.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

13 Jul, 2009

1 commit

  • We found a problem with buffer head reference leaks when using an ext4
    partition without a journal. In particular, calls to ext4_forget() would
    not to a brelse() on the input buffer head, which will cause pages they
    belong to to not be reclaimable.

    Further investigation showed that all places where ext4_journal_forget() and
    ext4_journal_revoke() are called are subject to the same problem. The patch
    below changes __ext4_journal_forget/__ext4_journal_revoke to do an explicit
    release of the buffer head when the journal handle isn't valid.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     

07 Jan, 2009

1 commit

  • A few weeks ago I posted a patch for discussion that allowed ext4 to run
    without a journal. Since that time I've integrated the excellent
    comments from Andreas and fixed several serious bugs. We're currently
    running with this patch and generating some performance numbers against
    both ext2 (with backported reservations code) and ext4 with and without
    a journal. It just so happens that running without a journal is
    slightly faster for most everything.

    We did
    iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

    which creates 4 threads, each of which create and do reads and writes on
    a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
    to bypass the page cache. Results:

    ext2 ext4, default ext4, no journal
    initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s
    rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s
    reads 15.2 MB/s 16.9 MB/s 17.2 MB/s
    re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s
    random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s
    random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s

    So it seems that, so far, this was a useful exercise.

    Signed-off-by: Frank Mayhar
    Signed-off-by: "Theodore Ts'o"

    Frank Mayhar
     

30 Apr, 2008

1 commit


17 Apr, 2008

1 commit


08 Dec, 2006

1 commit