30 Jun, 2016

2 commits

  • So far we were tracking only dependency on transaction commit due to
    starting a new handle (which may require commit to start a new
    transaction). Now add tracking also for other cases where we wait for
    transaction commit. This way lockdep can catch deadlocks, e.g. because we
    call jbd2_journal_stop() for a synchronous handle with some locks held
    which rank below transaction start.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently lockdep map is tracked in each journal handle. To be able to
    expand lockdep support to cover also other cases where we depend on
    transaction commit and where handle is not available, move lockdep map
    into struct journal_s. Since this makes the lockdep map shared for all
    handles, we have to use rwsem_acquire_read() for acquisitions now.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

06 May, 2016

1 commit


24 Apr, 2016

1 commit

  • Currently when filesystem needs to make sure data is on permanent
    storage before committing a transaction it adds inode to transaction's
    inode list. During transaction commit, jbd2 writes back all dirty
    buffers that have allocated underlying blocks and waits for the IO to
    finish. However when doing writeback for delayed allocated data, we
    allocate blocks and immediately submit the data. Thus asking jbd2 to
    write dirty pages just unnecessarily adds more work to jbd2 possibly
    writing back other redirtied blocks.

    Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
    outstanding data writes before committing a transaction and thus avoid
    unnecessary writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

23 Feb, 2016

3 commits


19 Oct, 2015

1 commit

  • If an ext4 filesystem uses JBD2 journaling and an error occurs, the
    journal is aborted first, the error number is recorded in the JBD2
    superblock and, finally, the system panics if the "errors=panic"
    option is set. In rare cases, however, this sequence is reordered as
    in the figure below: the system panics (which means a system reset
    in mobile environments) before the error has been recorded in the
    journal superblock. In this case, e2fsck cannot recognize that a
    filesystem failure occurred in the previous run, and the corruption
    is not fixed.

    Task A                        Task B
    ext4_handle_error()
    -> jbd2_journal_abort()
      -> __journal_abort_soft()
        -> __jbd2_journal_abort_hard()
        | -> journal->j_flags |= JBD2_ABORT;
        |
        |                         __ext4_abort()
        |                         -> jbd2_journal_abort()
        |                         | -> __journal_abort_soft()
        |                         |   -> if (journal->j_flags & JBD2_ABORT)
        |                         |           return;
        |                         -> panic()
        |
        -> jbd2_journal_update_sb_errno()

    Tested-by: Hobin Woo
    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Daeho Jeong
     

18 Oct, 2015

2 commits


15 Oct, 2015

1 commit

  • Change the journal's checksum functions to gate on whether or not the
    crc32c driver is loaded, and gate the loading on the superblock bits.
    This prevents a journal crash if someone loads a journal in no-csum
    mode and then randomizes the superblock, thus flipping on the feature
    bits.

    Tested-By: Nikolay Borisov
    Reported-by: Nikolay Borisov
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     

04 Sep, 2015

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Pretty much all bug fixes and clean ups for 4.3, after a lot of
    features and other churn going into 4.2"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    Revert "ext4: remove block_device_ejected"
    ext4: ratelimit the file system mounted message
    ext4: silence a format string false positive
    ext4: simplify some code in read_mmp_block()
    ext4: don't manipulate recovery flag when freezing no-journal fs
    jbd2: limit number of reserved credits
    ext4 crypto: remove duplicate header file
    ext4: update c/mtime on truncate up
    jbd2: avoid infinite loop when destroying aborted journal
    ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
    ext4 crypto: fix spelling typo in comment
    ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
    ext4: reject journal options for ext2 mounts
    ext4: implement cgroup writeback support
    ext4: replace ext4_io_submit->io_op with ->io_wbc
    ext4 crypto: check for too-short encrypted file names
    ext4 crypto: use a jbd2 transaction when adding a crypto policy
    jbd2: speedup jbd2_journal_dirty_metadata()

    Linus Torvalds
     

29 Jul, 2015

1 commit

  • Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes logic in
    jbd2_log_do_checkpoint() bail out which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    progress in cleaning the journal. Without it jbd2_journal_destroy()
    just loops forever.

    Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
    jbd2_log_do_checkpoint() fails with an error.

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
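
The loop fix described in this commit can be modelled in user space. This is a minimal sketch under stated assumptions: `struct journal`, `do_checkpoint()`, `destroy_checkpoint()` and `destroy_journal()` are hypothetical stand-ins, not the kernel functions, and locking is omitted entirely.

```c
#include <assert.h>
#include <errno.h>

/* Toy journal: only the fields needed for the sketch (all hypothetical). */
struct journal {
    int aborted;              /* set once the journal has been aborted */
    int pending_checkpoints;  /* transactions still on the checkpoint list */
};

/* Checkpoint one transaction; an aborted journal fails without progress. */
static int do_checkpoint(struct journal *j)
{
    if (j->aborted)
        return -EIO;
    if (j->pending_checkpoints > 0)
        j->pending_checkpoints--;
    return 0;
}

/* Drop whatever is left on the checkpoint list without writing it out. */
static void destroy_checkpoint(struct journal *j)
{
    j->pending_checkpoints = 0;
}

/* Returns the number of loop iterations, to make termination observable. */
static int destroy_journal(struct journal *j)
{
    int iterations = 0;

    while (j->pending_checkpoints > 0) {
        iterations++;
        if (do_checkpoint(j) != 0) {
            /* No progress is possible: clean up and stop looping. */
            destroy_checkpoint(j);
            break;
        }
    }
    return iterations;
}

/* Illustrative check: both a healthy and an aborted journal terminate. */
void destroy_demo(void)
{
    struct journal healthy = { 0, 3 };
    struct journal aborted = { 1, 3 };

    assert(destroy_journal(&healthy) == 3 && healthy.pending_checkpoints == 0);
    assert(destroy_journal(&aborted) == 1 && aborted.pending_checkpoints == 0);
}
```

Without the error branch, the aborted journal in this model would never leave the `while` loop, which is the behaviour the commit message reports.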
     

24 Jul, 2015

1 commit

  • The functionality of ext3 is fully supported by the ext4 driver. Major
    distributions (SUSE, Red Hat) have been using the ext4 driver to handle
    ext3 filesystems for quite some time. There is some ugliness in mm
    resulting from jbd cleaning buffers in a dirty page without cleaning the
    page dirty bit, and support for buffer bouncing in the block layer when
    stable pages are required is there only because of jbd. So let's remove
    the ext3 driver. This saves us some 28k lines of duplicated code.

    Acked-by: Theodore Ts'o
    Signed-off-by: Jan Kara

    Jan Kara
     

16 Jun, 2015

1 commit

  • If updating the journal superblock fails after the journal data has
    been flushed, the error is dropped and the caller is misled into
    treating this as the normal case. In ocfs2, the checkpoint is then
    treated as successful and another node can take the lock and update.
    Since sb_start still points to the old log block, the other node
    rewrites the journal data during journal recovery. The new updates
    are thus overwritten and ocfs2 is corrupted. So in the above case we
    have to return the error; ocfs2_commit_cache will handle the error
    and prevent the other node from updating first. Only after
    recovering the journal can it make new updates.

    The issue discussion mail can be found at:
    https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
    http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

    [ Fixed bug in patch which allowed a non-negative error return from
    jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
    was causing xfstests ext4/306 to fail. -- Ted ]

    Reported-by: Yiwen Jiang
    Signed-off-by: Joseph Qi
    Signed-off-by: Theodore Ts'o
    Tested-by: Yiwen Jiang
    Cc: Junxiao Bi
    Cc: stable@vger.kernel.org

    Joseph Qi
     

15 Jan, 2015

1 commit


18 Sep, 2014

1 commit


29 Aug, 2014

1 commit

  • It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to a 16-byte block size was found to increase journal size
    overhead by a maximum of 0.1%, to convert a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     

01 Jul, 2013

1 commit

  • If jbd2_journal_restart() fails the handle will have been disconnected
    from the current transaction. In this situation, the handle must not
    be used for any jbd2 function other than jbd2_journal_stop().
    Enforce this by treating a handle which has a NULL transaction
    pointer as an aborted handle, and issue a kernel warning if
    jbd2_journal_extend(), jbd2_journal_get_write_access(),
    jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.

    This commit also fixes a bug where jbd2_journal_stop() would trip over
    a kernel jbd2 assertion check when trying to free an invalid handle.

    Also move the responsibility of setting current->journal_info to
    start_this_handle(), simplifying the three users of this function.

    Signed-off-by: "Theodore Ts'o"
    Reported-by: Younger Liu
    Cc: Jan Kara

    Theodore Ts'o
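
The rule from this commit (a handle without a transaction counts as aborted, and only stopping it remains legal) can be sketched as follows. The structures and the functions `handle_is_aborted()`, `dirty_metadata()` and `stop_handle()` are simplified, hypothetical stand-ins. The `-EROFS` return value is an assumption of the sketch, and the kernel warning mentioned in the message is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; names are illustrative. */
struct journal     { int aborted; };
struct transaction { struct journal *journal; };
struct handle      { struct transaction *transaction; int aborted; };

/* A handle with no transaction attached is treated like an aborted one. */
static int handle_is_aborted(const struct handle *h)
{
    if (h->aborted || h->transaction == NULL)
        return 1;
    return h->transaction->journal->aborted;
}

/* Hypothetical journal operation: refuse to touch an invalid handle. */
static int dirty_metadata(struct handle *h)
{
    if (handle_is_aborted(h))
        return -EROFS; /* assumed error code for this sketch */
    /* ... the real work would go here ... */
    return 0;
}

/* Stopping a handle must stay safe even after a failed restart. */
static int stop_handle(struct handle *h)
{
    if (h->transaction == NULL)
        return 0; /* nothing to detach; just release the handle */
    h->transaction = NULL;
    return 0;
}

/* Illustrative check of the valid and the disconnected case. */
void handle_demo(void)
{
    struct journal j = { 0 };
    struct transaction t = { &j };
    struct handle ok = { &t, 0 };
    struct handle disconnected = { NULL, 0 }; /* as after a failed restart */

    assert(dirty_metadata(&ok) == 0);
    assert(dirty_metadata(&disconnected) == -EROFS);
    assert(stop_handle(&disconnected) == 0);
}
```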
     

13 Jun, 2013

4 commits

  • Since the jbd_debug() is implemented with two separate printk()
    calls, it can lead to corrupted and misleading debug output like
    the following (see lines marked with "*"):

    [ 290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
    [ 290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
    [ 290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
    [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
    [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
    [* 290.339382] JBD2: starting commit of transaction 42104
    [ 290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
    [ 290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

    i.e. the debug output from log_wait_commit and journal_commit_transaction
    have become interleaved. The output should have been:

    (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
    (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

    It is expected that this is not easy to replicate -- I was only able
    to cause it on preempt-rt kernels, and even then only under heavy
    I/O load.

    Reported-by: Paul Gortmaker
    Suggested-by: "Theodore Ts'o"
    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The bit_spinlock functions are only used for the jbd_lock_bh_state
    functions (and friends) in jbd_common.h and are not directly used
    by either of jbd.h or jbd2.h content.

    The jbd_common file is new as of commit 446066724c36 ("jbd/jbd2: factor
    out common functions from the jbd[2] header files") but common
    (and isolated) headers were not considered for factoring at that time.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • Inode's data or non-journaled quota may be written without the journal,
    so we _must_ send a barrier at the end of ext4_sync_fs. But it can be
    skipped if the journal commit will do it for us.

    Also fix data integrity for nojournal mode.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • The current implementation of jbd2_journal_force_commit() is suboptimal
    because it results in empty and useless commits. But callers just want
    to force and wait for any unfinished commits. We already have
    jbd2_journal_force_commit_nested(), which does exactly what we want,
    except that we are guaranteed not to hold a journal transaction open.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
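
For the jbd_debug() interleaving described in the first commit of this group, one general remedy is to build the whole line first and emit it with a single output call. The commit message quoted here does not show the actual kernel change, so the following is only a user-space sketch of that idea. `format_debug_line()` and `debug_demo()` are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "(file, line): func: message" in one buffer so
 * that the whole line can be handed to the output routine in a single call.
 * Returns a negative value on a formatting error; otherwise a length that is
 * >= size when the line did not fit into the buffer.
 */
static int format_debug_line(char *buf, size_t size, const char *file,
                             int line, const char *func, const char *fmt, ...)
{
    va_list args;
    int prefix, body;

    prefix = snprintf(buf, size, "(%s, %d): %s: ", file, line, func);
    if (prefix < 0)
        return prefix;
    if ((size_t)prefix >= size)
        return prefix; /* the prefix alone was truncated */

    va_start(args, fmt);
    body = vsnprintf(buf + prefix, size - (size_t)prefix, fmt, args);
    va_end(args);

    return body < 0 ? body : prefix + body;
}

/* Illustrative use: one fputs() per message instead of two separate prints. */
void debug_demo(void)
{
    char buf[256];
    int n = format_debug_line(buf, sizeof(buf), "fs/jbd2/journal.c", 648,
                              "jbd2_log_wait_commit",
                              "JBD2: want %u, j_commit_sequence=%u\n",
                              42104u, 42103u);

    assert(n > 0 && (size_t)n < sizeof(buf));
    fputs(buf, stderr);
}
```

Because prefix and message reach the output as one string, another thread's message cannot land between them in this model.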
     

05 Jun, 2013

6 commits

  • In some cases we cannot start a transaction because of locking
    constraints and passing started transaction into those places is not
    handy either because we could block transaction commit for too long.
    Transaction reservation is designed to solve these issues. It
    reserves a handle with given number of credits in the journal and the
    handle can be later attached to the running transaction without
    blocking on commit or checkpointing. Reserved handles do not block
    transaction commit in any way, they only reduce maximum size of the
    running transaction (because we have to always be prepared to
    accommodate a request to attach a reserved handle).

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • j_wait_logspace and j_wait_checkpoint are unused. Remove them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • __jbd2_log_space_left() and jbd_space_needed() were kind of odd.
    jbd_space_needed() accounted also credits needed for currently
    committing transaction while it didn't account for credits needed for
    control blocks. __jbd2_log_space_left() then accounted for control
    blocks as a fraction of free space. Since results of these two
    functions are always only compared against each other, this works
    correctly but is somewhat strange. Move the estimates so that
    jbd_space_needed() returns number of blocks needed for a transaction
    including control blocks and __jbd2_log_space_left() returns free
    space in the journal (with the committing transaction already
    subtracted). Rename functions to jbd2_log_space_left() and
    jbd2_space_needed() while we are changing them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently when we add a buffer to a transaction, we wait until the
    buffer is removed from BJ_Shadow list (so that we prevent any changes
    to the buffer that is just written to the journal). This can take
    unnecessarily long as a lot happens between the time the buffer is
    submitted to the journal and the time when we remove the buffer from
    BJ_Shadow list. (e.g. We wait for all data buffers in the
    transaction, we issue a cache flush, etc.) Also this creates a
    dependency of do_get_write_access() on transaction commit (namely
    waiting for data IO to complete) which we want to avoid when
    implementing transaction reservation.

    So we modify commit code to set new BH_Shadow flag when temporary
    shadowing buffer is created and we clear that flag once IO on that
    buffer is complete. This allows do_get_write_access() to wait only
    for BH_Shadow bit and thus removes the dependency on data IO
    completion.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Similarly as for metadata buffers, also log descriptor buffers don't
    really need the journal head. So strip it and remove BJ_LogCtl list.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • When writing metadata to the journal, we create temporary buffer heads
    for that task. We also attach journal heads to these buffer heads but
    the only purpose of the journal heads is to keep buffers linked in
    transaction's BJ_IO list. We remove the need for journal heads by
    reusing buffer_head's b_assoc_buffers list for that purpose. Also
    since BJ_IO list is just a temporary list for transaction commit, we
    use a private list in jbd2_journal_commit_transaction() for that thus
    removing BJ_IO list from transaction completely.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

22 May, 2013

1 commit

  • invalidatepage now accepts a range to invalidate, and two file
    systems using jbd2 also implement the punch hole feature, which can
    benefit from this. We need to implement the same thing for the jbd2
    layer to allow those file systems to take advantage of this
    functionality.

    This commit adds length argument to the jbd2_journal_invalidatepage()
    and updates all instances in ext4 and ocfs2.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jan Kara

    Lukas Czerner
     

20 Apr, 2013

1 commit


04 Apr, 2013

2 commits

  • The following race is possible:

    [kjournald2]                              other_task
    jbd2_journal_commit_transaction()
      j_state = T_FINISHED;
      spin_unlock(&journal->j_list_lock);
                                              ->jbd2_journal_remove_checkpoint()
                                                ->jbd2_journal_free_transaction();
                                                  ->kmem_cache_free(transaction)
      ->j_commit_callback(journal, transaction);
        -> USE_AFTER_FREE

    WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
    Hardware name:
    list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
    Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
    Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107
    Call Trace:
    [] warn_slowpath_common+0xad/0xf0
    [] warn_slowpath_fmt+0x46/0x50
    [] ? ext4_journal_commit_callback+0x99/0xc0
    [] __list_del_entry+0x1c0/0x250
    [] ext4_journal_commit_callback+0x6f/0xc0
    [] jbd2_journal_commit_transaction+0x23a6/0x2570
    [] ? try_to_del_timer_sync+0x82/0xa0
    [] ? del_timer_sync+0x91/0x1e0
    [] kjournald2+0x19f/0x6a0
    [] ? wake_up_bit+0x40/0x40
    [] ? bit_spin_lock+0x80/0x80
    [] kthread+0x10e/0x120
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x70/0x70

    In order to demonstrate this issue one should mount ext4 with the
    mount -o discard option on an SSD. This makes the callback take
    longer and widens the race window.

    In order to fix this we should mark the transaction as finished only
    after the callbacks have completed.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Dmitry Monakhov
     
  • In the case where an inode has a very stale transaction id (tid) in
    i_datasync_tid or i_sync_tid, it's possible that after a very large
    (2**31) number of transactions the tid number space might wrap,
    causing tid_geq()'s calculations to fail.

    Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
    by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
    attempted to fix this problem, but it only avoided kjournald spinning
    forever by fixing the logic in jbd2_log_start_commit().

    Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
    that might call jbd2_log_start_commit() with a stale tid, those
    functions will subsequently call jbd2_log_wait_commit() with the same
    stale tid, and then wait for a very long time. To fix this, we
    replace the calls to jbd2_log_start_commit() and
    jbd2_log_wait_commit() with a call to a new function,
    jbd2_complete_transaction(), which will correctly handle stale tid's.

    As a bonus, jbd2_complete_transaction() will avoid locking
    j_state_lock for writing unless a commit needs to be started. This
    should have a small (but probably not measurable) improvement for
    ext4's scalability.

    Signed-off-by: "Theodore Ts'o"
    Reported-by: Ben Hutchings
    Reported-by: George Barnett
    Cc: stable@vger.kernel.org

    Theodore Ts'o
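
The wraparound described in this commit follows from how sequence numbers are compared. Below is a minimal user-space sketch, assuming that a tid is an unsigned 32-bit counter and that the comparison interprets the 32-bit difference as a signed value. `tid_demo()` and its sample values are illustrative and not taken from the commit. The cast to `int32_t` relies on the usual two's-complement behaviour of common compilers.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tid_t; /* assumed: transaction IDs are 32-bit sequence numbers */

/* "x is newer than y", judged by the sign of the 32-bit difference. */
static int tid_gt(tid_t x, tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference > 0;
}

/* "x is at least as new as y". */
static int tid_geq(tid_t x, tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference >= 0;
}

/* Illustrative checks; the values are made up for the example. */
void tid_demo(void)
{
    tid_t stale = 100;
    tid_t current = stale + 0x80000001u;

    /* Ordinary case: 5 is newer than 3. */
    assert(tid_gt(5, 3) && !tid_gt(3, 5));

    /* Short wrap: 2 was issued after 0xFFFFFFFE and still compares newer. */
    assert(tid_gt(2, 0xFFFFFFFEu));

    /* Stale tid: once the journal is more than 2**31 transactions ahead,
     * the old tid 100 wrongly compares as newer than the current one. */
    assert(tid_gt(stale, current));
    assert(!tid_geq(current, stale));
}
```

The last two assertions show why waiting on a stale tid can take a very long time: the comparison treats it as a transaction that has not been committed yet.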
     

10 Feb, 2013

1 commit

  • There are multiple reasons to move away from debugfs. First of all,
    we are only using it for a single parameter, and it is much more
    complicated to set up (some 30 lines of code compared to 3), and one
    more thing that might fail while loading the jbd2 module.

    Secondly, as a module parameter it can be specified as a boot option if
    jbd2 is built into the kernel, or as a parameter when the module is
    loaded, and it can also be manipulated dynamically under
    /sys/module/jbd2/parameters/jbd2_debug. So it is more flexible.

    Ultimately we want to move away from using jbd_debug() towards
    tracepoints, but for now this is still a useful simplification of the
    code base.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

09 Feb, 2013

1 commit


07 Feb, 2013

2 commits

  • This reverts commit 93737456d68ddcb86232f669b83da673dd12e351.

    The cow-snapshots effort is no longer active, so remove these extra
    fields to shrink down the handle structure.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Theodore Ts'o
     
  • Track the delay between when we first request that the commit begin
    and when it actually begins, so we can see how much of a gap exists.
    In theory, this should just be the remaining scheduling quantum of
    the thread which requested the commit (assuming it was not a
    synchronous operation which triggered the commit request) plus
    scheduling overhead; however, it's possible that real time processes
    might keep the kjournald thread from executing.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

26 Dec, 2012

1 commit

  • We cannot wait for transaction commit in journal_unmap_buffer()
    because we hold page lock which ranks below transaction start. We
    solve the issue by bailing out of journal_unmap_buffer() and
    jbd2_journal_invalidatepage() with -EBUSY. Caller is then responsible
    for waiting for transaction commit to finish and try invalidation
    again. Since the issue can happen only for a page straddling i_size, it
    is simple enough to manually call jbd2_journal_invalidatepage() for
    such page from ext4_setattr(), check the return value and wait if
    necessary.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
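
The caller-side pattern this commit describes (invalidate, get -EBUSY, wait for the commit, try again) can be sketched as below. Everything here is a hypothetical user-space model: `struct journal`, `try_invalidate_page()`, `wait_for_commit()` and `invalidate_with_retry()` are illustrative names, and no real page or lock is involved.

```c
#include <assert.h>
#include <errno.h>

/* Toy journal state for the sketch; all names are hypothetical. */
struct journal {
    int committing; /* non-zero while a transaction commit is in flight */
    int waits;      /* how many times a caller waited for a commit */
};

/* Invalidation refuses to block: it reports -EBUSY during a commit. */
static int try_invalidate_page(struct journal *j)
{
    return j->committing ? -EBUSY : 0;
}

/* Stand-in for waiting until the running commit has finished. */
static void wait_for_commit(struct journal *j)
{
    j->committing = 0;
    j->waits++;
}

/* Caller side: wait outside the invalidation path, then try again. */
static int invalidate_with_retry(struct journal *j)
{
    int ret;

    while ((ret = try_invalidate_page(j)) == -EBUSY)
        wait_for_commit(j);
    return ret;
}

/* Illustrative check: one wait when a commit is running, none otherwise. */
void retry_demo(void)
{
    struct journal busy = { 1, 0 };
    struct journal idle = { 0, 0 };

    assert(invalidate_with_retry(&busy) == 0 && busy.waits == 1);
    assert(invalidate_with_retry(&idle) == 0 && idle.waits == 0);
}
```

Keeping the wait in the caller means the invalidation function itself never sleeps on the commit, which is the constraint the commit message gives.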
     

09 Nov, 2012

2 commits

  • The use of variable length arrays in structs (VLAIS) in the Linux Kernel code
    precludes the use of compilers which don't implement VLAIS (for instance the
    Clang compiler). Since ctx is always a 32-bit CRC, hard coding a size of 4
    bytes accomplishes the same thing without the use of VLAIS. This is the same
    technique already employed in fs/ext4/ext4.h.

    Signed-off-by: Mark Charlebois
    Signed-off-by: Behan Webster
    Signed-off-by: "Theodore Ts'o"

    Behan Webster
     
  • ext4_handle_release_buffer() was intended to remove journal
    write access from a buffer, but it doesn't actually do anything
    at all other than add a BUFFER_TRACE point, and it's not reliably
    used even for that. Remove all the associated dead code.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Carlos Maiolino

    Eric Sandeen
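
For the VLAIS commit in this group, the technique of replacing a runtime-sized trailing array with a fixed 4-byte one can be illustrated in plain C. This is a sketch only: `struct hash_desc`, `CHECKSUM_CTX_SIZE` and `ctx_roundtrip()` are hypothetical stand-ins and do not reproduce the kernel's descriptor type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHECKSUM_CTX_SIZE 4 /* a 32-bit CRC state always fits in 4 bytes */

/* Hypothetical stand-in for the hash descriptor that precedes the context. */
struct hash_desc {
    unsigned int flags;
};

/* Round-trips a 32-bit seed through a descriptor with a fixed-size context. */
static uint32_t ctx_roundtrip(uint32_t seed)
{
    struct {
        struct hash_desc desc;
        char ctx[CHECKSUM_CTX_SIZE]; /* fixed size, not char ctx[runtime_size] */
    } d;
    uint32_t out;

    d.desc.flags = 0;
    memcpy(d.ctx, &seed, sizeof(seed));
    memcpy(&out, d.ctx, sizeof(out));
    return out;
}

/* Illustrative check that the fixed-size context holds a full 32-bit value. */
void vlais_demo(void)
{
    assert(sizeof(uint32_t) == CHECKSUM_CTX_SIZE);
    assert(ctx_roundtrip(0xdeadbeefu) == 0xdeadbeefu);
}
```

Since the size is a compile-time constant, the struct is valid standard C and does not depend on the variable-length-array-in-struct extension.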