24 Apr, 2016

1 commit

  • Currently when filesystem needs to make sure data is on permanent
    storage before committing a transaction it adds inode to transaction's
    inode list. During transaction commit, jbd2 writes back all dirty
    buffers that have allocated underlying blocks and waits for the IO to
    finish. However when doing writeback for delayed allocated data, we
    allocate blocks and immediately submit the data. Thus asking jbd2 to
    write dirty pages just unnecessarily adds more work to jbd2 possibly
    writing back other redirtied blocks.

    Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
    outstanding data writes before committing a transaction and thus avoid
    unnecessary writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Feb, 2016

4 commits


18 Oct, 2015

1 commit


29 Jul, 2015

1 commit

  • Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes logic in
    jbd2_log_do_checkpoint() bail out which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    a progress in cleaning the journal. Without it jbd2_journal_destroy()
    just loops in an infinite loop.

    Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of
    jbd2_log_do_checkpoint() fails with error.

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

29 Aug, 2014

1 commit

  • It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to a 16-byte block size was found to increase journal size
    overhead by a maximum of 0.1%, to convert a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Mar, 2014

3 commits

  • We don't otherwise need j_list_lock during the rest of commit phase
    #7, so add the transaction to the checkpoint list at the very end of
    commit phase #6. This allows us to drop j_list_lock earlier, which is
    a good thing since it is a super hot lock.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The two hottest locks, and thus the biggest scalability bottlenecks,
    in the jbd2 layer, are the j_list_lock and j_state_lock. This has
    inspired some people to do some truly unnatural things[1].

    [1] https://www.usenix.org/system/files/conference/fast14/fast14-paper_kang.pdf

    We don't need to be holding both j_state_lock and j_list_lock while
    calculating the journal statistics, so move those calculations to the
    very end of jbd2_journal_commit_transaction.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • During commit process, keep the block device plugged after we are done
    writing the revoke records, until we are finished writing the rest of
    the commit records in the journal. This will allow most of the
    journal blocks to be written in a single I/O operation, instead of
    separating the the revoke blocks from the rest of the journal blocks.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

29 Aug, 2013

1 commit


13 Jun, 2013

2 commits

  • Currently we see this output:

    $git grep phase fs/jbd2
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 1\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 3\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 4\n");
    [...]

    There is clearly a duplicate label for phase 2, and they are
    both active (i.e. not in #if ... #else block). Rename them to
    be "2a" and "2b" so the debug output is unambiguous.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The state lock is taken after we are doing an assert on the state
    value, not before. So we might in fact be doing an assert on a
    transient value. Ensure the state check is within the scope of
    the state lock being taken.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     

05 Jun, 2013

4 commits

  • In some cases we cannot start a transaction because of locking
    constraints and passing started transaction into those places is not
    handy either because we could block transaction commit for too long.
    Transaction reservation is designed to solve these issues. It
    reserves a handle with given number of credits in the journal and the
    handle can be later attached to the running transaction without
    blocking on commit or checkpointing. Reserved handles do not block
    transaction commit in any way, they only reduce maximum size of the
    running transaction (because we have to always be prepared to
    accomodate request for attaching reserved handle).

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently when we add a buffer to a transaction, we wait until the
    buffer is removed from BJ_Shadow list (so that we prevent any changes
    to the buffer that is just written to the journal). This can take
    unnecessarily long as a lot happens between the time the buffer is
    submitted to the journal and the time when we remove the buffer from
    BJ_Shadow list. (e.g. We wait for all data buffers in the
    transaction, we issue a cache flush, etc.) Also this creates a
    dependency of do_get_write_access() on transaction commit (namely
    waiting for data IO to complete) which we want to avoid when
    implementing transaction reservation.

    So we modify commit code to set new BH_Shadow flag when temporary
    shadowing buffer is created and we clear that flag once IO on that
    buffer is complete. This allows do_get_write_access() to wait only
    for BH_Shadow bit and thus removes the dependency on data IO
    completion.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Similarly as for metadata buffers, also log descriptor buffers don't
    really need the journal head. So strip it and remove BJ_LogCtl list.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • When writing metadata to the journal, we create temporary buffer heads
    for that task. We also attach journal heads to these buffer heads but
    the only purpose of the journal heads is to keep buffers linked in
    transaction's BJ_IO list. We remove the need for journal heads by
    reusing buffer_head's b_assoc_buffers list for that purpose. Also
    since BJ_IO list is just a temporary list for transaction commit, we
    use a private list in jbd2_journal_commit_transaction() for that thus
    removing BJ_IO list from transaction completely.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

28 May, 2013

1 commit

  • Al Viro complained of a ton of bogosity with regards to the jbd2 block
    tag header checksum. This one checksum is 16 bits, so cut off the
    upper 16 bits and treat it as a 16-bit value and don't mess around
    with be32* conversions. Fortunately metadata checksumming is still
    "experimental" and not in a shipping e2fsprogs, so there should be few
    users affected by this.

    Reported-by: Al Viro
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

04 Apr, 2013

1 commit

  • The following race is possible:

    [kjournald2] other_task
    jbd2_journal_commit_transaction()
    j_state = T_FINISHED;
    spin_unlock(&journal->j_list_lock);
    ->jbd2_journal_remove_checkpoint()
    ->jbd2_journal_free_transaction();
    ->kmem_cache_free(transaction)
    ->j_commit_callback(journal, transaction);
    -> USE_AFTER_FREE

    WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
    Hardware name:
    list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
    Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
    Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107
    Call Trace:
    [] warn_slowpath_common+0xad/0xf0
    [] warn_slowpath_fmt+0x46/0x50
    [] ? ext4_journal_commit_callback+0x99/0xc0
    [] __list_del_entry+0x1c0/0x250
    [] ext4_journal_commit_callback+0x6f/0xc0
    [] jbd2_journal_commit_transaction+0x23a6/0x2570
    [] ? try_to_del_timer_sync+0x82/0xa0
    [] ? del_timer_sync+0x91/0x1e0
    [] kjournald2+0x19f/0x6a0
    [] ? wake_up_bit+0x40/0x40
    [] ? bit_spin_lock+0x80/0x80
    [] kthread+0x10e/0x120
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x70/0x70

    In order to demonstrace this issue one should mount ext4 with mount -o
    discard option on SSD disk. This makes callback longer and race
    window becomes wider.

    In order to fix this we should mark transaction as finished only after
    callbacks have completed

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Dmitry Monakhov
     

07 Feb, 2013

1 commit

  • Track the delay between when we first request that the commit begin
    and when it actually begins, so we can see how much of a gap exists.
    In theory, this should just be the remaining scheduling quantuum of
    the thread which requested the commit (assuming it was not a
    synchronous operation which triggered the commit request) plus
    scheduling overhead; however, it's possible that real time processes
    might get in the way of letting the kjournald thread from executing.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

27 Sep, 2012

1 commit

  • ext4 users of data=journal mode with blocksize < pagesize were
    occasionally hitting assertion failure in
    jbd2_journal_commit_transaction() checking whether the transaction has
    at least as many credits reserved as buffers attached. The core of the
    problem is that when a file gets truncated, buffers that still need
    checkpointing or that are attached to the committing transaction are
    left with buffer_mapped set. When this happens to buffers beyond i_size
    attached to a page stradding i_size, subsequent write extending the file
    will see these buffers and as they are mapped (but underlying blocks
    were freed) things go awry from here.

    The assertion failure just coincidentally (and in this case luckily as
    we would start corrupting filesystem) triggers due to journal_head not
    being properly cleaned up as well.

    We fix the problem by unmapping buffers if possible (in lots of cases we
    just need a buffer attached to a transaction as a place holder but it
    must not be written out anyway). And in one case, we just have to bite
    the bullet and wait for transaction commit to finish.

    CC: Josef Bacik
    Signed-off-by: Jan Kara

    Jan Kara
     

23 Jul, 2012

1 commit


27 May, 2012

3 commits


23 May, 2012

1 commit


24 Apr, 2012

1 commit

  • flush request is issued in transaction commit code path, so looks using
    GFP_KERNEL to allocate memory for flush request bio falls into the classic
    deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are
    using GFP.

    Signed-off-by: Shaohua Li
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Shaohua Li
     

29 Mar, 2012

3 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuituous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it..

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Remove all #inclusions of asm/system.h preparatory to splitting and killing
    it. Performed with the following command:

    perl -p -i -e 's!^#\s*include\s*.*\n!!' `grep -Irl '^#\s*include\s*' *`

    Signed-off-by: David Howells

    David Howells
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hopped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

20 Mar, 2012

1 commit


14 Mar, 2012

4 commits

  • Normally, we have to issue a cache flush before we can update journal tail in
    journal superblock, effectively wiping out old transactions from the journal.
    So use the fact that during transaction commit we issue cache flush anyway and
    opportunistically push journal tail as far as we can. Since update of journal
    superblock is still costly (we have to use WRITE_FUA), we update log tail only
    if we can free significant amount of space.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
    checkpointed buffers are on a stable storage - especially if buffers were
    written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
    caches. Thus when we update journal superblock effectively removing old
    transaction from journal, this write of superblock can get to stable storage
    before those checkpointed buffers which can result in filesystem corruption
    after a crash. Thus we must unconditionally issue a cache flush before we
    update journal superblock in these cases.

    A similar problem can also occur if journal superblock is written only in
    disk's caches, other transaction starts reusing space of the transaction
    cleaned from the log and power failure happens. Subsequent journal replay would
    still try to replay the old transaction but some of it's blocks may be already
    overwritten by the new transaction. For this reason we must use WRITE_FUA when
    updating log tail and we must first write new log tail to disk and update
    in-memory information only after that.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There are some log tail updates that are not protected by j_checkpoint_mutex.
    Some of these are harmless because they happen during startup or shutdown but
    updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
    really race with other log tail updates (e.g. someone doing
    jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
    protect all log tail updates with j_checkpoint_mutex.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There are three case of updating journal superblock. In the first case, we want
    to mark journal as empty (setting s_sequence to 0), in the second case we want
    to update log tail, in the third case we want to update s_errno. Split these
    cases into separate functions. It makes the code slightly more straightforward
    and later patches will make the distinction even more important.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

21 Feb, 2012

1 commit

  • There is normally only a handful of these active at any one time, but
    putting them in a separate slab cache makes debugging memory
    corruption problems easier. Manish Katiyar also wanted this make it
    easier to test memory failure scenarios in the jbd2 layer.

    Cc: Manish Katiyar
    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     

29 Dec, 2011

1 commit

  • Currently, we clear revoked flag only when a block is reused. However,
    this can tigger a false journal error. Consider a situation when a block
    is used as a meta block and is deleted(revoked) in ordered mode, then the
    block is allocated as a data block to a file. At this moment, user changes
    the file's journal mode from ordered to journaled and truncates the file.
    The block will be considered re-revoked by journal because it has revoked
    flag still pending from the last transaction and an assertion triggers.

    We fix the problem by keeping the revoked status more uptodate - we clear
    revoked flag when switching revoke tables to reflect there is no revoked
    buffers in current transaction any more.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang