05 Jan, 2020

1 commit

  • [ Upstream commit 015c6033068208d6227612c878877919f3fcf6b6 ]

    jbd2 statistics counting number of blocks logged in a transaction was
    wrong. It didn't count the commit block and more importantly it didn't
    count revoke descriptor blocks. Make sure these get properly counted.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Jan Kara
     

25 Sep, 2019

1 commit

  • Since ext4/ocfs2 are using jbd2_inode dirty range scoping APIs now,
    jbd2_journal_inode_add_[write|wait] are not used any more, remove them.

    Link: http://lkml.kernel.org/r/1562977611-8412-2-git-send-email-joseph.qi@linux.alibaba.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Ross Zwisler
    Acked-by: Changwei Ge
    Cc: Gang He
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc: Jun Piao
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

25 Aug, 2019

1 commit


12 Aug, 2019

1 commit

  • When executing generic/388 on a ppc64le machine, we notice the following
    call trace,

    VFS: brelse: Trying to free free buffer
    WARNING: CPU: 0 PID: 6637 at /root/repos/linux/fs/buffer.c:1195 __brelse+0x84/0xc0

    Call Trace:
    __brelse+0x80/0xc0 (unreliable)
    invalidate_bh_lru+0x78/0xc0
    on_each_cpu_mask+0xa8/0x130
    on_each_cpu_cond_mask+0x130/0x170
    invalidate_bh_lrus+0x44/0x60
    invalidate_bdev+0x38/0x70
    ext4_put_super+0x294/0x560
    generic_shutdown_super+0xb0/0x170
    kill_block_super+0x38/0xb0
    deactivate_locked_super+0xa4/0xf0
    cleanup_mnt+0x164/0x1d0
    task_work_run+0x110/0x160
    do_notify_resume+0x414/0x460
    ret_from_except_lite+0x70/0x74

    The warning happens because flush_descriptor() drops bh reference it
    does not own. The bh reference acquired by
    jbd2_journal_get_descriptor_buffer() is owned by the log_bufs list and
    gets released when this list is processed. The reference for doing IO is
    only acquired in write_dirty_buffer() later in flush_descriptor().

    Reported-by: Harish Sriram
    Reviewed-by: Jan Kara
    Signed-off-by: Chandan Rajendra
    Signed-off-by: Theodore Ts'o

    Chandan Rajendra
     

21 Jun, 2019

2 commits

  • The journal_sync_buffer() function was never carried over from jbd to
    jbd2. So get rid of the vestigal declaration of this (non-existent)
    function.

    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong

    Theodore Ts'o
     
  • Currently both journal_submit_inode_data_buffers() and
    journal_finish_inode_data_buffers() operate on the entire address space
    of each of the inodes associated with a given journal entry. The
    consequence of this is that if we have an inode where we are constantly
    appending dirty pages we can end up waiting for an indefinite amount of
    time in journal_finish_inode_data_buffers() while we wait for all the
    pages under writeback to be written out.

    The easiest way to cause this type of workload is do just dd from
    /dev/zero to a file until it fills the entire filesystem. This can
    cause journal_finish_inode_data_buffers() to wait for the duration of
    the entire dd operation.

    We can improve this situation by scoping each of the inode dirty ranges
    associated with a given transaction. We do this via the jbd2_inode
    structure so that the scoping is contained within jbd2 and so that it
    follows the lifetime and locking rules for that structure.

    This allows us to limit the writeback & wait in
    journal_submit_inode_data_buffers() and
    journal_finish_inode_data_buffers() respectively to the dirty range for
    a given struct jdb2_inode, keeping us from waiting forever if the inode
    in question is still being appended to.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Ross Zwisler
     

31 May, 2019

2 commits


21 May, 2019

1 commit


11 May, 2019

1 commit

  • When failing from creating cache jbd2_inode_cache, we will destroy the
    previously created cache jbd2_handle_cache twice. This patch fixes
    this by moving each cache initialization/destruction to its own
    separate, individual function.

    Signed-off-by: Chengguang Xu
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Chengguang Xu
     

07 Apr, 2019

2 commits

  • We hit a BUG at fs/buffer.c:3057 if we detached the nbd device
    before unmounting ext4 filesystem.

    The typical chain of events leading to the BUG:
    jbd2_write_superblock
    submit_bh
    submit_bh_wbc
    BUG_ON(!buffer_mapped(bh));

    The block device is removed and all the pages are invalidated. JBD2
    was trying to write journal superblock to the block device which is
    no longer present.

    Fix this by checking the journal superblock's buffer head prior to
    submitting.

    Reported-by: Eric Ren
    Signed-off-by: Jiufei Xue
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org

    Jiufei Xue
     
  • At the beginning, nblocks has been assigned. There is no need
    to repeat the assignment in the while loop, and remove it.

    Signed-off-by: Liu Song
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Liu Song
     

01 Mar, 2019

2 commits

  • In jbd2_get_transaction, a new transaction is initialized,
    and set to the j_running_transaction. No need for a return
    value, so remove it.

    Also, adjust some comments to match the actual operation
    of this function.

    Signed-off-by: Liu Song
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Liu Song
     
  • In jbd2_journal_commit_transaction(), if we are in abort mode,
    we may flush the buffer without setting descriptor block checksum
    by goto start_journal_io. Then fs is mounted,
    jbd2_descriptor_block_csum_verify() failed.

    [ 271.379811] EXT4-fs (vdd): shut down requested (2)
    [ 271.381827] Aborting journal on device vdd-8.
    [ 271.597136] JBD2: Invalid checksum recovering block 22199 in log
    [ 271.598023] JBD2: recovery failed
    [ 271.598484] EXT4-fs (vdd): error loading journal

    Fix this problem by keep setting descriptor block checksum if the
    descriptor buffer is not NULL.

    This checksum problem can be reproduced by xfstests generic/388.

    Signed-off-by: luojiajun
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    luojiajun
     

22 Feb, 2019

1 commit

  • The jh pointer may be used uninitialized in the two cases below and the
    compiler complain about it when enabling JBUFFER_TRACE macro, fix them.

    In file included from fs/jbd2/transaction.c:19:0:
    fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
    ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
    #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
    ^
    fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
    struct journal_head *jh;
    ^
    In file included from fs/jbd2/transaction.c:19:0:
    fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
    ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
    ^
    fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
    struct journal_head *jh;
    ^

    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Reviewed-by: Jan Kara

    zhangyi (F)
     

15 Feb, 2019

2 commits

  • The functions jbd2_superblock_csum_verify() and
    jbd2_superblock_csum_set() only get called from one location, so to
    simplify things, fold them into their callers.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • The jbd2 superblock is lockless now, so there is probably a race
    condition between writing it so disk and modifing contents of it, which
    may lead to checksum error. The following race is the one case that we
    have captured.

    jbd2 fsstress
    jbd2_journal_commit_transaction
    jbd2_journal_update_sb_log_tail
    jbd2_write_superblock
    jbd2_superblock_csum_set jbd2_journal_revoke
    jbd2_journal_set_features(revork)
    modify superblock
    submit_bh(checksum incorrect)

    Fix this by locking the buffer head before modifing it. We always
    write the jbd2 superblock after we modify it, so this just means
    calling the lock_buffer() a little earlier.

    This checksum corruption problem can be reproduced by xfstests
    generic/475.

    Reported-by: zhangyi (F)
    Suggested-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

11 Feb, 2019

2 commits

  • We do not unmap and clear dirty flag when forgetting a buffer without
    journal or does not belongs to any transaction, so the invalid dirty
    data may still be written to the disk later. It's fine if the
    corresponding block is never used before the next mount, and it's also
    fine that we invoke clean_bdev_aliases() related functions to unmap
    the block device mapping when re-allocating such freed block as data
    block. But this logic is somewhat fragile and risky that may lead to
    data corruption if we forget to clean bdev aliases. So, It's better to
    discard dirty data during forget time.

    We have been already handled all the cases of forgetting journalled
    buffer, this patch deal with the remaining two cases.

    - buffer is not journalled yet,
    - buffer is journalled but doesn't belongs to any transaction.

    We invoke __bforget() instead of __brelese() when forgetting an
    un-journalled buffer in jbd2_journal_forget(). After this patch we can
    remove all clean_bdev_aliases() related calls in ext4.

    Suggested-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    zhangyi (F)
     
  • Now, we capture a data corruption problem on ext4 while we're truncating
    an extent index block. Imaging that if we are revoking a buffer which
    has been journaled by the committing transaction, the buffer's jbddirty
    flag will not be cleared in jbd2_journal_forget(), so the commit code
    will set the buffer dirty flag again after refile the buffer.

    fsx kjournald2
    jbd2_journal_commit_transaction
    jbd2_journal_revoke commit phase 1~5...
    jbd2_journal_forget
    belongs to older transaction commit phase 6
    jbddirty not clear __jbd2_journal_refile_buffer
    __jbd2_journal_unfile_buffer
    test_clear_buffer_jbddirty
    mark_buffer_dirty

    Finally, if the freed extent index block was allocated again as data
    block by some other files, it may corrupt the file data after writing
    cached pages later, such as during unmount time. (In general,
    clean_bdev_aliases() related helpers should be invoked after
    re-allocation to prevent the above corruption, but unfortunately we
    missed it when zeroout the head of extra extent blocks in
    ext4_ext_handle_unwritten_extents()).

    This patch mark buffer as freed and set j_next_transaction to the new
    transaction when it already belongs to the committing transaction in
    jbd2_journal_forget(), so that commit code knows it should clear dirty
    bits when it is done with the buffer.

    This problem can be reproduced by xfstests generic/455 easily with
    seeds (3246 3247 3248 3249).

    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    zhangyi (F)
     

01 Feb, 2019

1 commit

  • This issue was found when I tried to put checkpoint work in a separate thread,
    the deadlock below happened:
    Thread1 | Thread2
    __jbd2_log_wait_for_space |
    jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
    if (jh->b_transaction != NULL) |
    ... |
    jbd2_log_start_commit(journal, tid); |jbd2_update_log_tail
    | will lock j_checkpoint_mutex,
    | but will be blocked here.
    |
    jbd2_log_wait_commit(journal, tid); |
    wait_event(journal->j_wait_done_commit, |
    !tid_gt(tid, journal->j_commit_sequence)); |
    ... |wake_up(j_wait_done_commit)
    } |

    then deadlock occurs, Thread1 will never be waken up.

    To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
    when we are going to wait for transaction commit.

    Reviewed-by: Jan Kara
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Theodore Ts'o

    Xiaoguang Wang
     

04 Dec, 2018

2 commits

  • There is a statement that is indented with spaces, replace it with
    a tab.

    Reviewed-by: Jan Kara
    Signed-off-by: Colin Ian King
    Signed-off-by: Theodore Ts'o

    Colin Ian King
     
  • We can hold j_state_lock for writing at the beginning of
    jbd2_journal_commit_transaction() for a rather long time (reportedly for
    30 ms) due cleaning revoke bits of all revoked buffers under it. The
    handling of revoke tables as well as cleaning of t_reserved_list, and
    checkpoint lists does not need j_state_lock for anything. It is only
    needed to prevent new handles from joining the transaction. Generally
    T_LOCKED transaction state prevents new handles from joining the
    transaction - except for reserved handles which have to allowed to join
    while we wait for other handles to complete.

    To prevent reserved handles from joining the transaction while cleaning
    up lists, add new transaction state T_SWITCH and watch for it when
    starting reserved handles. With this we can just drop the lock for
    operations that don't need it.

    Reported-and-tested-by: Adrian Hunter
    Suggested-by: "Theodore Y. Ts'o"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

06 Oct, 2018

1 commit

  • The code cleaning transaction's lists of checkpoint buffers has a bug
    where it increases bh refcount only after releasing
    journal->j_list_lock. Thus the following race is possible:

    CPU0 CPU1
    jbd2_log_do_checkpoint()
    jbd2_journal_try_to_free_buffers()
    __journal_try_to_free_buffer(bh)
    ...
    while (transaction->t_checkpoint_io_list)
    ...
    if (buffer_locked(bh)) {

    spin_unlock(&journal->j_list_lock);
    spin_lock(&journal->j_list_lock);
    __jbd2_journal_remove_checkpoint(jh);
    spin_unlock(&journal->j_list_lock);
    try_to_free_buffers(page);
    get_bh(bh) j_list_lock.

    Fixes: dc6e8d669cf5 ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
    Fixes: be1158cc615f ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
    Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
    CC: stable@vger.kernel.org
    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

30 Jul, 2018

1 commit

  • jbd2 is one of the few callers of current_kernel_time64(), which
    is a wrapper around ktime_get_coarse_real_ts64(). This calls the
    latter directly for consistency with the rest of the kernel that
    is moving to the ktime_get_ family of time accessors.

    Reviewed-by: Andreas Dilger
    Reviewed-by: Jan Kara
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Theodore Ts'o

    Arnd Bergmann
     

09 Jul, 2018

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "Bug fixes for ext4; most of which relate to vulnerabilities where a
    maliciously crafted file system image can result in a kernel OOPS or
    hang.

    At least one fix addresses an inline data bug could be triggered by
    userspace without the need of a crafted file system (although it does
    require that the inline data feature be enabled)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: check superblock mapped prior to committing
    ext4: add more mount time checks of the superblock
    ext4: add more inode number paranoia checks
    ext4: avoid running out of journal credits when appending to an inline file
    jbd2: don't mark block as modified if the handle is out of credits
    ext4: never move the system.data xattr out of the inode body
    ext4: clear i_data in ext4_inode_info when removing inline data
    ext4: include the illegal physical block in the bad map ext4_error msg
    ext4: verify the depth of extent tree in ext4_find_extent()
    ext4: only look at the bg_flags field if it is valid
    ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
    ext4: always check block group bounds in ext4_init_block_bitmap()
    ext4: always verify the magic number in xattr blocks
    ext4: add corruption check in ext4_xattr_set_entry()
    ext4: add warn_on_error mount option

    Linus Torvalds
     

17 Jun, 2018

1 commit

  • Do not set the b_modified flag in block's journal head should not
    until after we're sure that jbd2_journal_dirty_metadat() will not
    abort with an error due to there not being enough space reserved in
    the jbd2 handle.

    Otherwise, future attempts to modify the buffer may lead a large
    number of spurious errors and warnings.

    This addresses CVE-2018-10883.

    https://bugzilla.kernel.org/show_bug.cgi?id=200071

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Theodore Ts'o
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a * b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

21 May, 2018

2 commits

  • The kmem_cache_destroy() function already checks for null pointers, so
    we can remove the check at the call site.

    This patch also sets jbd2_handle_cache and jbd2_inode_cache to be NULL
    after freeing them in jbd2_journal_destroy_handle_cache().

    Signed-off-by: Wang Long
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Wang Long
     
  • See following dmesg output with jbd2 debug enabled:

    ...(start_this_handle, 313): New handle 00000000c88d6ceb going live.

    ...(start_this_handle, 383): Handle 00000000c88d6ceb given 53 credits (total 53, free 32681)

    ...(do_get_write_access, 838): journal_head 0000000002856fc0, force_copy 0

    ...(jbd2_journal_cancel_revoke, 421): journal_head 0000000002856fc0, cancelling revoke

    We have an extra line with every messages, this is a waste of buffer,
    we can fix it by removing "\n" in the caller or remove it in
    the __jbd2_debug(), i checked every jbd2_debug() passed '\n' explicitly.

    To avoid more lines, let's remove it inside __jbd2_debug().

    Signed-off-by: Wang Shilong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Wang Shilong
     

18 Apr, 2018

1 commit

  • If ext4 tries to start a reserved handle via
    jbd2_journal_start_reserved(), and the journal has been aborted, this
    can result in a NULL pointer dereference. This is because the fields
    h_journal and h_transaction in the handle structure share the same
    memory, via a union, so jbd2_journal_start_reserved() will clear
    h_journal before calling start_this_handle(). If this function fails
    due to an aborted handle, h_journal will still be NULL, and the call
    to jbd2_journal_free_reserved() will pass a NULL journal to
    sub_reserve_credits().

    This can be reproduced by running "kvm-xfstests -c dioread_nolock
    generic/475".

    Cc: stable@kernel.org # 3.11
    Fixes: 8f7d89f36829b ("jbd2: transaction reservation support")
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Andreas Dilger
    Reviewed-by: Jan Kara

    Theodore Ts'o
     

20 Feb, 2018

1 commit


19 Feb, 2018

2 commits

  • Previously the jbd2 layer assumed that a file system check would be
    required after a journal abort. In the case of the deliberate file
    system shutdown, this should not be necessary. Allow the jbd2 layer
    to distinguish between these two cases by using the ESHUTDOWN errno.

    Also add proper locking to __journal_abort_soft().

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • There were two error messages emitted by jbd2, one for a bad checksum
    for a jbd2 descriptor block, and one for a bad checksum for a jbd2
    data block. Change the data block checksum error so that the two can
    be disambiguated.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

10 Jan, 2018

1 commit

  • Sphinx emits various (26) warnings when building make target 'htmldocs'.
    Currently struct definitions contain duplicate documentation, some as
    kernel-docs and some as standard c89 comments. We can reduce
    duplication while cleaning up the kernel docs.

    Move all kernel-docs to right above each struct member. Use the set of
    all existing comments (kernel-doc and c89). Add documentation for
    missing struct members and function arguments.

    Signed-off-by: Tobin C. Harding
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Tobin C. Harding
     

18 Dec, 2017

1 commit

  • A number of ext4 source files were skipped due because their copyright
    permission statements didn't match the expected text used by the
    automated conversion utilities. I've added SPDX tags for the rest.

    While looking at some of these files, I've noticed that we have quite
    a bit of variation on the licenses that were used --- in particular
    some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
    and we have some files that have a LGPL-2.1 license (which was quite
    surprising).

    I've not attempted to do any license changes. Even if it is perfectly
    legal to relicense to GPL 2.0-only for consistency's sake, that should
    be done with ext4 developer community discussion.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

18 Nov, 2017

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "Save for a few late fixes, all of these commits have shipped in -next
    releases since before the merge window opened, and 0day has given a
    build success notification.

    The ext4 touches came from Jan, and the xfs touches have Darrick's
    reviewed-by. An xfstest for the MAP_SYNC feature has been through
    a few round of reviews and is on track to be merged.

    - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
    'userspace flush' of persistent memory updates via filesystem-dax
    mappings. It arranges for any filesystem metadata updates that may
    be required to satisfy a write fault to also be flushed ("on disk")
    before the kernel returns to userspace from the fault handler.
    Effectively every write-fault that dirties metadata completes an
    fsync() before returning from the fault handler. The new
    MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
    is validated as supported by the filesystem's ->mmap() file
    operation.

    - Add support for the standard ACPI 6.2 label access methods that
    replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
    This enables interoperability with environments that only implement
    the standardized methods.

    - Add support for the ACPI 6.2 NVDIMM media error injection methods.

    - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
    latch last shutdown status, firmware update, SMART error injection,
    and SMART alarm threshold control.

    - Cleanup physical address information disclosures to be root-only.

    - Fix revalidation of the DIMM "locked label area" status to support
    dynamic unlock of the label area.

    - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
    (system-physical-address) command and error injection commands.

    Acknowledgements that came after the commits were pushed to -next:

    - 957ac8c421ad ("dax: fix PMD faults on zero-length files"):
    Reviewed-by: Ross Zwisler

    - a39e596baa07 ("xfs: support for synchronous DAX faults") and
    7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
    Reviewed-by: Darrick J. Wong "

    * tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
    acpi, nfit: add 'Enable Latch System Shutdown Status' command support
    dax: fix general protection fault in dax_alloc_inode
    dax: fix PMD faults on zero-length files
    dax: stop requiring a live device for dax_flush()
    brd: remove dax support
    dax: quiet bdev_dax_supported()
    fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
    tools/testing/nvdimm: unit test clear-error commands
    acpi, nfit: validate commands against the device type
    tools/testing/nvdimm: stricter bounds checking for error injection commands
    xfs: support for synchronous DAX faults
    xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
    ext4: Support for synchronous DAX faults
    ext4: Simplify error handling in ext4_dax_huge_fault()
    dax: Implement dax_finish_sync_fault()
    dax, iomap: Add support for synchronous faults
    mm: Define MAP_SYNC and VM_SYNC flags
    dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
    dax: Allow dax_iomap_fault() to return pfn
    dax: Fix comment describing dax_iomap_fault()
    ...

    Linus Torvalds
     

03 Nov, 2017

1 commit

  • We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
    prepare blocks for writing and the inode has some uncommitted metadata
    changes. In the fault handler ext4_dax_fault() we then detect this case
    (through VM_FAULT_NEEDDSYNC return value) and call helper
    dax_finish_sync_fault() to flush metadata changes and insert page table
    entry. Note that this will also dirty corresponding radix tree entry
    which is what we want - fsync(2) will still provide data integrity
    guarantees for applications not using userspace flushing. And
    applications using userspace flushing can avoid calling fsync(2) and
    thus avoid the performance overhead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

19 Oct, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

    Signed-off-by: Kees Cook
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Cc: Thomas Gleixner

    Kees Cook
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • Resetting this flag is almost certainly racy, and will be problematic
    with some coming changes.

    Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
    Have jbd2 call it instead of filemap_fdatawait and don't attempt to
    re-set the error flag if it fails.

    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Jeff Layton

    Jeff Layton