01 Mar, 2020

1 commit

  • journal_head::b_transaction and journal_head::b_next_transaction could
    be accessed concurrently as noticed by KCSAN,

    LTP: starting fsync04
    /dev/zero: Can't open blockdev
    EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
    EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
    ==================================================================
    BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

    write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
    __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
    __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
    jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
    (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
    kjournald2+0x13b/0x450 [jbd2]
    kthread+0x1cd/0x1f0
    ret_from_fork+0x27/0x50

    read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
    jbd2_write_access_granted+0x1b2/0x250 [jbd2]
    jbd2_write_access_granted at fs/jbd2/transaction.c:1155
    jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
    __ext4_journal_get_write_access+0x50/0x90 [ext4]
    ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
    ext4_mb_new_blocks+0x54f/0xca0 [ext4]
    ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
    ext4_map_blocks+0x3b4/0x950 [ext4]
    _ext4_get_block+0xfc/0x270 [ext4]
    ext4_get_block+0x3b/0x50 [ext4]
    __block_write_begin_int+0x22e/0xae0
    __block_write_begin+0x39/0x50
    ext4_write_begin+0x388/0xb50 [ext4]
    generic_perform_write+0x15d/0x290
    ext4_buffered_write_iter+0x11f/0x210 [ext4]
    ext4_file_write_iter+0xce/0x9e0 [ext4]
    new_sync_write+0x29c/0x3b0
    __vfs_write+0x92/0xa0
    vfs_write+0x103/0x260
    ksys_write+0x9d/0x130
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    5 locks held by fsync04/25724:
    #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
    #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
    #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
    #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
    #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
    irq event stamp: 1407125
    hardirqs last enabled at (1407125): [] __find_get_block+0x107/0x790
    hardirqs last disabled at (1407124): [] __find_get_block+0x49/0x790
    softirqs last enabled at (1405528): [] __do_softirq+0x34c/0x57c
    softirqs last disabled at (1405521): [] irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    The plain reads are outside the jh->b_state_lock critical section, which
    results in data races. Fix them by adding READ_ONCE()/WRITE_ONCE() pairs.

    Reviewed-by: Jan Kara
    Signed-off-by: Qian Cai
    Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
    Signed-off-by: Theodore Ts'o

    Qian Cai
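    A minimal userspace sketch of the READ_ONCE()/WRITE_ONCE() pairing
    described above. The macros are local approximations of the kernel ones
    (they rely on the GCC/Clang __typeof__ extension), and txn_model,
    jh_model, refile_model() and belongs_to_model() are hypothetical
    stand-ins, not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximations of the kernel macros: a volatile access makes
 * the compiler emit exactly one load or store of the whole object. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-ins for transaction_t and struct journal_head. */
struct txn_model { int tid; };

struct jh_model {
	struct txn_model *b_transaction;
	struct txn_model *b_next_transaction;
};

/* Writer side: models the refile path, which runs under the state lock. */
static void refile_model(struct jh_model *jh)
{
	WRITE_ONCE(jh->b_transaction, jh->b_next_transaction);
	WRITE_ONCE(jh->b_next_transaction, NULL);
}

/* Lockless reader: models a check made without the state lock held. */
static int belongs_to_model(struct jh_model *jh, struct txn_model *t)
{
	return READ_ONCE(jh->b_transaction) == t ||
	       READ_ONCE(jh->b_next_transaction) == t;
}
```

    The marked accesses do not add ordering; they only stop the compiler
    from tearing or repeating the pointer loads and stores that race.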
     

22 Feb, 2020

1 commit

  • I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
    The running environment:
    kernel version: 4.19
    A cluster with two nodes, 5 LUNs mounted on both nodes, running file
    operations such as dd/fallocate/truncate/rm on every LUN while the
    storage network is disconnected.

    The fallocate operation on dm-23-45 caused a NULL pointer dereference.

    The NULL pointer dereference was reported as follows:
    [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
    [577992.878290] Aborting journal on device dm-23-45.
    ...
    [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
    [577992.890908] __journal_remove_journal_head: freeing b_committed_data
    [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
    [577992.890918] __journal_remove_journal_head: freeing b_committed_data
    [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
    [577992.890922] __journal_remove_journal_head: freeing b_committed_data
    [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
    [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
    [577992.890928] __journal_remove_journal_head: freeing b_committed_data
    [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
    [577992.890933] __journal_remove_journal_head: freeing b_committed_data
    [577992.890939] __journal_remove_journal_head: freeing b_committed_data
    [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
    [577992.890950] Mem abort info:
    [577992.890951] ESR = 0x96000004
    [577992.890952] Exception class = DABT (current EL), IL = 32 bits
    [577992.890952] SET = 0, FnV = 0
    [577992.890953] EA = 0, S1PTW = 0
    [577992.890954] Data abort info:
    [577992.890955] ISV = 0, ISS = 0x00000004
    [577992.890956] CM = 0, WnR = 0
    [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
    [577992.890960] [0000000000000020] pgd=0000000000000000
    [577992.890964] Internal error: Oops: 96000004 [#1] SMP
    [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
    [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1
    [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
    [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
    [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
    [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
    [577992.891084] sp : ffff0000c8e2b810
    [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
    [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
    [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
    [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
    [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
    [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
    [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
    [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
    [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
    [577992.891103] x11: 0000000000000038 x10: 0101010101010101
    [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
    [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
    [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
    [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
    [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
    [577992.891116] Call trace:
    [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
    [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2]
    [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2]
    [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
    [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
    [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
    [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
    [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
    [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2]
    [577992.891316] vfs_fallocate+0x11c/0x250
    [577992.891317] ksys_fallocate+0x54/0x88
    [577992.891319] __arm64_sys_fallocate+0x28/0x38
    [577992.891323] el0_svc_common+0x78/0x130
    [577992.891325] el0_svc_handler+0x38/0x78
    [577992.891327] el0_svc+0x8/0xc

    My analysis process as follows:
    ocfs2_fallocate
      __ocfs2_change_file_space
        ocfs2_allocate_extents
          ocfs2_extend_allocation
            ocfs2_add_inode_data
              ocfs2_add_clusters_in_btree
                ocfs2_insert_extent
                  ocfs2_do_insert_extent
                    ocfs2_rotate_tree_right
                      ocfs2_extend_rotate_transaction
                        ocfs2_extend_trans
                          jbd2_journal_restart
                            jbd2__journal_restart
                              /* handle->h_transaction is NULL,
                               * is_handle_aborted(handle) is true
                               */
                              handle->h_transaction = NULL;
                              start_this_handle
                                return -EROFS;
                ocfs2_free_clusters
                  _ocfs2_free_clusters
                    _ocfs2_free_suballoc_bits
                      ocfs2_block_group_clear_bits
                        ocfs2_journal_access_gd
                          __ocfs2_journal_access
                            jbd2_journal_get_undo_access
                              /* I think jbd2_write_access_granted() will
                               * return true, because do_get_write_access()
                               * will return -EROFS.
                               */
                              if (jbd2_write_access_granted(...)) return 0;
                              do_get_write_access
                                /* handle->h_transaction is NULL, it will
                                 * return -EROFS here, so do_get_write_access()
                                 * was not called.
                                 */
                                if (is_handle_aborted(handle)) return -EROFS;
                        /* bh2jh(group_bh) is NULL, caused NULL
                           pointer dereference */
                        undo_bg = (struct ocfs2_group_desc *)
                                  bh2jh(group_bh)->b_committed_data;

    If handle->h_transaction == NULL, then jbd2_write_access_granted()
    does not really guarantee that the journal_head will stay around,
    let alone its b_committed_data. The journal_head behind bh2jh(group_bh)
    can be removed after ocfs2_journal_access_gd() returns and before
    bh2jh(group_bh)->b_committed_data is dereferenced. So we should move
    the is_handle_aborted() check from do_get_write_access() into
    jbd2_journal_get_undo_access() and jbd2_journal_get_write_access(),
    before the call to jbd2_write_access_granted().

    Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com
    Signed-off-by: Yan Wang
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jun Piao
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org

    wangyan
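    The reordering proposed above can be sketched in userspace as follows.
    Everything here is a simplified, hypothetical model (handle_model,
    buffer_model and the *_model/*_old/*_new functions are invented for
    illustration); it only shows why testing for an aborted handle before
    the fast path changes the result.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified models of a jbd2 handle and a buffer. */
struct handle_model { bool aborted; };
struct buffer_model { bool already_granted; };

static bool is_handle_aborted_model(const struct handle_model *h)
{
	return h->aborted;
}

/* Fast-path stand-in: reports that write access was granted earlier. */
static bool write_access_granted_model(const struct handle_model *h,
				       const struct buffer_model *bh)
{
	(void)h;
	return bh->already_granted;
}

/* Old ordering: the fast path can return 0 and hide the abort. */
static int get_undo_access_old(struct handle_model *h, struct buffer_model *bh)
{
	if (write_access_granted_model(h, bh))
		return 0;
	if (is_handle_aborted_model(h))	/* previously in the slow path */
		return -EROFS;
	return 0;
}

/* New ordering: an aborted handle is rejected before the fast path. */
static int get_undo_access_new(struct handle_model *h, struct buffer_model *bh)
{
	if (is_handle_aborted_model(h))
		return -EROFS;
	if (write_access_granted_model(h, bh))
		return 0;
	return 0;
}
```

    With the new ordering the caller sees -EROFS and never goes on to
    dereference the journal head of the buffer.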
     

17 Feb, 2020

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Miscellaneous ext4 bug fixes (all stable fodder)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: improve explanation of a mount failure caused by a misconfigured kernel
    jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer
    jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()
    ext4: add cond_resched() to ext4_protect_reserved_inode
    ext4: fix checksum errors with indexed dirs
    ext4: fix support for inode sizes > 1024 bytes
    ext4: simplify checking quota limits in ext4_statfs()
    ext4: don't assume that mmp_nodename/bdevname have NUL

    Linus Torvalds
     

14 Feb, 2020

2 commits

  • Commit 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from
    an older transaction") sets the BH_Freed flag when forgetting a metadata
    buffer that belongs to the committing transaction. The flag tells the
    commit process to clear the dirty bits once it is done with the buffer.
    But the commit process also clears the BH_Mapped flag at the same time,
    which may trigger the NULL pointer oops below when block_size < PAGE_SIZE.

    rmdir 1                kjournald2                           mkdir 2
                           jbd2_journal_commit_transaction
                           commit transaction N
    jbd2_journal_forget
    set_buffer_freed(bh1)
                           jbd2_journal_commit_transaction
                            commit transaction N+1
                            ...
                            clear_buffer_mapped(bh1)
                                                                ext4_getblk(bh2 unmapped)
                                                                ...
                                                                grow_dev_page
                                                                 init_page_buffers
                                                                  bh1->b_private=NULL
                                                                  bh2->b_private=NULL
                            jbd2_journal_put_journal_head(jh1)
                             __journal_remove_journal_head(bh1)
                              jh1 is NULL and triggers oops

    *) Dir entry blocks bh1 and bh2 belong to one page, and bh2 has
    already been unmapped.

    For the metadata buffer we are forgetting, we should always keep the
    mapped flag; clearing the dirty flags is enough. So this patch picks out
    these buffers and keeps their BH_Mapped flag.

    Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
    Fixes: 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
    Reviewed-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    zhangyi (F)
     
  • There is no need to delay the clearing of b_modified flag to the
    transaction committing time when unmapping the journalled buffer, so
    just move it to the journal_unmap_buffer().

    Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.com
    Reviewed-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    zhangyi (F)
     

09 Feb, 2020

1 commit

  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

03 Feb, 2020

1 commit

  • Currently, bmap() returns either the physical block number related to
    the requested file offset, or 0 if there was an error or the requested
    offset maps into a hole.
    This patch makes the changes needed for bmap() to properly return
    errors: the return value now carries the error code, and a pointer
    must be passed to bmap() to be filled with the mapped physical block.

    It will change the behavior of bmap() on return:

    - negative value in case of error
    - zero on success or map fell into a hole

    In case of a hole, the *block will be zero too

    Since this is a prep patch, for now the only error return is -EINVAL if
    ->bmap doesn't exist.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
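    The new calling convention can be illustrated with a small userspace
    model. inode_model, sector_model_t, bmap_model() and demo_mapper() are
    hypothetical names chosen for this sketch; only the return convention
    (negative errno, or 0 with *block filled in, 0 for a hole) comes from
    the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_model_t;	/* hypothetical stand-in for sector_t */

struct inode_model {
	/* Mapping callback; NULL models a filesystem without ->bmap. */
	sector_model_t (*bmap)(struct inode_model *inode, sector_model_t block);
};

/* Returns a negative errno on failure, 0 on success. On success *block is
 * replaced by the physical block number, or by 0 if the offset is a hole. */
static int bmap_model(struct inode_model *inode, sector_model_t *block)
{
	if (!inode->bmap)
		return -EINVAL;

	*block = inode->bmap(inode, *block);
	return 0;
}

/* Demo mapping: logical blocks 0..3 are mapped, everything else is a hole. */
static sector_model_t demo_mapper(struct inode_model *inode, sector_model_t block)
{
	(void)inode;
	return block < 4 ? block + 100 : 0;
}
```

    A caller therefore checks the return value first and only then
    interprets *block, which is what lets an error be told apart from a hole.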
     

25 Jan, 2020

8 commits

  • __jbd2_journal_abort_hard() is no longer used, so now we can merge
    __jbd2_journal_abort_hard() and __journal_abort_soft() into
    jbd2_journal_abort() and remove them.

    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-5-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • Commit fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") wants
    to allow the jbd2 layer to distinguish a shutdown journal abort from
    other error cases. So ESHUTDOWN should take precedence over any other
    errno that has already been recorded after EXT4_FLAGS_SHUTDOWN is set,
    but currently the errno in the journal superblock is only updated if
    the old errno is 0.

    Fixes: fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
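    A rough sketch of the precedence rule described above, assuming a Linux
    errno.h that defines ESHUTDOWN. journal_model and the two record_abort_*
    functions are hypothetical; they model only which errno ends up
    recorded, not the real jbd2 locking or superblock update.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the journal state relevant to the errno choice. */
struct journal_model {
	bool aborted;
	int j_errno;
};

/* Behaviour described as the problem: only a zero errno is overwritten. */
static void record_abort_before(struct journal_model *j, int err)
{
	j->aborted = true;
	if (j->j_errno == 0)
		j->j_errno = err;
}

/* Behaviour described as the goal: -ESHUTDOWN replaces any earlier errno. */
static void record_abort_after(struct journal_model *j, int err)
{
	j->aborted = true;
	if (j->j_errno == 0 ||
	    (err == -ESHUTDOWN && j->j_errno != -ESHUTDOWN))
		j->j_errno = err;
}
```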
     
  • The JBD2_REC_ERR flag is used to indicate that the errno has been updated
    when jbd2 aborted, and then __ext4_abort() and ext4_handle_error() can
    invoke panic if ERRORS_PANIC is specified. But if the journal has been
    aborted with zero errno, jbd2_journal_abort() doesn't set this flag, so
    we can no longer panic. Fix this by always recording the proper errno in
    the journal superblock.

    Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • When committing a journal transaction, we invoke jbd2_journal_abort()
    to abort the journal and record the errno in the jbd2 superblock for
    every failure except a failure to submit the commit record. There is no
    need for that special case, so we can invoke jbd2_journal_abort() there
    as well instead of __jbd2_journal_abort_hard().

    Fixes: 818d276ceb83a ("ext4: Add the journal checksum feature")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • If the seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.

    Script below generates endless output
    $ q=;while read -r r;do echo "$((++q)) $r";done
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com
    Signed-off-by: Theodore Ts'o

    Vasily Averin
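    A hedged illustration of the rule stated above: a .next-style callback
    should advance the position index even when it has nothing more to
    return. This is a userspace model of a single-record iterator;
    loff_model_t, the_record and both functions are invented for the sketch
    and are not the jbd2 code.

```c
#include <assert.h>
#include <stddef.h>

typedef long long loff_model_t;	/* hypothetical stand-in for loff_t */

static int the_record = 42;	/* the only record this iterator exposes */

/* start: position 0 yields the record, any later position yields nothing. */
static void *info_start_model(loff_model_t *pos)
{
	return *pos == 0 ? &the_record : NULL;
}

/* next: there is never a second record, but the index still moves forward
 * so that a later start() at the new position reports end of data. */
static void *info_next_model(void *v, loff_model_t *pos)
{
	(void)v;
	(*pos)++;
	return NULL;
}
```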
     
  • We are only allowed to call __journal_remove_journal_head() from
    jbd2_journal_put_journal_head() when jh->b_jcount == 0. This assertion
    is meaningless, just remove it.

    Signed-off-by: Shijie Luo
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200123070054.50585-1-luoshijie1@huawei.com
    Signed-off-by: Theodore Ts'o

    Shijie Luo
     
  • Fix comment and remove unnecessary blank.

    Signed-off-by: Shijie Luo
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com
    Signed-off-by: Theodore Ts'o

    Shijie Luo
     
  • Delete the duplicated words "is" in the comments

    Signed-off-by: Yan Wang
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/12087f77-ab4d-c7ba-53b4-893dbf0026f0@huawei.com
    Signed-off-by: Theodore Ts'o

    wangyan
     

18 Jan, 2020

1 commit

  • If the journal is dirty when the filesystem is mounted, jbd2 will replay
    the journal but the journal superblock will not be updated by
    journal_reset() because JBD2_ABORT flag is still set (it was set in
    journal_init_common()). This is problematic because when a new transaction
    is then committed, it will be recorded in block 1 (journal->j_tail was set
    to 1 in journal_reset()). If unclean shutdown happens again before the
    journal superblock is updated, the new recorded transaction will not be
    replayed during the next mount (because of stale sb->s_start and
    sb->s_sequence values) which can lead to filesystem corruption.

    Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail")
    Signed-off-by: Kai Li
    Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com
    Signed-off-by: Theodore Ts'o

    Kai Li
     

01 Dec, 2019

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "This merge window saw the following new features added to ext4:

    - Direct I/O via iomap (required the iomap-for-next branch from
    Darrick as a prereq).

    - Support for using dioread-nolock where the block size < page size.

    - Support for encryption for file systems where the block size < page
    size.

    - Rework of journal credits handling so a revoke-heavy workload will
    not cause the journal to run out of space.

    - Replace bit-spinlocks with spinlocks in jbd2

    Also included were some bug fixes and cleanups, mostly to clean up
    corner cases from fuzzed file systems and error path handling"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
    ext4: work around deleting a file with i_nlink == 0 safely
    ext4: add more paranoia checking in ext4_expand_extra_isize handling
    jbd2: make jbd2_handle_buffer_credits() handle reserved handles
    ext4: fix a bug in ext4_wait_for_tail_page_commit
    ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
    ext4: code cleanup for get_next_id
    ext4: fix leak of quota reservations
    ext4: remove unused variable warning in parse_options()
    ext4: Enable encryption for subpage-sized blocks
    fs/buffer.c: support fscrypt in block_read_full_page()
    ext4: Add error handling for io_end_vec struct allocation
    jbd2: Fine tune estimate of necessary descriptor blocks
    jbd2: Provide trace event for handle restarts
    ext4: Reserve revoke credits for freed blocks
    jbd2: Make credit checking more strict
    jbd2: Rename h_buffer_credits to h_total_credits
    jbd2: Reserve space for revoke descriptor blocks
    jbd2: Drop jbd2_space_needed()
    jbd2: Account descriptor blocks into t_outstanding_credits
    jbd2: Factor out common parts of stopping and restarting a handle
    ...

    Linus Torvalds
     

06 Nov, 2019

15 commits

  • Theodore Ts'o
     
  • Currently we reserve j_max_transaction_buffers / 32 for transaction
    descriptor blocks. Now that revoke descriptors are accounted for
    separately this estimate is unnecessarily high and we can actually
    compute much tighter estimate. In the common case of 32k journal blocks
    and 4k blocksize this actually reduces the amount of reserved descriptor
    blocks from 256 to ~25 which allows us to fit more real data into a
    transaction.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-25-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
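    A back-of-the-envelope check of the figures above. It assumes that the
    per-transaction buffer limit is a quarter of the journal (which is what
    makes 256 come out of a 32k-block journal), and it assumes 48 bytes of
    per-block overhead and 12-byte tags. The function names and byte counts
    are illustrative parameters, not values taken from the patch.

```c
#include <assert.h>

/* Old estimate described above: 1/32 of the per-transaction buffer limit. */
static int old_descriptor_estimate(int max_transaction_buffers)
{
	return max_transaction_buffers / 32;
}

/* Tighter estimate: one commit block plus enough descriptor blocks to hold
 * one tag per buffer. overhead_bytes and tag_bytes are caller-supplied
 * assumptions, not kernel constants. */
static int tight_descriptor_estimate(int max_transaction_buffers,
				     int blocksize, int overhead_bytes,
				     int tag_bytes)
{
	int tags_per_block = (blocksize - overhead_bytes) / tag_bytes;

	/* Round up: a partly filled descriptor block still costs a block. */
	return 1 + (max_transaction_buffers + tags_per_block - 1) / tags_per_block;
}
```

    With 32768 journal blocks the assumed limit is 8192 buffers, so the old
    estimate is 8192 / 32 = 256. Under the assumed byte counts a 4k block
    holds 337 tags and the tighter estimate comes to 26 blocks, in line with
    the "~25" quoted above.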
     
  • Provide trace event for handle restarts to ease debugging.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-24-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Make checking of available credits in jbd2_journal_dirty_metadata() more
    strict. There should be always enough credits in the handle to write all
    potential revoke descriptors. Also we warn in case there are not enough
    credits since this is a bug in the filesystem.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-22-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • The credit counter now contains both buffer and revoke descriptor block
    credits. Rename the counter to h_total_credits to reflect that. No
    functional change.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-21-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Extend functions for starting, extending, and restarting transaction
    handles to take number of revoke records handle must be able to
    accommodate. These functions then make sure transaction has enough
    credits to be able to store resulting revoke descriptor blocks. Also
    revoke code tracks number of revoke records created by a handle to catch
    situation where some place didn't reserve enough space for revoke
    records. Similarly to standard transaction credits, space for unused
    reserved revoke records is released when the handle is stopped.

    On the ext4 side we currently take a simplistic approach of reserving
    space for 1024 revoke records for any transaction. This grows amount of
    credits reserved for each handle only by a few and is enough for any
    normal workload so that we don't hit warnings in jbd2. We will refine
    the logic in following commits.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-20-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • The function is now just a trivial wrapper returning
    journal->j_max_transaction_buffers. Drop it.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-19-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, journal descriptor blocks are not accounted in
    transaction->t_outstanding_credits and we just leave some slack
    space in the journal for them (in jbd2_log_space_left() and
    jbd2_space_needed()). This makes proper accounting (and the reservation
    we want to add) of descriptor blocks difficult, so switch to accounting
    descriptor blocks in transaction->t_outstanding_credits and just reserve
    the same amount of credits in t_outstanding_credits for journal
    descriptor blocks when creating a transaction.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-18-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • jbd2__journal_restart() has quite some code that is common with
    jbd2_journal_stop(). Factor this functionality into stop_this_handle()
    helper and use it from both functions. Note that this also drops
    t_handle_lock protection from jbd2__journal_restart() as
    jbd2_journal_stop() does the same thing without it.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-17-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When we drop last handle from a transaction and journal->j_barrier_count
    > 0, jbd2_journal_stop() wakes up journal->j_wait_transaction_locked
    wait queue. This looks pointless - wait for outstanding handles always
    happens on journal->j_wait_updates waitqueue.
    journal->j_wait_transaction_locked is used to wait for transaction state
    changes and by start_this_handle() for waiting until
    journal->j_barrier_count drops to 0. The first case is clearly
    irrelevant here since only jbd2 thread changes transaction state. The
    second case looks related but jbd2_journal_unlock_updates() is
    responsible for the wakeup in this case. So just drop the wakeup.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-16-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • If a transaction is larger than journal->j_max_transaction_buffers, that
    is a bug and not a trigger for transaction commit. Also the very next
    attempt to start new handle will start transaction commit anyway. So
    just remove the pointless check. Arguably, we could start transaction
    commit whenever the transaction size is *close* to
    journal->j_max_transaction_buffers. This has a potential to reduce
    latency of the next jbd2_journal_start() at the cost of somewhat smaller
    transactions. However for this to have any effect, it would mean that
    there isn't someone already waiting in jbd2_journal_start() which means
    metadata load for the fs is pretty light anyway so probably this
    optimization is not worth it.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-15-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Move code in jbd2_journal_stop() around a bit. It removes some
    unnecessary code duplication and will make factoring out parts common
    with jbd2__journal_restart() easier.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-14-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • jbd2 statistics counting number of blocks logged in a transaction was
    wrong. It didn't count the commit block and more importantly it didn't
    count revoke descriptor blocks. Make sure these get properly counted.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • With 32-bit block numbers, we don't allocate the array for journal
    buffer heads large enough for corresponding descriptor tags to fill the
    descriptor block. Thus we end up writing out half-full descriptor blocks
    to the journal unnecessarily growing the transaction. Fix the logic to
    allocate the array large enough.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • jbd2_journal_next_log_block() does not look at
    transaction->t_outstanding_credits. Remove the misleading comment.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

21 Oct, 2019

6 commits

  • On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n,
    i.e. they disable preemption. That means functions which are not safe to be
    called in preempt disabled context on RT trigger a might_sleep() assert.

    The journal head bit spinlock is mostly held for short code sequences with
    trivial RT safe functionality, except for one place:

    jbd2_journal_put_journal_head() invokes __journal_remove_journal_head()
    with the journal head bit spinlock held. __journal_remove_journal_head()
    invokes kmem_cache_free() which must not be called with preemption disabled
    on RT.

    Jan suggested to rework the removal function so the actual free happens
    outside the bit-spinlocked region.

    Split it into two parts:

    - Do the sanity checks and the buffer head detach under the lock

    - Do the actual free after dropping the lock

    There is error case handling in the free part which needs to dereference
    the b_size field of the now detached buffer head. Due to paranoia (caused
    by ignorance) the size is retrieved in the detach function and handed into
    the free function. Might be over-engineered, but better safe than sorry.

    This makes the journal head bit-spinlock usage RT compliant and also avoids
    nested locking which is not covered by lockdep.

    Suggested-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Cc: linux-ext4@vger.kernel.org
    Cc: "Theodore Ts'o"
    Cc: Jan Kara
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-8-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Thomas Gleixner
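    The two-part split described above (detach under the lock, free after
    dropping it) can be sketched in userspace like this. A C11 atomic_flag
    spin loop stands in for the bit spinlock, and jh_model, bh_model and all
    *_model functions are hypothetical simplifications rather than the
    kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct jh_model { int b_jcount; };

struct bh_model {
	atomic_flag state_lock;		/* stands in for the bit spinlock */
	struct jh_model *b_private;
	size_t b_size;
};

static void lock_model(struct bh_model *bh)
{
	while (atomic_flag_test_and_set_explicit(&bh->state_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void unlock_model(struct bh_model *bh)
{
	atomic_flag_clear_explicit(&bh->state_lock, memory_order_release);
}

/* Part 1: sanity checks and detach. The caller holds the lock. */
static struct jh_model *detach_model(struct bh_model *bh, size_t *b_size)
{
	struct jh_model *jh = bh->b_private;

	assert(jh && jh->b_jcount == 0);
	*b_size = bh->b_size;	/* read while the buffer is still attached */
	bh->b_private = NULL;
	return jh;
}

/* Part 2: the actual free. Runs with no lock held. */
static void free_model(struct jh_model *jh, size_t b_size)
{
	(void)b_size;	/* the original only needs it for error handling */
	free(jh);
}

static void put_journal_head_model(struct bh_model *bh)
{
	struct jh_model *jh = NULL;
	size_t b_size = 0;

	lock_model(bh);
	if (--bh->b_private->b_jcount == 0)
		jh = detach_model(bh, &b_size);
	unlock_model(bh);

	if (jh)
		free_model(jh, b_size);
}
```

    Passing b_size out of the detach step mirrors the commit's choice to
    capture the size before the buffer head and journal head are separated.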
     
  • Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep
    on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held
    region because bit spinlocks disable preemption even on RT.

    A first attempt was to replace state lock with a spinlock placed in struct
    buffer_head and make the locking conditional on PREEMPT_RT and
    DEBUG_BIT_SPINLOCKS.

    Jan pointed out that there is a 4 byte hole in struct journal_head where a
    regular spinlock fits in and he would not object to convert the state lock
    to a spinlock unconditionally.

    Aside of solving the RT problem, this also gains lockdep coverage for the
    journal head state lock (bit-spinlocks are not covered by lockdep as it's
    hard to fit a lockdep map into a single bit).

    The trivial change would have been to convert the jbd_*lock_bh_state()
    inlines, but that comes with the downside that these functions take a
    buffer head pointer which needs to be converted to a journal head pointer
    which adds another level of indirection.

    As almost all functions which use this lock have a journal head pointer
    readily available, it makes more sense to remove the lock helper inlines
    and write out spin_*lock() at all call sites.

    Fixup all locking comments as well.

    Suggested-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joseph Qi
    Cc: Joel Becker
    Cc: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Link: https://lore.kernel.org/r/20190809124233.13277-7-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Thomas Gleixner
     
  • jbd2_journal_forget() jumps to 'not_jbd' branch which calls __bforget()
    in cases where the buffer is clean which is pointless. In case of failed
    assertion, it can be even argued that it is safer not to touch buffer's
    dirty bits. Also logically it makes more sense to just jump to 'drop'
    and that will make logic also simpler when we switch bh_state_lock to a
    spinlock.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-6-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • We have cleared both dirty & jbddirty bits from the bh. So there's no
    difference between bforget() and brelse(). Thus there's no point jumping
    to the not_jbd branch.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-5-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • __jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() drop
    transaction's jh reference when they remove jh from a transaction. This
    will be however inconvenient once we move state lock into journal_head
    itself as we still need to unlock it and we'd need to grab jh reference
    just for that. Move dropping of jh reference out of these functions into
    the few callers.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-4-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • journal_unmap_buffer() checks first whether the buffer head is a journal.
    If so it takes locks and then invokes jbd2_journal_grab_journal_head()
    followed by another check whether this is journal head buffer.

    The double checking is pointless.

    Replace the initial check with jbd2_journal_grab_journal_head() which
    already checks whether the buffer head is actually a journal.

    Allows also early access to the journal head pointer for the upcoming
    conversion of state lock to a regular spinlock.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Cc: "Theodore Ts'o"
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Thomas Gleixner
     

09 Oct, 2019

1 commit

  • Since the following commit:

    b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

    @nested is no longer used in lock_release(), so remove it from all
    lock_release() calls and friends.

    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Acked-by: Daniel Vetter
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: alexander.levin@microsoft.com
    Cc: daniel@iogearbox.net
    Cc: davem@davemloft.net
    Cc: dri-devel@lists.freedesktop.org
    Cc: duyuyang@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: hannes@cmpxchg.org
    Cc: intel-gfx@lists.freedesktop.org
    Cc: jack@suse.com
    Cc: jlbec@evilplan.or
    Cc: joonas.lahtinen@linux.intel.com
    Cc: joseph.qi@linux.alibaba.com
    Cc: jslaby@suse.com
    Cc: juri.lelli@redhat.com
    Cc: maarten.lankhorst@linux.intel.com
    Cc: mark@fasheh.com
    Cc: mhocko@kernel.org
    Cc: mripard@kernel.org
    Cc: ocfs2-devel@oss.oracle.com
    Cc: rodrigo.vivi@intel.com
    Cc: sean@poorly.run
    Cc: st@kernel.org
    Cc: tj@kernel.org
    Cc: tytso@mit.edu
    Cc: vdavydov.dev@gmail.com
    Cc: vincent.guittot@linaro.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar

    Qian Cai