16 Jun, 2020

1 commit

  • Pull more ext4 updates from Ted Ts'o:
    "This is the second round of ext4 commits for 5.8 merge window [1].

    It includes the per-inode DAX support, which was dependent on the DAX
    infrastructure which came in via the XFS tree, and a number of
    regression and bug fixes; most notably the "BUG: using
    smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
    by syzkaller"

    [1] The pull request actually came in 15 minutes after I had tagged the
    rc1 release. Tssk, tssk, late.. - Linus

    * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
    ext4: support xattr gnu.* namespace for the Hurd
    ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
    ext4: avoid utf8_strncasecmp() with unstable name
    ext4: stop overwrite the errcode in ext4_setup_super
    ext4: fix partial cluster initialization when splitting extent
    ext4: avoid race conditions when remounting with options that change dax
    Documentation/dax: Update DAX enablement for ext4
    fs/ext4: Introduce DAX inode flag
    fs/ext4: Remove jflag variable
    fs/ext4: Make DAX mount option a tri-state
    fs/ext4: Only change S_DAX on inode load
    fs/ext4: Update ext4_should_use_dax()
    fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
    fs/ext4: Disallow verity if inode is DAX
    fs/ext4: Narrow scope of DAX check in setflags

    Linus Torvalds
     

13 Jun, 2020

1 commit

  • In the ext4 filesystem with errors=panic, if one process is recording
    the errno in the superblock when invoking jbd2_journal_abort() due to
    some error case, it can race with another __ext4_abort() which sets the
    SB_RDONLY flag but skips the panic because the errno has not been
    recorded yet.

    jbd2_journal_commit_transaction()
     jbd2_journal_abort()
      journal->j_flags |= JBD2_ABORT;
      jbd2_journal_update_sb_errno()
                                        | ext4_journal_check_start()
                                        |  __ext4_abort()
                                        |   sb->s_flags |= SB_RDONLY;
                                        |   if (!JBD2_REC_ERR)
                                        |        return;
      journal->j_flags |= JBD2_REC_ERR;

    As a result, the panic is never triggered because the filesystem has
    already been set read-only. Fix this by introducing j_abort_mutex to make
    sure the journal abort is completed before the panic, and remove the
    JBD2_REC_ERR flag.

    Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o
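
    The locking idea described above can be modelled in userspace. This is
    only a sketch under stated assumptions, not the kernel patch: the
    model_* names, the MODEL_ABORT flag and the use of a pthread mutex in
    place of j_abort_mutex are all hypothetical. It shows that once the
    whole abort sequence runs under one mutex, an error handler that also
    goes through the abort path cannot see the journal aborted without the
    errno being recorded.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define MODEL_ABORT 0x1 /* hypothetical stand-in for JBD2_ABORT */

struct model_journal {
    pthread_mutex_t abort_mutex; /* stand-in for j_abort_mutex */
    unsigned int flags;
    int sb_errno;                /* errno recorded in the journal superblock */
};

void model_journal_init(struct model_journal *j)
{
    pthread_mutex_init(&j->abort_mutex, NULL);
    j->flags = 0;
    j->sb_errno = 0;
}

/* Abort the journal; flag update and errno recording share one critical section. */
void model_journal_abort(struct model_journal *j, int err)
{
    pthread_mutex_lock(&j->abort_mutex);
    if (!(j->flags & MODEL_ABORT)) {
        j->flags |= MODEL_ABORT;
        j->sb_errno = err;       /* recorded before anyone may panic */
    }
    pthread_mutex_unlock(&j->abort_mutex);
}

/* Error handler: returns 1 where the kernel would panic with errors=panic. */
int model_handle_error(struct model_journal *j, int errors_panic)
{
    /* Waits for any concurrent abort to finish recording its errno. */
    model_journal_abort(j, -EIO);
    return errors_panic && j->sb_errno != 0;
}
```

    Because the handler no longer checks a separate "errno recorded" flag,
    there is no window in which the read-only state is visible but the
    panic is skipped.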

    zhangyi (F)
     

06 Jun, 2020

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A lot of bug fixes and cleanups for ext4, including:

    - Fix performance problems found in dioread_nolock now that it is the
    default, caused by transaction leaks.

    - Clean up fiemap handling in ext4

    - Clean up and refactor multiple block allocator (mballoc) code

    - Fix a problem with mballoc with smaller file systems running out
    of blocks because they couldn't properly use blocks that had been
    reserved by inode preallocation.

    - Fixed a race in ext4_sync_parent() versus rename()

    - Simplify the error handling in the extent manipulation code

    - Make sure all metadata I/O errors are reflected to
    ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

    - Avoid passing an error pointer to brelse in ext4_xattr_set()

    - Fix race which could result in freeing an inode on the dirty list
    in data=journal mode.

    - Fix refcount handling if ext4_iget() fails

    - Fix a crash in generic/019 caused by a corrupted extent node"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
    ext4: avoid unnecessary transaction starts during writeback
    ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
    ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
    fs: remove the access_ok() check in ioctl_fiemap
    fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
    fs: move fiemap range validation into the file systems instances
    iomap: fix the iomap_fiemap prototype
    fs: move the fiemap definitions out of fs.h
    fs: mark __generic_block_fiemap static
    ext4: remove the call to fiemap_check_flags in ext4_fiemap
    ext4: split _ext4_fiemap
    ext4: fix fiemap size checks for bitmap files
    ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
    add comment for ext4_dir_entry_2 file_type member
    jbd2: avoid leaking transaction credits when unreserving handle
    ext4: drop ext4_journal_free_reserved()
    ext4: mballoc: use lock for checking free blocks while retrying
    ext4: mballoc: refactor ext4_mb_good_group()
    ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
    ext4: mballoc: refactor ext4_mb_discard_preallocations()
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • When a reserved transaction handle is unused, we subtract its reserved
    credits in __jbd2_journal_unreserve_handle() called from
    jbd2_journal_stop(). However, this function forgets to remove the reserved
    credits from transaction->t_outstanding_credits, and thus the transaction
    space that was reserved remains effectively leaked. The leaked
    transaction space can be quite significant in some cases and leads to
    unnecessarily small transactions, which reduces the throughput of the
    journalling machinery. E.g. an fsmark workload creating lots of 4k files
    was observed to have about 20% lower throughput due to this when ext4 is
    mounted with the dioread_nolock mount option.

    Subtract reserved credits from t_outstanding_credits as well.

    CC: stable@vger.kernel.org
    Fixes: 8f7d89f36829 ("jbd2: transaction reservation support")
    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o
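
    A minimal userspace sketch of the accounting described above. The
    credit_* structures are hypothetical simplifications (plain ints rather
    than the kernel's atomic counters); only the field names come from the
    commit message. The second subtraction is the step the fix adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the jbd2 structures. */
struct credit_journal {
    int j_reserved_credits;     /* credits held by reserved handles */
};

struct credit_transaction {
    int t_outstanding_credits;  /* credits handles may still consume */
};

struct credit_handle {
    int h_total_credits;
    int h_reserved;
};

/* Release an unused reserved handle. */
void credit_unreserve_handle(struct credit_journal *j,
                             struct credit_transaction *t,
                             struct credit_handle *h)
{
    j->j_reserved_credits -= h->h_total_credits;
    /* Without this, the reserved space stays counted against the transaction. */
    if (t != NULL)
        t->t_outstanding_credits -= h->h_total_credits;
    h->h_total_credits = 0;
    h->h_reserved = 0;
}
```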

    Jan Kara
     

22 May, 2020

1 commit


06 Mar, 2020

1 commit

  • Improve comments in jbd2_journal_commit_transaction() to describe why
    we don't need to clear the buffer_mapped bit for freeing file mapping
    buffers whose page mapping is NULL.

    Link: https://lore.kernel.org/r/20200217112706.20085-1-yi.zhang@huawei.com
    Fixes: c96dceeabf76 ("jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer")
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     

01 Mar, 2020

1 commit

  • journal_head::b_transaction and journal_head::b_next_transaction could
    be accessed concurrently as noticed by KCSAN,

    LTP: starting fsync04
    /dev/zero: Can't open blockdev
    EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
    EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
    ==================================================================
    BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

    write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
    __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
    __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
    jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
    (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
    kjournald2+0x13b/0x450 [jbd2]
    kthread+0x1cd/0x1f0
    ret_from_fork+0x27/0x50

    read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
    jbd2_write_access_granted+0x1b2/0x250 [jbd2]
    jbd2_write_access_granted at fs/jbd2/transaction.c:1155
    jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
    __ext4_journal_get_write_access+0x50/0x90 [ext4]
    ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
    ext4_mb_new_blocks+0x54f/0xca0 [ext4]
    ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
    ext4_map_blocks+0x3b4/0x950 [ext4]
    _ext4_get_block+0xfc/0x270 [ext4]
    ext4_get_block+0x3b/0x50 [ext4]
    __block_write_begin_int+0x22e/0xae0
    __block_write_begin+0x39/0x50
    ext4_write_begin+0x388/0xb50 [ext4]
    generic_perform_write+0x15d/0x290
    ext4_buffered_write_iter+0x11f/0x210 [ext4]
    ext4_file_write_iter+0xce/0x9e0 [ext4]
    new_sync_write+0x29c/0x3b0
    __vfs_write+0x92/0xa0
    vfs_write+0x103/0x260
    ksys_write+0x9d/0x130
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    5 locks held by fsync04/25724:
    #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
    #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
    #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
    #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
    #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
    irq event stamp: 1407125
    hardirqs last enabled at (1407125): [] __find_get_block+0x107/0x790
    hardirqs last disabled at (1407124): [] __find_get_block+0x49/0x790
    softirqs last enabled at (1405528): [] __do_softirq+0x34c/0x57c
    softirqs last disabled at (1405521): [] irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    The plain reads are outside of the jh->b_state_lock critical section,
    which results in data races. Fix them by adding pairs of
    READ|WRITE_ONCE().

    Reviewed-by: Jan Kara
    Signed-off-by: Qian Cai
    Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
    Signed-off-by: Theodore Ts'o
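
    The pairing of marked accesses can be illustrated in userspace. This is
    a sketch, not the kernel code: MODEL_READ_ONCE/MODEL_WRITE_ONCE are
    assumed approximations of the kernel macros built on a volatile cast
    (they rely on the GCC/Clang __typeof__ extension), and the once_*
    structures are hypothetical reductions of journal_head.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximations of READ_ONCE()/WRITE_ONCE() (assumption). */
#define MODEL_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct once_txn { int tid; };

struct once_jh {
    struct once_txn *b_transaction;
    struct once_txn *b_next_transaction;
};

/* Writer side (runs under a lock in the kernel): promote the next transaction. */
void once_refile(struct once_jh *jh)
{
    MODEL_WRITE_ONCE(jh->b_transaction, jh->b_next_transaction);
    MODEL_WRITE_ONCE(jh->b_next_transaction, NULL);
}

/* Lockless reader: one marked load per field, so the compiler cannot tear or refetch it. */
int once_write_access_granted(struct once_jh *jh, struct once_txn *mine)
{
    return MODEL_READ_ONCE(jh->b_transaction) == mine ||
           MODEL_READ_ONCE(jh->b_next_transaction) == mine;
}
```

    The marking does not add ordering or mutual exclusion; it only tells
    the compiler (and KCSAN) that the concurrent access is intentional.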

    Qian Cai
     

22 Feb, 2020

1 commit

  • I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
    The running environment:
    kernel version: 4.19
    A cluster with two nodes, 5 LUNs mounted on both nodes, running file
    operations such as dd/fallocate/truncate/rm on every LUN while the
    storage network is disconnected.

    The fallocate operation on dm-23-45 caused a NULL pointer dereference.

    The details of the NULL pointer dereference are as follows:
    [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
    [577992.878290] Aborting journal on device dm-23-45.
    ...
    [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
    [577992.890908] __journal_remove_journal_head: freeing b_committed_data
    [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
    [577992.890918] __journal_remove_journal_head: freeing b_committed_data
    [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
    [577992.890922] __journal_remove_journal_head: freeing b_committed_data
    [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
    [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
    [577992.890928] __journal_remove_journal_head: freeing b_committed_data
    [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
    [577992.890933] __journal_remove_journal_head: freeing b_committed_data
    [577992.890939] __journal_remove_journal_head: freeing b_committed_data
    [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
    [577992.890950] Mem abort info:
    [577992.890951] ESR = 0x96000004
    [577992.890952] Exception class = DABT (current EL), IL = 32 bits
    [577992.890952] SET = 0, FnV = 0
    [577992.890953] EA = 0, S1PTW = 0
    [577992.890954] Data abort info:
    [577992.890955] ISV = 0, ISS = 0x00000004
    [577992.890956] CM = 0, WnR = 0
    [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
    [577992.890960] [0000000000000020] pgd=0000000000000000
    [577992.890964] Internal error: Oops: 96000004 [#1] SMP
    [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
    [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1
    [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
    [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
    [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
    [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
    [577992.891084] sp : ffff0000c8e2b810
    [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
    [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
    [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
    [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
    [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
    [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
    [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
    [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
    [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
    [577992.891103] x11: 0000000000000038 x10: 0101010101010101
    [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
    [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
    [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
    [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
    [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
    [577992.891116] Call trace:
    [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
    [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2]
    [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2]
    [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
    [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
    [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
    [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
    [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
    [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2]
    [577992.891316] vfs_fallocate+0x11c/0x250
    [577992.891317] ksys_fallocate+0x54/0x88
    [577992.891319] __arm64_sys_fallocate+0x28/0x38
    [577992.891323] el0_svc_common+0x78/0x130
    [577992.891325] el0_svc_handler+0x38/0x78
    [577992.891327] el0_svc+0x8/0xc

    My analysis is as follows:
    ocfs2_fallocate
      __ocfs2_change_file_space
        ocfs2_allocate_extents
          ocfs2_extend_allocation
            ocfs2_add_inode_data
              ocfs2_add_clusters_in_btree
                ocfs2_insert_extent
                  ocfs2_do_insert_extent
                    ocfs2_rotate_tree_right
                      ocfs2_extend_rotate_transaction
                        ocfs2_extend_trans
                          jbd2_journal_restart
                            jbd2__journal_restart
                              /* handle->h_transaction is NULL,
                               * is_handle_aborted(handle) is true
                               */
                              handle->h_transaction = NULL;
                              start_this_handle
                                return -EROFS;
                ocfs2_free_clusters
                  _ocfs2_free_clusters
                    _ocfs2_free_suballoc_bits
                      ocfs2_block_group_clear_bits
                        ocfs2_journal_access_gd
                          __ocfs2_journal_access
                            jbd2_journal_get_undo_access
                              /* I think jbd2_write_access_granted() will
                               * return true, because do_get_write_access()
                               * will return -EROFS.
                               */
                              if (jbd2_write_access_granted(...)) return 0;
                              do_get_write_access
                                /* handle->h_transaction is NULL, it will
                                 * return -EROFS here, so do_get_write_access()
                                 * was not called.
                                 */
                                if (is_handle_aborted(handle)) return -EROFS;
                        /* bh2jh(group_bh) is NULL, caused NULL
                           pointer dereference */
                        undo_bg = (struct ocfs2_group_desc *)
                                    bh2jh(group_bh)->b_committed_data;

    If handle->h_transaction == NULL, then jbd2_write_access_granted()
    does not really guarantee that the journal_head will stay around,
    let alone its b_committed_data. The bh2jh(group_bh) can be removed
    after ocfs2_journal_access_gd() and before the access to
    "bh2jh(group_bh)->b_committed_data". So, we should move the
    is_handle_aborted() check from do_get_write_access() into
    jbd2_journal_get_undo_access() and jbd2_journal_get_write_access(),
    before the call to jbd2_write_access_granted().

    Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com
    Signed-off-by: Yan Wang
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jun Piao
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org
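
    The reordering proposed above can be sketched in userspace. Everything
    here is a simplified assumption: abort_handle and abort_jh are
    hypothetical reductions of the jbd2 handle and journal_head, and the
    fast path is reduced to a single pointer comparison. The point is only
    that the abort check now runs before the fast path, so a handle with a
    NULL transaction can no longer "match" a journal_head that also has no
    transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct abort_handle {
    void *h_transaction;   /* NULL once a restart has failed */
    int h_aborted;
};

struct abort_jh {
    void *b_transaction;
};

int abort_is_handle_aborted(const struct abort_handle *h)
{
    return h->h_aborted || h->h_transaction == NULL;
}

int abort_get_undo_access(const struct abort_handle *h,
                          const struct abort_jh *jh)
{
    /* Check moved to the top: aborted handles never reach the fast path. */
    if (abort_is_handle_aborted(h))
        return -EROFS;
    /* Fast path: the buffer already belongs to our transaction. */
    if (jh->b_transaction == h->h_transaction)
        return 0;
    /* The slow path is elided in this model. */
    return 0;
}
```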

    wangyan
     

17 Feb, 2020

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Miscellaneous ext4 bug fixes (all stable fodder)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: improve explanation of a mount failure caused by a misconfigured kernel
    jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer
    jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()
    ext4: add cond_resched() to ext4_protect_reserved_inode
    ext4: fix checksum errors with indexed dirs
    ext4: fix support for inode sizes > 1024 bytes
    ext4: simplify checking quota limits in ext4_statfs()
    ext4: don't assume that mmp_nodename/bdevname have NUL

    Linus Torvalds
     

14 Feb, 2020

2 commits

  • Commit 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from
    an older transaction") sets the BH_Freed flag when forgetting a metadata
    buffer which belongs to the committing transaction, to indicate that the
    committing process should clear the dirty bits when it is done with the
    buffer. But it also clears the BH_Mapped flag at the same time, which may
    trigger the NULL pointer oops below when block_size < PAGE_SIZE.

    rmdir 1             kjournald2                 mkdir 2
                        jbd2_journal_commit_transaction
                         commit transaction N
    jbd2_journal_forget
    set_buffer_freed(bh1)
                        jbd2_journal_commit_transaction
                         commit transaction N+1
                         ...
                         clear_buffer_mapped(bh1)
                                                   ext4_getblk(bh2 unmapped)
                                                   ...
                                                   grow_dev_page
                                                    init_page_buffers
                                                     bh1->b_private=NULL
                                                     bh2->b_private=NULL
                         jbd2_journal_put_journal_head(jh1)
                          __journal_remove_journal_head(bh1)
                           jh1 is NULL and triggers an oops

    *) Dir entry blocks bh1 and bh2 belong to one page, and bh2 has
    already been unmapped.

    For the metadata buffers we are forgetting, we should always keep the
    mapped flag; clearing the dirty flags is enough. So this patch picks out
    these buffers and keeps their BH_Mapped flag.

    Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
    Fixes: 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
    Reviewed-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
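
    The flag handling described above, reduced to a bitmask model. The
    MODEL_BH_* constants and the is_metadata parameter are hypothetical;
    the real code works on buffer_head state bits and decides by looking at
    the page mapping. The sketch only shows the rule: a freed buffer always
    loses its dirty and freed bits, but only non-metadata buffers are also
    unmapped.

```c
#include <assert.h>

/* Hypothetical stand-ins for the buffer_head state bits. */
enum {
    MODEL_BH_MAPPED   = 1u << 0,
    MODEL_BH_JBDDIRTY = 1u << 1,
    MODEL_BH_FREED    = 1u << 2,
};

/* Returns the buffer state after the commit code is done with a buffer. */
unsigned int model_commit_freed_buffer(unsigned int state, int is_metadata)
{
    if (!(state & MODEL_BH_FREED))
        return state;                        /* not forgotten: nothing to do */
    state &= ~(unsigned int)(MODEL_BH_FREED | MODEL_BH_JBDDIRTY);
    if (!is_metadata)
        state &= ~(unsigned int)MODEL_BH_MAPPED; /* only data buffers are unmapped */
    return state;
}
```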

    zhangyi (F)
     
  • There is no need to delay the clearing of b_modified flag to the
    transaction committing time when unmapping the journalled buffer, so
    just move it to the journal_unmap_buffer().

    Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.com
    Reviewed-by: Jan Kara
    Signed-off-by: zhangyi (F)
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    zhangyi (F)
     

09 Feb, 2020

1 commit

  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

03 Feb, 2020

1 commit

  • Currently, bmap() returns either the physical block number related to
    the requested file offset, or 0 if an error occurred or the requested
    offset maps into a hole.
    This patch makes the changes needed for bmap() to properly return
    errors: the return value becomes an error code, and a pointer must now
    be passed to bmap() to be filled with the mapped physical block.

    It will change the behavior of bmap() on return:

    - negative value in case of error
    - zero on success or map fell into a hole

    In case of a hole, the *block will be zero too.

    Since this is a prep patch, for now the only error return is -EINVAL if
    ->bmap doesn't exist.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro
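
    The new calling convention can be sketched in userspace as follows.
    The model_* types and the demo mapping function are hypothetical; only
    the convention itself (error code in the return value, block number
    passed in and out through a pointer, -EINVAL when there is no ->bmap)
    is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_sector_t;

struct model_inode {
    /* Hypothetical ->bmap method: physical block, or 0 for a hole. */
    model_sector_t (*bmap)(struct model_inode *inode, model_sector_t block);
};

/* Demo mapping: logical blocks 0..3 live at 100..103, the rest is a hole. */
model_sector_t model_demo_bmap(struct model_inode *inode, model_sector_t block)
{
    (void)inode;
    return block < 4 ? 100 + block : 0;
}

/* *block carries the logical block in and the physical block out. */
int model_bmap(struct model_inode *inode, model_sector_t *block)
{
    if (inode->bmap == NULL)
        return -EINVAL;
    *block = inode->bmap(inode, *block);
    return 0;
}
```

    A caller can now tell "no ->bmap method" (negative return) apart from
    "hole" (zero return, *block == 0), which the old single return value
    could not express.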

    Carlos Maiolino
     

25 Jan, 2020

8 commits

  • __jbd2_journal_abort_hard() is no longer used, so now we can merge
    __jbd2_journal_abort_hard() and __journal_abort_soft() into
    jbd2_journal_abort() and remove them.

    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-5-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • Commit fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") wants
    to allow the jbd2 layer to distinguish a shutdown journal abort from
    other error cases. So ESHUTDOWN should take precedence over any other
    errno which has already been recorded after EXT4_FLAGS_SHUTDOWN is set,
    but currently the errno in the journal superblock is only updated if the
    old errno is 0.

    Fixes: fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o
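
    The precedence rule described above can be written as a small pure
    function. This is a sketch of the rule only: model_abort_errno() is a
    hypothetical helper, not a function in jbd2, and the fallback value
    for ESHUTDOWN is an assumption for platforms whose errno.h lacks it.

```c
#include <assert.h>
#include <errno.h>

#ifndef ESHUTDOWN
#define ESHUTDOWN 108 /* assumed fallback (Linux value) */
#endif

/*
 * Decide which errno ends up in the journal superblock.
 * recorded: errno already stored (0 if none); incoming: errno of this abort.
 */
int model_abort_errno(int recorded, int incoming)
{
    if (recorded == 0)
        return incoming;              /* first error is always recorded */
    if (recorded != -ESHUTDOWN && incoming == -ESHUTDOWN)
        return incoming;              /* shutdown overrides an older errno */
    return recorded;                  /* otherwise keep the first error */
}
```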

    zhangyi (F)
     
  • The JBD2_REC_ERR flag is used to indicate that the errno has been updated
    when jbd2 aborted, so that __ext4_abort() and ext4_handle_error() can
    invoke panic if ERRORS_PANIC is specified. But if the journal has been
    aborted with zero errno, jbd2_journal_abort() didn't set this flag so we
    can no longer panic. Fix this by always recording the proper errno in the
    journal superblock.

    Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • When committing a journal transaction, we invoke jbd2_journal_abort() to
    abort the journal and record the errno in the jbd2 superblock in every
    failure case except a failure to submit the commit record. There is no
    need for this exception, so we can invoke jbd2_journal_abort() there as
    well, instead of __jbd2_journal_abort_hard().

    Fixes: 818d276ceb83a ("ext4: Add the journal checksum feature")
    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • If the seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.

    Script below generates endless output
    $ q=;while read -r r;do echo "$((++q)) $r";done
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com
    Signed-off-by: Theodore Ts'o

    Vasily Averin
     
  • __journal_remove_journal_head() is only called from
    jbd2_journal_put_journal_head() when jh->b_jcount is 0, so this assertion
    is meaningless. Just remove it.

    Signed-off-by: Shijie Luo
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200123070054.50585-1-luoshijie1@huawei.com
    Signed-off-by: Theodore Ts'o

    Shijie Luo
     
  • Fix comment and remove unnecessary blank.

    Signed-off-by: Shijie Luo
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com
    Signed-off-by: Theodore Ts'o

    Shijie Luo
     
  • Delete the duplicated word "is" in the comments

    Signed-off-by: Yan Wang
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/12087f77-ab4d-c7ba-53b4-893dbf0026f0@huawei.com
    Signed-off-by: Theodore Ts'o

    wangyan
     

18 Jan, 2020

1 commit

  • If the journal is dirty when the filesystem is mounted, jbd2 will replay
    the journal but the journal superblock will not be updated by
    journal_reset() because JBD2_ABORT flag is still set (it was set in
    journal_init_common()). This is problematic because when a new transaction
    is then committed, it will be recorded in block 1 (journal->j_tail was set
    to 1 in journal_reset()). If unclean shutdown happens again before the
    journal superblock is updated, the new recorded transaction will not be
    replayed during the next mount (because of stale sb->s_start and
    sb->s_sequence values) which can lead to filesystem corruption.

    Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail")
    Signed-off-by: Kai Li
    Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com
    Signed-off-by: Theodore Ts'o

    Kai Li
     

01 Dec, 2019

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "This merge window saw the following new features added to ext4:

    - Direct I/O via iomap (required the iomap-for-next branch from
    Darrick as a prereq).

    - Support for using dioread-nolock where the block size < page size.

    - Support for encryption for file systems where the block size < page
    size.

    - Rework of journal credits handling so a revoke-heavy workload will
    not cause the journal to run out of space.

    - Replace bit-spinlocks with spinlocks in jbd2

    Also included were some bug fixes and cleanups, mostly to clean up
    corner cases from fuzzed file systems and error path handling"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
    ext4: work around deleting a file with i_nlink == 0 safely
    ext4: add more paranoia checking in ext4_expand_extra_isize handling
    jbd2: make jbd2_handle_buffer_credits() handle reserved handles
    ext4: fix a bug in ext4_wait_for_tail_page_commit
    ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
    ext4: code cleanup for get_next_id
    ext4: fix leak of quota reservations
    ext4: remove unused variable warning in parse_options()
    ext4: Enable encryption for subpage-sized blocks
    fs/buffer.c: support fscrypt in block_read_full_page()
    ext4: Add error handling for io_end_vec struct allocation
    jbd2: Fine tune estimate of necessary descriptor blocks
    jbd2: Provide trace event for handle restarts
    ext4: Reserve revoke credits for freed blocks
    jbd2: Make credit checking more strict
    jbd2: Rename h_buffer_credits to h_total_credits
    jbd2: Reserve space for revoke descriptor blocks
    jbd2: Drop jbd2_space_needed()
    jbd2: Account descriptor blocks into t_outstanding_credits
    jbd2: Factor out common parts of stopping and restarting a handle
    ...

    Linus Torvalds
     

06 Nov, 2019

15 commits

  • Theodore Ts'o
     
  • Currently we reserve j_max_transaction_buffers / 32 for transaction
    descriptor blocks. Now that revoke descriptors are accounted for
    separately this estimate is unnecessarily high and we can actually
    compute much tighter estimate. In the common case of 32k journal blocks
    and 4k blocksize this actually reduces the amount of reserved descriptor
    blocks from 256 to ~25 which allows us to fit more real data into a
    transaction.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-25-jack@suse.cz
    Signed-off-by: Theodore Ts'o
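
    The arithmetic behind these numbers can be sketched as follows. The
    old estimate follows directly from the text: 256 reserved blocks at
    1/32 implies j_max_transaction_buffers = 8192. The tighter estimate is
    only an illustration: the function names, the 44 bytes of per-block
    overhead and the 12-byte tag size are assumptions chosen for the
    example, not values taken from this commit.

```c
#include <assert.h>

int model_div_round_up(int n, int d)
{
    return (n + d - 1) / d;
}

/* Old reservation: a flat 1/32 of the maximum transaction size. */
int model_old_descriptor_estimate(int max_transaction_buffers)
{
    return max_transaction_buffers / 32;
}

/*
 * Tighter estimate: one commit block plus enough descriptor blocks to
 * hold one tag per buffer in the transaction.
 */
int model_new_descriptor_estimate(int max_transaction_buffers, int blocksize,
                                  int block_overhead, int tag_bytes)
{
    int tags_per_block = (blocksize - block_overhead) / tag_bytes;

    return 1 + model_div_round_up(max_transaction_buffers, tags_per_block);
}
```

    With the assumed sizes, a 4k block holds 337 tags, so 8192 buffers
    need 25 descriptor blocks plus the commit block, in line with the
    "~25" quoted above.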

    Jan Kara
     
  • Provide trace event for handle restarts to ease debugging.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-24-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Make checking of available credits in jbd2_journal_dirty_metadata() more
    strict. There should be always enough credits in the handle to write all
    potential revoke descriptors. Also we warn in case there are not enough
    credits since this is a bug in the filesystem.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-22-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • The credit counter now contains both buffer and revoke descriptor block
    credits. Rename the counter to h_total_credits to reflect that. No
    functional change.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-21-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Extend functions for starting, extending, and restarting transaction
    handles to take number of revoke records handle must be able to
    accommodate. These functions then make sure transaction has enough
    credits to be able to store resulting revoke descriptor blocks. Also
    revoke code tracks number of revoke records created by a handle to catch
    situation where some place didn't reserve enough space for revoke
    records. Similarly to standard transaction credits, space for unused
    reserved revoke records is released when the handle is stopped.

    On the ext4 side we currently take a simplistic approach of reserving
    space for 1024 revoke records for any transaction. This grows amount of
    credits reserved for each handle only by a few and is enough for any
    normal workload so that we don't hit warnings in jbd2. We will refine
    the logic in following commits.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-20-jack@suse.cz
    Signed-off-by: Theodore Ts'o
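
    To see why reserving space for 1024 revoke records grows a handle's
    credits "only by a few", here is a rough calculation. It is a sketch
    under assumptions: the helper name, the 16-byte revoke block header
    and the 4- or 8-byte record sizes are illustrative values, not taken
    from this commit.

```c
#include <assert.h>

/* Number of revoke descriptor blocks needed to hold `records` revoke records. */
int model_revoke_descriptor_blocks(int records, int blocksize,
                                   int header_bytes, int record_bytes)
{
    int per_block = (blocksize - header_bytes) / record_bytes;

    return (records + per_block - 1) / per_block;
}
```

    With 4k blocks this gives 3 extra credits for 8-byte records and 2
    for 4-byte records.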

    Jan Kara
     
  • The function is now just a trivial wrapper returning
    journal->j_max_transaction_buffers. Drop it.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-19-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, journal descriptor blocks are not accounted in
    transaction->t_outstanding_credits and we just leave some slack
    space in the journal for them (in jbd2_log_space_left() and
    jbd2_space_needed()). This makes proper accounting (and the reservation
    we want to add) of descriptor blocks difficult, so switch to accounting
    descriptor blocks in transaction->t_outstanding_credits and just reserve
    the same amount of credits in t_outstanding_credits for journal
    descriptor blocks when creating a transaction.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-18-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • jbd2__journal_restart() has quite some code that is common with
    jbd2_journal_stop(). Factor this functionality into stop_this_handle()
    helper and use it from both functions. Note that this also drops
    t_handle_lock protection from jbd2__journal_restart() as
    jbd2_journal_stop() does the same thing without it.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-17-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When we drop the last handle from a transaction and
    journal->j_barrier_count > 0, jbd2_journal_stop() wakes up the
    journal->j_wait_transaction_locked wait queue. This looks pointless:
    waiting for outstanding handles always happens on the
    journal->j_wait_updates wait queue.
    journal->j_wait_transaction_locked is used to wait for transaction state
    changes and by start_this_handle() for waiting until
    journal->j_barrier_count drops to 0. The first case is clearly
    irrelevant here since only the jbd2 thread changes transaction state. The
    second case looks related, but jbd2_journal_unlock_updates() is
    responsible for the wakeup in this case. So just drop the wakeup.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-16-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • If a transaction is larger than journal->j_max_transaction_buffers, that
    is a bug and not a trigger for transaction commit. Also, the very next
    attempt to start a new handle will start transaction commit anyway. So
    just remove the pointless check. Arguably, we could start transaction
    commit whenever the transaction size is *close* to
    journal->j_max_transaction_buffers. This has the potential to reduce
    latency of the next jbd2_journal_start() at the cost of somewhat smaller
    transactions. However, for this to have any effect, there must not be
    anyone already waiting in jbd2_journal_start(), which means the
    metadata load for the fs is pretty light anyway, so this optimization
    is probably not worth it.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-15-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Move code in jbd2_journal_stop() around a bit. It removes some
    unnecessary code duplication and will make factoring out parts common
    with jbd2__journal_restart() easier.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-14-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • The jbd2 statistic counting the number of blocks logged in a transaction
    was wrong. It didn't count the commit block and, more importantly, it
    didn't count revoke descriptor blocks. Make sure these get properly counted.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • With 32-bit block numbers, we don't allocate the array for journal
    buffer heads large enough for the corresponding descriptor tags to fill
    the descriptor block. Thus we end up writing out half-full descriptor
    blocks to the journal, unnecessarily growing the transaction. Fix the
    logic to allocate the array large enough.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
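
The sizing problem described above comes down to which tag size divides the block size. The following sketch is an illustration, not the patch itself: the function name is hypothetical, and the 12-byte and 8-byte tag sizes are assumed values for tags carrying 64-bit and 32-bit block numbers respectively.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tag sizes, for illustration only. */
enum {
    TAG_BYTES_64BIT = 12, /* tag with a 64-bit block number (assumption) */
    TAG_BYTES_32BIT = 8   /* tag with a 32-bit block number (assumption) */
};

/*
 * Upper bound on the number of tags in one descriptor block, and hence on
 * the number of buffer heads the array must hold. Header and tail overhead
 * is ignored to keep the sketch short.
 */
static size_t max_tags_per_descriptor(size_t block_size, size_t tag_bytes)
{
    assert(tag_bytes > 0);
    return block_size / tag_bytes;
}
```

With a 4096-byte block, sizing the array by the larger tag gives 341 entries, while a block of smaller tags could describe 512 buffers. An array sized for the larger tag therefore forces a new descriptor block before the current one is full.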
     
  • jbd2_journal_next_log_block() does not look at
    transaction->t_outstanding_credits. Remove the misleading comment.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

21 Oct, 2019

1 commit

  • On PREEMPT_RT, bit-spinlocks have the same semantics as on PREEMPT_RT=n,
    i.e. they disable preemption. That means functions which are not safe to be
    called in a preempt-disabled context on RT trigger a might_sleep() assert.

    The journal head bit spinlock is mostly held for short code sequences with
    trivial RT safe functionality, except for one place:

    jbd2_journal_put_journal_head() invokes __journal_remove_journal_head()
    with the journal head bit spinlock held. __journal_remove_journal_head()
    invokes kmem_cache_free() which must not be called with preemption disabled
    on RT.

    Jan suggested reworking the removal function so the actual free happens
    outside the bit-spinlocked region.

    Split it into two parts:

    - Do the sanity checks and the buffer head detach under the lock

    - Do the actual free after dropping the lock

    There is error case handling in the free part which needs to dereference
    the b_size field of the now detached buffer head. Due to paranoia (caused
    by ignorance) the size is retrieved in the detach function and handed into
    the free function. Might be over-engineered, but better safe than sorry.

    This makes the journal head bit-spinlock usage RT compliant and also avoids
    nested locking which is not covered by lockdep.

    Suggested-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Cc: linux-ext4@vger.kernel.org
    Cc: "Theodore Ts'o"
    Cc: Jan Kara
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20190809124233.13277-8-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Thomas Gleixner
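
The detach-under-lock, free-after-unlock split described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: all names are hypothetical, a C11 atomic_flag stands in for the bit spinlock, and free() stands in for kmem_cache_free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct jhead { int refcount; };

struct buf {
    atomic_flag lock;   /* stand-in for the journal head bit spinlock */
    struct jhead *jh;   /* attached journal head, or NULL */
    size_t size;        /* stand-in for the buffer head's b_size */
};

static void buf_lock(struct buf *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ; /* spin */
}

static void buf_unlock(struct buf *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Part 1, lock held: sanity checks, detach, and capture the size. */
static struct jhead *detach_jhead(struct buf *b, size_t *size_out)
{
    struct jhead *jh = b->jh;

    assert(jh != NULL && jh->refcount == 0);
    *size_out = b->size; /* read now; the buffer is detached afterwards */
    b->jh = NULL;
    return jh;
}

/* Part 2, lock dropped: the actual free, which may not run under the lock. */
static void free_jhead(struct jhead *jh, size_t size)
{
    (void)size; /* the original uses the size only in error handling */
    free(jh);
}

static void put_jhead(struct buf *b)
{
    struct jhead *to_free = NULL;
    size_t size = 0;

    buf_lock(b);
    if (--b->jh->refcount == 0)
        to_free = detach_jhead(b, &size);
    buf_unlock(b);

    if (to_free)
        free_jhead(to_free, size);
}
```

The size is passed by value into the free step so that nothing in that step needs to touch the detached buffer, mirroring the cautious approach the commit message describes.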