12 Jun, 2018

1 commit

  • commit e2731e55884f2138a252b0a3d7b24d57e49c3c59 upstream.

    btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
    So just define that in the kernel so that we know it's been used.
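
    As a minimal sketch of what the definition makes possible, the snippet
    below tests a superblock flags word for the bit. Only the flag name and
    its value come from the commit message; the helper
    super_is_metadump_v2() is a hypothetical name used for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the commit message: bit 34 of the super flags. */
#define BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34)

/* Hypothetical helper: report whether the flags word has the bit set. */
static bool super_is_metadump_v2(uint64_t super_flags)
{
	return (super_flags & BTRFS_SUPER_FLAG_METADUMP_V2) != 0;
}
```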

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     

05 Jun, 2018

3 commits

  • commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream.

    The struct xfs_agfl v5 header was originally introduced with
    unexpected padding that caused the AGFL to operate with one less
    slot than intended. The header has since been packed, but the fix
    left an incompatibility for users who upgrade from an old kernel
    with the unpacked header to a newer kernel with the packed header
    while the AGFL happens to wrap around the end. The newer kernel
    recognizes one extra slot at the physical end of the AGFL that the
    previous kernel did not. The new kernel will eventually attempt to
    allocate a block from that slot, which contains invalid data, and
    cause a crash.

    This condition can be detected by comparing the active range of the
    AGFL to the count. While this detects a padding mismatch, it can
    also trigger false positives for unrelated flcount corruption. Since
    we cannot distinguish a size mismatch due to padding from unrelated
    corruption, we can't trust the AGFL enough to simply repopulate the
    empty slot.
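
    The comparison of the active range against the count can be sketched
    roughly as below. This is a standalone illustration, not the kernel
    code: struct agf_freelist and agfl_needs_reset() are hypothetical
    names, and the slot counts used with it are assumed example values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the AGF freelist fields: index of the first
 * and last active slot, and the number of blocks recorded on the list.
 */
struct agf_freelist {
	uint32_t first;
	uint32_t last;
	uint32_t count;
};

/*
 * Return true when the active range [first, last], which may wrap past
 * the physical end of the AGFL, does not match the recorded count.
 * agfl_size is the number of slots this kernel believes the AGFL has.
 */
static bool agfl_needs_reset(const struct agf_freelist *fl, uint32_t agfl_size)
{
	uint32_t active;

	/* Indexes or a count beyond the AGFL cannot be trusted at all. */
	if (fl->first >= agfl_size || fl->last >= agfl_size)
		return true;
	if (fl->count > agfl_size)
		return true;

	if (fl->count == 0)
		active = 0;
	else if (fl->last >= fl->first)
		active = fl->last - fl->first + 1;
	else	/* the list wraps around the physical end */
		active = agfl_size - fl->first + fl->last + 1;

	return active != fl->count;
}
```

    A list that wraps and was written by a kernel assuming one slot fewer
    shows up as a mismatch, while a list that does not wrap looks the same
    under both sizes.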

    Instead, avoid unnecessarily complex detection logic and use a
    solution that can handle any form of flcount corruption that slips
    through read verifiers: distrust the entire AGFL and reset it to an
    empty state. Any valid blocks within the AGFL are intentionally
    leaked. This requires xfs_repair to rectify (which was already
    necessary based on the state the AGFL was found in). The reset
    mitigates the side effect of the padding mismatch problem from a
    filesystem crash to a free space accounting inconsistency. The
    generic approach also means that this patch can be safely backported
    to kernels with or without a packed struct xfs_agfl.

    Check the AGF for an invalid freelist count on initial read from
    disk. If detected, set a flag on the xfs_perag to indicate that a
    reset is required before the AGFL can be used. In the first
    transaction that attempts to use a flagged AGFL, reset it to empty,
    warn the user about the inconsistency and allow the freelist fixup
    code to repopulate the AGFL with new blocks. The xfs_perag flag is
    cleared to eliminate the need for repeated checks on each block
    allocation operation.

    This allows kernels that include the packing fix commit 96f859d52bcb
    ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
    to handle older unpacked AGFL formats without a filesystem crash.

    Suggested-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Dave Chiluk
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit a78ee256c325ecfaec13cafc41b315bd4e1dd518 upstream.

    The AGFL size calculation is about to get more complex, so let's turn
    the macro into a function first and remove the macro.

    Signed-off-by: Dave Chinner
    [darrick: forward port to newer kernel, simplify the helper]
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.

    If io_destroy() gets to cancelling everything that can be cancelled and
    gets to kiocb_cancel() calling the function the driver has left in
    ->ki_cancel, it becomes vulnerable to a race with IO completion. At that
    point req is already taken off the list and aio_complete() does *NOT*
    spin until we (in free_ioctx_users()) release ->ctx_lock. As a result,
    it proceeds to kiocb_free(), freeing req just as it gets passed to
    ->ki_cancel().

    Fix is simple - remove from the list after the call of kiocb_cancel(). All
    instances of ->ki_cancel() already have to cope with being called with
    the iocb still on the list - that's what happens in io_cancel(2).

    Cc: stable@kernel.org
    Fixes: 0460fef2a921 "aio: use cancellation list lazily"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

30 May, 2018

32 commits

  • [ Upstream commit 116e5258e4115aca0c64ac0bf40ded3b353ed626 ]

    Currently, when a UDF filesystem is recorded without uid / gid (ids are
    set to -1), we assign INVALID_[UG]ID to the vfs inode unless the user
    uses the uid= and gid= mount options. In that case the filesystem cannot
    be modified in any way, as VFS refuses to modify files with invalid ids
    (even by root). This is confusing to users and not a very useful
    default, since such media mode is generally used for removable media.
    Use overflow[ug]id instead so that at least root can modify the
    filesystem.

    Reported-by: Steve Kenton
    Reviewed-by: Pali Rohár
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • [ Upstream commit 174d1232ebc84fcde8f5889d1171c9c7e74a10a7 ]

    The chunk size of allocations in __gfs2_fallocate is calculated
    incorrectly. The size can collapse, causing __gfs2_fallocate to
    allocate one block at a time, which is very inefficient. This needs
    fixing in two places:

    In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
    that there is no quota limit. This fixes callers that rely on
    ap->allowed to be set even when quotas are off.

    In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
    loop to make sure that allocation limits from one resource group won't
    spill over into another resource group.
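
    The effect of the second change can be illustrated with a toy model.
    This is not gfs2 code: passes_needed(), rg_limit and reset_each_pass
    are hypothetical, and every per-group limit is assumed to be non-zero.
    It only shows how a limit carried over from one small resource group
    collapses the chunk size for all later passes.

```c
#include <limits.h>
#include <stddef.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Count the loop passes needed to allocate `want` blocks when pass i is
 * limited by rg_limit[i % n]. With reset_each_pass == 0 the limit from
 * an earlier pass is carried into later ones (the reported problem);
 * with reset_each_pass != 0 it starts from UINT_MAX on every pass.
 */
static unsigned int passes_needed(unsigned int want,
				  const unsigned int *rg_limit, size_t n,
				  int reset_each_pass)
{
	unsigned int max_blks = UINT_MAX;
	unsigned int passes = 0;
	size_t i = 0;

	while (want > 0) {
		unsigned int chunk;

		if (reset_each_pass)
			max_blks = UINT_MAX;
		max_blks = min_uint(max_blks, rg_limit[i % n]);
		chunk = min_uint(want, max_blks);
		want -= chunk;
		passes++;
		i++;
	}
	return passes;
}
```

    With limits of 1 and 100 blocks alternating and 201 blocks wanted, the
    carried-over limit forces one block per pass, while resetting the
    limit finishes in a handful of passes.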

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     
  • [ Upstream commit bf617f7a92edc6bb2909db2bfa4576f50b280ee5 ]

    If the noextent_cache mount option is on, we never initialize the extent
    tree in the inode, but we still access it in f2fs_drop_extent_tree,
    resulting in a kernel panic as below:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
    IP: _raw_write_lock+0xc/0x30
    Call Trace:
    ? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
    f2fs_fallocate+0x5a0/0xdd0 [f2fs]
    ? common_file_perm+0x47/0xc0
    ? apparmor_file_permission+0x1a/0x20
    vfs_fallocate+0x15b/0x290
    SyS_fallocate+0x44/0x70
    do_syscall_64+0x6e/0x160
    entry_SYSCALL64_slow_path+0x25/0x25

    This patch fixes it by checking the extent cache status before it is
    used in f2fs_drop_extent_tree.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit cd36d7a17f9da68be9aa67185ba3ad7969934a19 ]

    Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
    before LBA becomes invalid again. Fix it by clearing the flag in any
    checkpoint that does not have the CP_TRIMMED reason.

    Fixes: 1f43e2ad7bff ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 17cd07ae95073c298af92c1ba14ac58ce84de33b ]

    As Jayashree Mohan reported:

    A simple workload to reproduce this would be :
    1. create foo
    2. Write (8K - 16K) // foo size = 16K now
    3. fsync()
    4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
    5. fdatasync()
    Crash now

    On recovery, we see that the file size is 4210688 and not 16K, which
    violates the semantics of keep_size flag. We have a test case to
    reproduce this using CrashMonkey on 4.15 kernel. Try this out by
    simply running :
    ./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400 -P -v
    tests/generic_468_zero.so

    The root cause is that we fail to set the KEEP_SIZE bit correctly in
    zero_range when zeroing a block that crosses EOF with
    FALLOC_FL_KEEP_SIZE. Let's fix this missing case.
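
    The size semantics the reported workload relies on can be written down
    as a small calculation. This is a sketch, not f2fs code:
    size_after_zero_range() and the SKETCH_* constants are hypothetical
    names, with values chosen to mirror FALLOC_FL_KEEP_SIZE and
    FALLOC_FL_ZERO_RANGE.

```c
#include <stdint.h>

/* Local stand-ins for the fallocate mode bits (assumed values). */
#define SKETCH_KEEP_SIZE  0x01
#define SKETCH_ZERO_RANGE 0x10

/*
 * Expected file size after zeroing [offset, offset + len): unchanged
 * when the keep-size bit is set, otherwise extended to cover the range.
 */
static uint64_t size_after_zero_range(uint64_t old_size, uint64_t offset,
				      uint64_t len, int mode)
{
	uint64_t end = offset + len;

	if (mode & SKETCH_KEEP_SIZE)
		return old_size;
	return end > old_size ? end : old_size;
}
```

    For the workload above (16K file, range 4202496 - 4210688), the
    keep-size case must still report 16384 after recovery, whereas the
    buggy behaviour matched the result without the flag.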

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 0d9366d67bcf066b028e57d09c9a86ce879bcc28 ]

    If mount is auto-probing for filesystem type, it will try various
    filesystems in order, with the MS_SILENT flag set. We get
    that flag as the silent arg to ext4_fill_super.

    If we're probing (silent==1) then don't complain about feature
    incompatibilities that are found if it looks like it's actually
    a different valid extN type - failed probes should be silent
    in this case.

    If the on-disk features are unknown even to ext4, then complain.

    Reported-by: Joakim Tjernlund
    Tested-by: Joakim Tjernlund
    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • …created with quota enabled

    [ Upstream commit 4d31778aa2fa342f5f92ca4025b293a1729161d1 ]

    When multiple pending snapshots referring to the same source subvolume
    are executed, enabled quota will cause root item corruption, where root
    items are using old bytenr (no backref in extent tree).

    This can be triggered by fstests btrfs/152.

    The cause is that when the source subvolume is still dirty, the extra
    commit (simplified transaction commit) of qgroup_account_snapshot() can
    skip dirty roots not recorded in the current transaction, leaving the
    root item of the source subvolume not updated.

    Fix it by forcing the source subvolume to be recorded in the current
    transaction before the qgroup sub-transaction commit.

    Reported-by: Justin Maggard <jmaggard@netgear.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Qu Wenruo
     
  • [ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]

    While running btrfs/011, I hit the following lockdep splat.

    This is the important bit:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]

    The percpu_counter_init call in btrfs_alloc_subvolume_writers
    uses GFP_KERNEL, which we can't do during transaction commit.

    This switches it to GFP_NOFS.

    ========================================================
    WARNING: possible irq lock inversion dependency detected
    4.12.14-kvmsmall #8 Tainted: G W
    --------------------------------------------------------
    kswapd0/50 just changed the state of lock:
    (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    but this lock took another, RECLAIM_FS-unsafe lock in the past:
    (pcpu_alloc_mutex){+.+.+.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Chain exists of:
    &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(pcpu_alloc_mutex);
                                   local_irq_disable();
                                   lock(&delayed_node->mutex);
                                   lock(&found->groups_sem);
      <Interrupt>
        lock(&delayed_node->mutex);

    *** DEADLOCK ***

    2 locks held by kswapd0/50:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x7f/0x5b0
    #1: (&type->s_umount_key#30){+++++.}, at: [] trylock_super+0x16/0x50

    the shortest dependencies between 2nd lock and 1st lock:
    -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    RECLAIM_FS-ON-W at:
    __kmalloc+0x47/0x310
    pcpu_extend_area_map+0x2b/0xc0
    pcpu_alloc+0x3ec/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    __kmem_cache_create+0x1bf/0x390
    create_cache+0xba/0x1b0
    kmem_cache_create+0x1f8/0x2b0
    ksm_init+0x6f/0x19d
    do_one_initcall+0x50/0x1b0
    kernel_init_freeable+0x201/0x289
    kernel_init+0xa/0x100
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    setup_cpu_cache+0x2f/0x1f0
    __kmem_cache_create+0x1bf/0x390
    create_boot_cache+0x8b/0xb1
    kmem_cache_init+0xa1/0x19e
    start_kernel+0x270/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    }
    ... key at: [] pcpu_alloc_mutex+0x70/0xa0
    ... acquired at:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
    transaction_kthread+0x176/0x1b0 [btrfs]
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
    ... acquired at:
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    btrfs_create_tree+0xbb/0x2a0 [btrfs]
    btrfs_create_uuid_tree+0x37/0x140 [btrfs]
    open_ctree+0x23c0/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&found->groups_sem){++++..} ops: 2134587 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    INITIAL USE at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
    ... acquired at:
    find_free_extent+0xcb4/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    __btrfs_cow_block+0x110/0x5b0 [btrfs]
    btrfs_cow_block+0xd7/0x290 [btrfs]
    btrfs_search_slot+0x1f6/0x960 [btrfs]
    btrfs_lookup_inode+0x2a/0x90 [btrfs]
    __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
    btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
    btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
    evict+0xc4/0x190
    __dentry_kill+0xbf/0x170
    dput+0x2ae/0x2f0
    SyS_rename+0x2a6/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    IN-RECLAIM_FS-W at:
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
    ... acquired at:
    __lock_acquire+0x264/0x11c0
    lock_acquire+0xbd/0x1e0
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    stack backtrace:
    CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0x78/0xb7
    print_irq_inversion_bug.part.38+0x19f/0x1aa
    check_usage_forwards+0x102/0x120
    ? ret_from_fork+0x3a/0x50
    ? check_usage_backwards+0x110/0x110
    mark_lock+0x16c/0x270
    __lock_acquire+0x264/0x11c0
    ? pagevec_lookup_entries+0x1a/0x30
    ? truncate_inode_pages_range+0x2b3/0x7f0
    lock_acquire+0xbd/0x1e0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    __mutex_lock+0x4e/0x8c0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ? mem_cgroup_shrink_node+0x2c0/0x2c0
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x3a/0x50

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • [ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]

    When logging an inode, at tree-log.c:copy_items(), if we call
    btrfs_next_leaf() at the loop which checks for the need to log holes, we
    need to make sure copy_items() returns the value 1 to its caller and
    not 0 (on success). This is because the path the caller passed was
    released and is now different from what it was before, and the caller
    expects a return value of 0 to mean both success and that the path
    has not changed, while a return value of 1 means both success and
    signals the caller that it can not reuse the path, it has to perform
    another tree search.

    Even though this case should not be triggered under normal
    circumstances, or should at least be very rare, its consequences can be
    very unpredictable (especially when replaying a log tree).

    Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]

    The extent tree of the test fs is like the following:

    BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
    item 0 key (4096 168 4096) itemoff 3944 itemsize 51
    extent refs 1 gen 1 flags 2
    tree block key (68719476736 0 0) level 1
                                     ^^^^^^^
    ref#0: tree block backref root 5

    And it's using an empty tree for fs tree, so there is no way that its
    level can be 1.

    For REAL (created by mkfs) fs tree backref with no skinny metadata, the
    result should look like:

    item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
    refs 1 gen 4 flags TREE_BLOCK
    tree block key (256 INODE_ITEM 0) level 0
                                      ^^^^^^^
    tree block backref root 5

    Fix the level to 0, so it won't break the later tree level checker.

    Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]

    If the fscache asynchronous write operation elects to discard a page that's
    pending storage to the cache because the page would be over the store limit
    then it needs to wake the page as someone may be waiting on completion of
    the write.

    The problem is that the store limit may be updated by a different
    asynchronous operation - and so may miss the write - and that the store
    limit may not even get updated until later by the netfs.

    Fix the kernel hang by making fscache_write_op() mark as written any pages
    that are over the limit.

    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • [ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]

    We should not handle migrate lockres if we are already in
    'DLM_CTXT_IN_SHUTDOWN', as that will cause the lockres to remain after
    leaving the dlm domain. Eventually other nodes will get stuck in an
    infinite loop when requesting a lock from us.

    The problem is caused by concurrent umount between nodes. Before
    receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked N1 as the
    migrate target. So N2 will continue sending lockres to N1 even though
    N1 has left the domain.

            N1                             N2 (owner)
                                           touch file

    access the file,
    and get pr lock

                                           begin leave domain and
                                           pick up N1 as new owner

    begin leave domain and
    migrate all lockres done

                                           begin migrate lockres to N1

    end leave domain, but
    the lockres left
    unexpectedly, because
    migrate task has passed

    [piaojun@huawei.com: v3]
    Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
    Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jun Piao
     
  • [ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]

    do_chunk_alloc implements a loop checking whether there is a pending
    chunk allocation and, if so, causes the caller to loop. Generally this
    loop is executed only once; however, testing with btrfs/072 on a
    single-core VM uncovered an extreme case where the system could loop
    indefinitely. This is due to a missing cond_resched in the loop, which
    doesn't give the previous chunk allocator a chance to finish its job.

    The fix is to simply add the missing cond_resched.

    Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • [ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]

    0, 1 and nodes[0] could be NULL, log_dir_items lacks such a
    check for
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]

    If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
    to bail out, otherwise @ret would be forced to be 0 after 'break;' and
    the caller won't be aware of it.
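
    The pattern can be sketched outside the kernel as below. This is an
    illustration under assumptions, not the btrfs code: next_leaf() is a
    scripted stand-in that, like btrfs_next_leaf(), returns a negative
    value on error, 0 when it advanced and 1 when there are no more
    leaves; struct leaf_script, replay_buggy() and replay_fixed() are
    hypothetical names.

```c
#include <stddef.h>

/* Scripted iterator: returns the next value from rets on each call. */
struct leaf_script {
	const int *rets;	/* values to return, in order */
	size_t pos;
};

static int next_leaf(struct leaf_script *s)
{
	return s->rets[s->pos++];
}

/* Treats an error like "no more leaves", then overwrites it with 0. */
static int replay_buggy(struct leaf_script *s)
{
	int ret;

	for (;;) {
		ret = next_leaf(s);
		if (ret)
			break;
	}
	ret = 0;
	return ret;
}

/* Bails out on errors so the caller sees them. */
static int replay_fixed(struct leaf_script *s)
{
	int ret;

	for (;;) {
		ret = next_leaf(s);
		if (ret < 0)
			return ret;
		if (ret > 0)
			break;
	}
	return 0;
}
```

    Both variants return 0 when the iteration simply ends; only the second
    one reports a failure from the iterator.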

    Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ]

    Forcing the log to disk after reading the agf is wrong, we might be
    calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.

    This can cause a deadlock when racing a fstrim with a filesystem
    shutdown.

    The deadlock has been identified due to a miscalculation bug in
    device-mapper dm-thin, which returns lack of space to its users earlier
    than the device itself really runs out of space, changing the
    device-mapper volume into an error state.

    The problem happened while filling the filesystem with a single file,
    triggering the bug in device-mapper, consequently causing an IO error
    and shutting down the filesystem.

    If such a file is removed, and fstrim is executed before XFS finishes
    the shutdown process, the fstrim process will end up holding the buffer
    lock and going to sleep on the cil wait queue.

    At this point, the shutdown process will try to wake up all the threads
    waiting on the cil wait queue, but for this, it will try to hold the
    same buffer lock already held by the fstrim, locking up the filesystem.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Carlos Maiolino
     
  • [ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]

    proc_sys_link_fill_cache() does not take currently unregistering sysctl
    tables into account, which might result in a page fault in
    sysctl_follow_link() - add a check to fix it.

    This bug has been present since v3.4.

    Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: Danilo Krummrich
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: "Luis R . Rodriguez"
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Danilo Krummrich
     
  • [ Upstream commit 471d557afed155b85da237ec46c549f443eeb5de ]

    Currently if we allocate extents beyond an inode's i_size (through the
    fallocate system call) and then fsync the file, we log the extents but
    after a power failure we replay them and then immediately drop them.
    This behaviour has existed since about 2009, commit c71bf099abdd ("Btrfs:
    Avoid orphan inodes cleanup while replaying log"), because it marks
    the inode as an orphan instead of dropping any extents beyond i_size
    before replaying logged extents, so after the log replay, and while
    the mount operation is still ongoing, we find the inode marked as an
    orphan and then perform a truncation (drop extents beyond the inode's
    i_size). Because the processing of orphan inodes is still done
    right after replaying the log and before the mount operation finishes,
    the intention of that commit does not make any sense (at least as
    of today). However reverting that behaviour is not enough, because
    we can not simply discard all extents beyond i_size and then replay
    logged extents, because we risk dropping extents beyond i_size created
    in past transactions, for example:

    add prealloc extent beyond i_size
    fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
    transaction commit
    add another prealloc extent beyond i_size
    fsync - triggers the fast fsync path
    power failure

    In that scenario, we would drop the first extent and then replay the
    second one. To fix this just make sure that all prealloc extents
    beyond i_size are logged, and if we find too many (which is far from
    a common case), fallback to a full transaction commit (like we do when
    logging regular extents in the fast fsync path).

    Trivial reproducer:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
    $ sync
    $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
    $ xfs_io -c "fsync" /mnt/foo
    (power failure)

    # mount to replay log
    $ mount /dev/sdb /mnt
    # at this point the file only has one extent, at offset 0, size 256K

    A test case for fstests follows soon, covering multiple scenarios that
    involve adding prealloc extents with previous shrinking truncates and
    without such truncates.

    Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit af7227338135d2f1b1552bf9a6d43e02dcba10b9 ]

    Currently if some fatal errors occur, such as all IO getting -EIO, resources
    would be cleaned up when
    a) transaction is being committed or
    b) BTRFS_FS_STATE_ERROR is set

    However, in some rare cases, resources may be left alone after transaction
    gets aborted and umount may run into some ASSERT(), e.g.
    ASSERT(list_empty(&block_group->dirty_list));

    For case a), in btrfs_commit_transaction(), there are several places at the
    beginning where we just call btrfs_end_transaction() without cleaning up
    resources. For case b), it is possible that the trans handle doesn't have
    any dirty stuff, then only the trans handle is marked as aborted while
    BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.

    This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
    all resources won't stay in memory after umount.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
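
    A minimal userspace sketch of the check described in the commit above:
    cleanup is needed when either state bit is set, not only the error bit.
    The bit positions and the helper name needs_cleanup() are assumptions
    made for illustration; the real values live in the btrfs headers.

```c
/* Hypothetical bit positions, for illustration only. */
enum {
    FS_STATE_ERROR = 0,
    FS_STATE_TRANS_ABORTED = 1,
};

/* Return nonzero when leftover transaction resources must be cleaned up:
 * either the filesystem-wide error bit or the aborted bit is set. */
static int needs_cleanup(unsigned long fs_state)
{
    unsigned long mask = (1UL << FS_STATE_ERROR) |
                         (1UL << FS_STATE_TRANS_ABORTED);

    return (fs_state & mask) != 0;
}
```

    Checking only FS_STATE_ERROR would miss the case b) situation, where
    the handle is aborted but the error bit was never set.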
     
  • [ Upstream commit 1c789249578895bb14ab62b4327306439b754857 ]

    There is no cache destroy operation for ceph_file_cachep
    when fscache registration fails.

    Signed-off-by: Chengguang Xu
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
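
    The pattern behind this fix can be sketched in userspace C as a
    goto-based unwind. Everything here is a stand-in: struct cache,
    cache_create(), cache_destroy(), fscache_register() and the
    fail_fscache_register knob are hypothetical, not the ceph or slab API.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a slab cache and the fscache registration. */
struct cache { int unused; };

static struct cache *file_cache;
static bool fail_fscache_register;   /* test knob, not a kernel symbol */

static struct cache *cache_create(void) { return calloc(1, sizeof(struct cache)); }
static void cache_destroy(struct cache *c) { free(c); }
static int fscache_register(void) { return fail_fscache_register ? -1 : 0; }

/* Create the cache, then register; undo the creation if registration fails. */
static int init_caches(void)
{
    int err;

    file_cache = cache_create();
    if (!file_cache)
        return -1;

    err = fscache_register();
    if (err)
        goto bad_register;
    return 0;

bad_register:
    cache_destroy(file_cache);   /* the step the commit says was missing */
    file_cache = NULL;
    return err;
}
```

    Without the cache_destroy() call on the error path, the cache created
    earlier in init_caches() would be leaked whenever registration fails.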
     
  • [ Upstream commit 9a6509c4daa91400b52a5fd541a5521c649a8fea ]

    If in the same transaction we rename a special file (fifo, character/block
    device or symbolic link), create a hard link for it having its old name
    then sync the log, we will end up with a log that can not be replayed and
    when attempting to replay it, an EEXIST error is returned and mounting
    the filesystem fails. Example scenario:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt
    $ mkdir /mnt/testdir
    $ mkfifo /mnt/testdir/foo
    # Make sure everything done so far is durably persisted.
    $ sync

    # Create some unrelated file and fsync it, this is just to create a log
    # tree. The file must be in the same directory as our special file.
    $ touch /mnt/testdir/f1
    $ xfs_io -c "fsync" /mnt/testdir/f1

    # Rename our special file and then create a hard link with its old name.
    $ mv /mnt/testdir/foo /mnt/testdir/bar
    $ ln /mnt/testdir/bar /mnt/testdir/foo

    # Create some other unrelated file and fsync it, this is just to persist
    # the log tree which was modified by the previous rename and link
    # operations. Alternatively we could have modified file f1 and fsync it.
    $ touch /mnt/f2
    $ xfs_io -c "fsync" /mnt/f2
    (power failure)

    $ mount /dev/sdc /mnt
    mount: mount /dev/sdc on /mnt failed: File exists

    This happens because both the log tree and the subvolume's tree have
    an entry in the directory "testdir" with the same name, that is, there
    is one key (258 INODE_REF 257) in the subvolume tree and another one in
    the log tree (where 258 is the inode number of our special file and 257
    is the inode for directory "testdir"). Only the data of those two keys
    differs: in the subvolume tree the index field for the inode reference has
    a value of 3 while in the log tree it has a value of 5. Because the same key
    exists in both trees, but with a different index, the log replay fails with
    an -EEXIST error when attempting to replay the inode reference from the
    log tree.

    Fix this by setting the last_unlink_trans field of the inode (our special
    file) to the current transaction id when a hard link is created, as this
    forces logging the parent directory inode, solving the conflict at log
    replay time.

    A new generic test case for fstests was also submitted.

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]

    When doing an incremental send of a filesystem with the no-holes feature
    enabled, we end up issuing a write operation when using the no data mode
    send flag, instead of issuing an update extent operation. Fix this by
    issuing the update extent operation instead.

    Trivial reproducer:

    $ mkfs.btrfs -f -O no-holes /dev/sdc
    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdc /mnt/sdc
    $ mount /dev/sdd /mnt/sdd

    $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

    $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

    $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
    $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
    | btrfs receive -vv /mnt/sdd

    Before this change the output of the second receive command is:

    receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
    utimes
    write foobar, offset 8192, len 8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

    After this change it is:

    receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
    utimes
    update_extent foobar: offset=8192, len=8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit a8fd1f71749387c9a1053a83ff1c16287499a4e7 ]

    The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
    kernels built with NR_CPUS=8192, this can result in kmalloc failures
    that prevent mounting.

    There is work in progress to try to resolve this for every user of
    srcu_struct but using kvzalloc will work around the failures until
    that is complete.

    As an example with NR_CPUS=512 on x86_64: the overall size of
    subvol_srcu is 3460 bytes, fs_info is 6496.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • [ Upstream commit 18106734b512664a8541026519ce4b862498b6c3 ]

    When ceph_fs_debugfs_init() fails in ceph_real_mount(),
    the dput of root_dentry is missing, which causes slab errors,
    so change the calling order of ceph_fs_debugfs_init() and
    open_root_dentry() and do some cleanups to avoid this issue.

    Signed-off-by: Chengguang Xu
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • [ Upstream commit 937441f3a3158d5510ca8cc78a82453f57a96365 ]

    When parsing a string option, to avoid a memory leak we need to
    free the previous value first in case the same option is specified several times.

    Signed-off-by: Chengguang Xu
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
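
    A small sketch of the idea in plain C, assuming a hypothetical
    struct mount_opts with a single string field and a hypothetical helper
    set_string_opt(); the real ceph option structures and parser differ.
    Unlike the commit text, this variant allocates the copy before freeing
    the old value, so a failed allocation keeps the earlier setting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical options struct, for illustration only. */
struct mount_opts {
    char *name;
};

/* Store a copy of value in *slot, releasing any earlier value so that
 * passing the same option twice does not leak the first copy. */
static int set_string_opt(char **slot, const char *value)
{
    size_t len = strlen(value) + 1;
    char *copy = malloc(len);

    if (!copy)
        return -1;
    memcpy(copy, value, len);
    free(*slot);      /* free(NULL) is a no-op, so the first call is fine */
    *slot = copy;
    return 0;
}
```

    The slot must start out as NULL (or a heap pointer) for the free()
    call to be valid.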
     
  • [ Upstream commit 8cc07c808c9d595e81cbe5aad419b7769eb2e5c9 ]

    i_dir_seq is subject to concurrent modification by a cmpxchg or
    store-release operation, so ensure that the relaxed access in
    d_alloc_parallel uses READ_ONCE.

    Reported-by: Peter Zijlstra
    Signed-off-by: Will Deacon
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • [ Upstream commit 015555fd4d2930bc0c86952c46ad88b3392f66e4 ]

    If d_alloc_parallel runs concurrently with __d_add, it is possible for
    d_alloc_parallel to continuously retry whilst i_dir_seq has been
    incremented to an odd value by __d_add:

    CPU0:
      __d_add
        n = start_dir_add(dir);
          cmpxchg(&dir->i_dir_seq, n, n + 1) == n

    CPU1:
      d_alloc_parallel
       retry:
        seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
        hlist_bl_lock(b);
          bit_spin_lock(0, (unsigned long *)b); // Always succeeds

    CPU0:
      __d_lookup_done(dentry)
        hlist_bl_lock
          bit_spin_lock(0, (unsigned long *)b); // Never succeeds

    CPU1:
      if (unlikely(parent->d_inode->i_dir_seq != seq)) {
        hlist_bl_unlock(b);
        goto retry;
      }

    Since the simple bit_spin_lock used to implement hlist_bl_lock does not
    provide any fairness guarantees, CPU1 can starve CPU0 of the lock
    and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot
    exit its retry loop because the sequence number always has the bottom
    bit set.

    This patch resolves the livelock by not taking hlist_bl_lock in
    d_alloc_parallel if the sequence counter is odd, since any subsequent
    masked comparison with i_dir_seq will fail anyway.

    Cc: Peter Zijlstra
    Cc: Al Viro
    Reported-by: Naresh Madhusudana
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Will Deacon
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
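
    The shape of the fix can be sketched with C11 atomics in userspace.
    This is only an illustration under stated assumptions: dir_seq stands
    in for i_dir_seq, an atomic_flag spinlock stands in for hlist_bl_lock,
    and reader_try_once() is a hypothetical name for one pass through the
    retry loop, not the kernel function.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for i_dir_seq and the hash-bucket bit lock. */
static atomic_uint dir_seq;
static atomic_flag bucket_lock = ATOMIC_FLAG_INIT;

static void bucket_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&bucket_lock, memory_order_acquire))
        ;   /* spin; like bit_spin_lock, no fairness guarantee */
}

static void bucket_lock_release(void)
{
    atomic_flag_clear_explicit(&bucket_lock, memory_order_release);
}

/* One attempt of the reader. Returns true with the bucket lock held when
 * the sequence was even and unchanged; returns false (lock not held) when
 * the caller should retry. */
static bool reader_try_once(void)
{
    unsigned int seq = atomic_load_explicit(&dir_seq, memory_order_acquire);

    if (seq & 1)
        return false;   /* writer in progress: do not touch the lock */

    bucket_lock_acquire();
    if (atomic_load_explicit(&dir_seq, memory_order_relaxed) != seq) {
        bucket_lock_release();
        return false;
    }
    return true;
}
```

    Because the reader backs off before taking the lock while the counter
    is odd, the writer is never starved of the lock it needs to finish.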
     
  • [ Upstream commit ad86f605c59500da82d196ac312cfbac3daba31d ]

    nfs4_update_server unconditionally releases the nfs_client for the
    source server. If migration fails, this can cause the source server's
    nfs_client struct to be left with a low reference count, resulting in
    use-after-free. Also, adjust reference count handling for ELOOP.

    NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
    WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
    nfs_put_client+0xfa/0x110 [nfs]
    nfs4_run_state_manager+0x30/0x40 [nfsv4]
    kthread+0xd8/0xf0

    BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
    nfs4_xdr_enc_write+0x6b/0x160 [nfsv4]
    rpcauth_wrap_req+0xac/0xf0 [sunrpc]
    call_transmit+0x18c/0x2c0 [sunrpc]
    __rpc_execute+0xa6/0x490 [sunrpc]
    rpc_async_schedule+0x15/0x20 [sunrpc]
    process_one_work+0x160/0x470
    worker_thread+0x112/0x540
    ? rescuer_thread+0x3f0/0x3f0
    kthread+0xd8/0xf0

    This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"),
    but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops")

    Reported-by: Helen Chao
    Fixes: 52442f9b11b7 ("NFS4: Avoid migration loops")
    Signed-off-by: Bill Baker
    Reviewed-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bill.Baker@oracle.com
     
  • commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

    For anything NFS-exported we do _not_ want to unlock new inode
    before it has grown an alias; original set of fixes got the
    ordering right, but missed the nasty complication in case of
    lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
    which can only be done before anyone gets a chance to touch
    ->i_mutex. Unfortunately, flipping the order and doing
    unlock_new_inode() before d_instantiate() opens a window when
    mkdir can race with open-by-fhandle on a guessed fhandle, leading
    to multiple aliases for a directory inode and all the breakage
    that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
    combining these two in the right order - lockdep annotate, then
    d_instantiate(), then the rest of unlock_new_inode(). All
    combinations of d_instantiate() with unlock_new_inode() should
    be converted to that.

    Cc: stable@kernel.org # 2.6.29 and later
    Tested-by: Mike Marshall
    Reviewed-by: Andreas Dilger
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream.

    kill_ioctx() used to have an explicit RCU delay between removing the
    reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
    At some point that delay had been removed, on the theory that
    percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
    the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
    by lookup_ioctx(). As the result, we could get ctx freed right under
    lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
    RCU grace period when freeing kioctx"); however, that fix is not enough.

    Suppose io_destroy() from one thread races with e.g. io_setup() from another;
    CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
    has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
    refcount, getting it to 0 and triggering a call of free_ioctx_users(),
    which proceeds to drop the secondary refcount and once that reaches zero
    calls free_ioctx_reqs(). That does
    INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
    queue_rcu_work(system_wq, &ctx->free_rwork);
    and schedules freeing the whole thing after RCU delay.

    In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
    refcount from 0 to 1 and returned the reference to io_setup().

    Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
    freed until after percpu_ref_get(). Sure, we'd increment the counter before
    ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
    stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
    has grabbed the reference, ctx is *NOT* going away until it gets around to
    dropping that reference.

    The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
    It's not costlier than what we currently do in normal case, it's safe to
    call since freeing *is* delayed and it closes the race window - either
    lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
    won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
    fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
    the object in question at all.

    Cc: stable@kernel.org
    Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
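
    To illustrate why a "tryget" closes the window, here is a simplified
    userspace sketch with C11 atomics. It is not the percpu_ref
    implementation: struct ctx, its dead flag, ctx_tryget_live() and
    lookup_ctx() are hypothetical stand-ins, and a plain atomic counter
    replaces the per-CPU reference count.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified context object. */
struct ctx {
    atomic_uint users;
    atomic_bool dead;   /* set once teardown has started */
};

/* Take a reference only if teardown has not started and the count is
 * still nonzero; never resurrect a count that already hit zero. */
static bool ctx_tryget_live(struct ctx *c)
{
    unsigned int old = atomic_load(&c->users);

    if (atomic_load(&c->dead))
        return false;
    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&c->users, &old, old + 1))
            return true;
    }
    return false;
}

/* Lookup treats a failed tryget as a miss. */
static struct ctx *lookup_ctx(struct ctx *slot)
{
    if (slot && ctx_tryget_live(slot))
        return slot;
    return NULL;
}
```

    An unconditional increment would bump the count from 0 to 1 and hand
    out an object that is already scheduled to be freed, which is the race
    the commit describes.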
     
  • commit 79f546a696bff2590169fb5684e23d65f4d9f591 upstream.

    We recently had an oops reported on a 4.14 kernel in
    xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
    and so the m_perag_tree lookup walked into lala land. It produces
    an oops down this path during the failed mount:

    radix_tree_gang_lookup_tag+0xc4/0x130
    xfs_perag_get_tag+0x37/0xf0
    xfs_reclaim_inodes_count+0x32/0x40
    xfs_fs_nr_cached_objects+0x11/0x20
    super_cache_count+0x35/0xc0
    shrink_slab.part.66+0xb1/0x370
    shrink_node+0x7e/0x1a0
    try_to_free_pages+0x199/0x470
    __alloc_pages_slowpath+0x3a1/0xd20
    __alloc_pages_nodemask+0x1c3/0x200
    cache_grow_begin+0x20b/0x2e0
    fallback_alloc+0x160/0x200
    kmem_cache_alloc+0x111/0x4e0

    The problem is that the superblock shrinker is running before the
    filesystem structures it depends on have been fully set up. i.e.
    the shrinker is registered in sget(), before ->fill_super() has been
    called, and the shrinker can call into the filesystem before
    fill_super() does its setup work. Essentially we are exposed to
    both use-after-free and use-before-initialisation bugs here.

    To fix this, add a check for the SB_BORN flag in super_cache_count.
    In general, this flag is not set until ->fs_mount() completes
    successfully, so we know that it is set after the filesystem
    setup has completed. This matches the trylock_super() behaviour
    which will not let super_cache_scan() run if SB_BORN is not set, and
    hence will not allow the superblock shrinker to enter the
    filesystem while it is being set up or after it has failed setup
    and is being torn down.

    Cc: stable@kernel.org
    Signed-Off-By: Dave Chinner
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
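
    A rough userspace sketch of the guard, assuming a hypothetical
    struct sketch_super, a made-up flag value SKETCH_SB_BORN and a sample
    hook; it does not model the memory barriers or locking of the real
    super_cache_count().

```c
/* Hypothetical flag value and struct, for illustration only. */
#define SKETCH_SB_BORN (1UL << 0)

struct sketch_super {
    unsigned long flags;
    long (*nr_cached_objects)(struct sketch_super *sb);
    void *fs_info;   /* only valid once the filesystem is set up */
};

/* Sample filesystem hook: dereferences fs_info, so it must not run
 * before setup has completed. */
static long sample_count(struct sketch_super *sb)
{
    return *(long *)sb->fs_info;
}

/* Count callback: report nothing until the superblock is marked born, so
 * the filesystem hook never runs against half-initialised state. */
static long sketch_cache_count(struct sketch_super *sb)
{
    if (!(sb->flags & SKETCH_SB_BORN))
        return 0;
    return sb->nr_cached_objects ? sb->nr_cached_objects(sb) : 0;
}
```

    Returning 0 before the flag is set tells the shrinker there is nothing
    to scan, so the hook that would walk uninitialised data is skipped.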
     
  • commit 30da870ce4a4e007c901858a96e9e394a1daa74a upstream.

    we unlock the directory hash too early - if we are looking at secondary
    link and primary (in another directory) gets removed just as we unlock,
    we could have the old primary moved in place of the secondary, leaving
    us to look into freed entry (and leaving our dentry with ->d_fsdata
    pointing to a freed entry).

    Cc: stable@vger.kernel.org # 2.4.4+
    Acked-by: David Sterba
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

25 May, 2018

2 commits

  • commit 66072c29328717072fd84aaff3e070e3f008ba77 upstream.

    syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
    is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().

    As far as I can see, it is hfsplus_mark_mdb_dirty() from
    hfsplus_new_inode() in hfsplus_fill_super() that calls
    queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
    not fail if queue_delayed_work() was called, and the out_put_hidden_dir
    label is the appropriate location to call cancel_delayed_work_sync().

    [1] https://syzkaller.appspot.com/bug?id=a66f45e96fdbeb76b796bf46eb25ea878c42a6c9

    Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Cc: Al Viro
    Cc: David Howells
    Cc: Ernesto A. Fernandez
    Cc: Vyacheslav Dubeyko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     
  • commit 5aa1437d2d9a068c0334bd7c9dafa8ec4f97f13b upstream.

    open file, unlink it, then use ioctl(2) to make it immutable or
    append only. Now close it and watch the blocks *not* freed...

    Immutable/append-only checks belong in ->setattr().
    Note: the bug is old and backport to anything prior to 737f2e93b972
    ("ext2: convert to use the new truncate convention") will need
    these checks lifted into ext2_setattr().

    Cc: stable@kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

23 May, 2018

2 commits

  • commit e96f46ee8587607a828f783daa6eb5b44d25004d upstream

    The style for the 'status' file is CamelCase or words separated by underscores (_).

    Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Konrad Rzeszutek Wilk
     
  • commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee upstream

    For certain use cases it is desired to enforce mitigations so they cannot
    be undone afterwards. That's important for loader stubs which want to
    prevent a child from disabling the mitigation again. Will also be used for
    seccomp(). The extra state preserving of the prctl state for SSB is a
    preparatory step for EBPF dynamic speculation control.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
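
    From userspace, a loader stub might request the forced mitigation
    roughly as follows. This is a sketch, assuming a Linux system whose
    kernel carries the speculation-control prctl; the fallback #defines
    mirror the values in the kernel's uapi prctl.h for older userspace
    headers, and force_disable_ssb() is a hypothetical helper name.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. */
#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_FORCE_DISABLE
#define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif

/* Ask the kernel to disable speculative store bypass for this task in a
 * way that cannot be re-enabled later. Returns 0 on success, -1 on error
 * (for example on kernels or CPUs without this control). */
static int force_disable_ssb(void)
{
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
              PR_SPEC_FORCE_DISABLE, 0, 0) != 0) {
        fprintf(stderr, "prctl: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

    Here "disable" refers to the speculation feature, so a successful call
    means the mitigation is on and stays on for the task.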