20 Mar, 2018

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

    Conflicts:
    block/blk-mq.c
    include/linux/blk_types.h
    include/linux/blkdev.h

    Christoph Hellwig
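
The single-value encoding described in this commit can be illustrated with a small sketch. It assumes a hypothetical layout (operation in the low 8 bits, flags above them); the `EX_*` and `ex_*` names are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits hold the operation, the rest hold flags. */
#define EX_OP_BITS  8
#define EX_OP_MASK  ((UINT32_C(1) << EX_OP_BITS) - 1)
#define EX_FLAG(n)  (UINT32_C(1) << (EX_OP_BITS + (n)))

enum ex_op { EX_OP_READ = 0, EX_OP_WRITE = 1, EX_OP_FLUSH = 2 };

#define EX_SYNC EX_FLAG(0)
#define EX_META EX_FLAG(1)

/* Combine an operation and its modifier flags into one 32-bit value. */
static inline uint32_t ex_pack(enum ex_op op, uint32_t flags)
{
	assert((flags & EX_OP_MASK) == 0); /* flags must not spill into the op bits */
	return (uint32_t)op | flags;
}

/* The operation sits in the low bits, so no shifting is needed to read it. */
static inline enum ex_op ex_op_of(uint32_t v)
{
	return (enum ex_op)(v & EX_OP_MASK);
}

static inline uint32_t ex_flags_of(uint32_t v)
{
	return v & ~EX_OP_MASK;
}
```

Callers pass one `uint32_t` around and mask out whichever part they need.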
     

18 Mar, 2018

3 commits

  • commit c4f24df942a181699c5bab01b8e5e82b925f77f3 upstream.

    We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
    ensure that all outstanding COMMIT requests to the inode in question are
    complete. Currently we may exit early from both nfs_commit_inode() and
    nfs_write_inode() even if there are COMMIT requests in flight, or unstable
    writes on the commit list.

    In order to get the right semantics w.r.t. sync_inode(), we don't need
    to have nfs_commit_inode() reset the inode dirty flags when called from
    nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
    nfs_write_inode() leaves them in the right state if there are outstanding
    commits, or stable pages.

    Reported-by: Scott Mayhew
    Fixes: dc4fd9ab01ab ("nfs: don't wait on commit in nfs_commit_inode()...")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit d9ee65539d3eabd9ade46cca1780e3309ad0f907 upstream.

    The start offset needs to be of type loff_t.

    Fixes: 5fadeb47dcc5c ("nfs: count DIO good bytes correctly with mirroring")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
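
Why the width of a start offset matters can be shown with a short sketch. It assumes the narrower field is 32 bits wide, as it would be on a 32-bit build; `ex_loff_t` and the helper names are hypothetical stand-ins, not the NFS code.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ex_loff_t; /* stand-in for the kernel's 64-bit loff_t */

/* Keep the start offset in a 32-bit field: the high bits are dropped. */
static inline uint32_t ex_store_narrow(ex_loff_t start)
{
	return (uint32_t)start;
}

/* Keep the start offset in a 64-bit field: nothing is lost. */
static inline ex_loff_t ex_store_wide(ex_loff_t start)
{
	return start;
}

/* Bytes completed so far, given where the I/O started and where it has reached. */
static inline ex_loff_t ex_good_bytes(ex_loff_t reached, ex_loff_t io_start)
{
	assert(reached >= io_start);
	return reached - io_start;
}
```

With a start offset of 5 GiB, the narrow field keeps only 1 GiB, so any byte count derived from it is off by 4 GiB.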
     
  • commit ec00022030da5761518476096626338bd67df57a upstream.

    When an xattr block has a single reference, block is updated inplace
    and it is reinserted to the cache. Later, a cache lookup is performed
    to see whether an existing block has the same contents. This cache
    lookup will most of the time return the just inserted entry so
    deduplication is not achieved.

    Running the following test script will produce two xattr blocks which
    can be observed in "File ACL: " line of debugfs output:

    mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
    mount /dev/sdb /mnt/sdb

    touch /mnt/sdb/{x,y}

    setfattr -n user.1 -v aaa /mnt/sdb/x
    setfattr -n user.2 -v bbb /mnt/sdb/x

    setfattr -n user.1 -v aaa /mnt/sdb/y
    setfattr -n user.2 -v bbb /mnt/sdb/y

    debugfs -R 'stat x' /dev/sdb | cat
    debugfs -R 'stat y' /dev/sdb | cat

    This patch defers the reinsertion to the cache so that we can locate
    other blocks with the same contents.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Andreas Dilger
    Signed-off-by: Tommi Rantala
    Signed-off-by: Greg Kroah-Hartman

    Tahsin Erdogan
     

11 Mar, 2018

1 commit

  • commit d7d824966530acfe32b94d1ed672e6fe1638cd68 upstream.

    When changing a file's acl mask, btrfs_set_acl() will first set the
    group bits of i_mode to the value of the mask, and only then set the
    actual extended attribute representing the new acl.

    If the second part fails (due to lack of space, for example) and the
    file had no acl attribute to begin with, the system will from now on
    assume that the mask permission bits are actual group permission bits,
    potentially granting access to the wrong users.

    Prevent this by restoring the original mode bits if __btrfs_set_acl
    fails.

    Signed-off-by: Ernesto A. Fernández
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Greg Kroah-Hartman

    Ernesto A. Fernández
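
The restore-on-failure pattern this commit describes might look roughly like the sketch below. The `ex_inode` structure and both functions are hypothetical, and the `fail` parameter only simulates the attribute store running out of space.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_inode {
	unsigned int mode;   /* permission bits */
	bool has_acl_xattr;  /* whether an ACL attribute is stored */
};

/* Hypothetical attribute store; 'fail' simulates a lack of space. */
static int ex_store_acl_xattr(struct ex_inode *inode, bool fail)
{
	if (fail)
		return -ENOSPC;
	inode->has_acl_xattr = true;
	return 0;
}

/* Update the mode first, then the attribute; undo the mode if the second step fails. */
static int ex_set_acl(struct ex_inode *inode, unsigned int new_mode, bool fail)
{
	unsigned int old_mode;
	int ret;

	assert(inode != NULL);
	old_mode = inode->mode;
	inode->mode = new_mode;
	ret = ex_store_acl_xattr(inode, fail);
	if (ret)
		inode->mode = old_mode; /* never leave mask bits posing as group bits */
	return ret;
}
```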
     

03 Mar, 2018

4 commits

  • [ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ]

    xfs_qm_init_quotainfo() does not check result of register_shrinker()
    which was tagged as __must_check recently, reported by sparse.

    Signed-off-by: Aliaksei Karaliou
    [darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Aliaksei Karaliou
     
  • [ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ]

    xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock,
    although it does destroy quotainfo->qi_quotaofflock.

    Signed-off-by: Aliaksei Karaliou
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Aliaksei Karaliou
     
  • [ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ]

    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit dad48e73127ba10279ea33e6dbc8d3905c4d31c0 upstream.

    Thread A: Thread B:

    -f2fs_remount
    -sbi->mount_opt.opt = 0;
    extent_tree is NULL
    -default_options && parse_options
    -remount return

    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Greg Kroah-Hartman

    Yunlei He
     

28 Feb, 2018

1 commit

  • commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.

    dax_writeback_mapping_range() fails to update iteration index when
    searching radix tree for entries needing cache flushing. Thus each
    pagevec worth of entries is searched starting from the start which is
    inefficient and prone to livelocks. Update index properly.

    Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
    Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
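
The batched-scan pattern behind this fix can be sketched in plain C. A sorted array stands in for the radix tree and `EX_BATCH` for a pagevec; all names are hypothetical. The point is the last statement of the loop, which moves the search index past the entries already handled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define EX_BATCH 4 /* hypothetical batch size, standing in for a pagevec */

/* Hypothetical lookup: copy up to 'max' keys >= start from a sorted array. */
static size_t ex_lookup(const unsigned long *keys, size_t nkeys,
			unsigned long start, unsigned long *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < nkeys && n < max; i++)
		if (keys[i] >= start)
			out[n++] = keys[i];
	return n;
}

/* Count entries in [start, end], fetching them one batch at a time. */
static size_t ex_count_range(const unsigned long *keys, size_t nkeys,
			     unsigned long start, unsigned long end)
{
	unsigned long batch[EX_BATCH];
	unsigned long index = start;
	size_t seen = 0;

	assert(start <= end);
	for (;;) {
		size_t n = ex_lookup(keys, nkeys, index, batch, EX_BATCH);

		if (n == 0)
			break;
		for (size_t i = 0; i < n; i++) {
			if (batch[i] > end)
				return seen;
			seen++;
		}
		if (batch[n - 1] == ULONG_MAX)
			break;            /* nothing can follow the last index */
		index = batch[n - 1] + 1; /* resume after the last entry seen */
	}
	return seen;
}
```

Without the final `index` update, every iteration would fetch the same first batch again, which is the inefficiency and livelock risk the commit message mentions.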
     

25 Feb, 2018

3 commits

  • When CONFIG_ELF_CORE is disabled, we get a harmless warning in the compat
    version of binfmt_elf:

    fs/compat_binfmt_elf.c:58:13: error: 'cputime_to_compat_timeval' defined but not used [-Werror=unused-function]

    This was addressed in mainline Linux as part of a larger rework with commit
    cd19c364b313 ("fs/binfmt: Convert obsolete cputime type to nsecs").

    For 4.9 and earlier, this just shuts up the warning by adding an #ifdef
    around the function definition.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
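
A minimal sketch of the #ifdef approach follows. `EX_CONFIG_CORE` is a hypothetical option standing in for CONFIG_ELF_CORE, and the helper is invented for illustration; it is not the kernel's cputime_to_compat_timeval().

```c
#include <assert.h>

/* EX_CONFIG_CORE is hypothetical; leave it undefined to mimic a disabled option. */
#ifdef EX_CONFIG_CORE
/* Only the core-dump path calls this helper, so it is built only with that path. */
static long ex_usec_to_sec(long usec)
{
	return usec / 1000000L;
}
#endif

static long ex_core_dump_seconds(long usec)
{
	assert(usec >= 0);
#ifdef EX_CONFIG_CORE
	return ex_usec_to_sec(usec);
#else
	return 0; /* feature compiled out, helper not defined at all */
#endif
}
```

Because the helper and its only caller sit under the same condition, the compiler never sees a static function that is defined but unused.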
     
  • commit ab4949640d6674b617b314ad3c2c00353304bab9 upstream.

    The latest gcc-7.0.1 snapshot warns about an uninitialized variable use:

    In file included from fs/reiserfs/lbalance.c:8:0:
    fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
    fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
    ~~^~~
    fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);

    This happens because the offset/type pair that is stored in
    ih.key.u.k_offset_v2 is actually uninitialized when we call
    set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
    all data is correct, but the first of the two reads uninitialized data
    for the type field and writes it back before it gets overwritten.

    This works around the warning by initializing the k_offset_v2 through
    the slightly larger memcpy().

    [JK: Remove now unused define and make it obvious we initialize the key]

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
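
The read-modify-write hazard described here can be reproduced in a few lines. The sketch assumes a 64-bit word with a 4-bit type above a 60-bit offset and ignores the little-endian conversion; `ex_key` and its helpers are hypothetical, not the reiserfs definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_TYPE_SHIFT 60
#define EX_TYPE_MASK  (UINT64_C(15) << EX_TYPE_SHIFT)

struct ex_key { uint64_t v; }; /* top 4 bits: type, low 60 bits: offset */

/* Each setter preserves the other field, so it reads k->v before writing it. */
static inline void ex_set_offset(struct ex_key *k, uint64_t offset)
{
	k->v = (k->v & EX_TYPE_MASK) | (offset & ~EX_TYPE_MASK);
}

static inline void ex_set_type(struct ex_key *k, unsigned int type)
{
	k->v = (k->v & ~EX_TYPE_MASK) | ((uint64_t)(type & 15u) << EX_TYPE_SHIFT);
}

/* Copy the whole source key first so the setters never read indeterminate bits. */
static inline void ex_make_key(struct ex_key *dst, const struct ex_key *src,
			       uint64_t offset, unsigned int type)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*dst));
	ex_set_offset(dst, offset);
	ex_set_type(dst, type);
}
```

Calling `ex_set_offset()` on a key that was never initialized would read garbage for the type bits, which is what the compiler warning points at.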
     
  • [ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]

    The name char array passed to btrfs_search_path_in_tree is of size
    BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
    are in the range of [0, 4079]. Currently the code uses the define but this
    represents an off-by-one.

    Implications:

    Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
    written to extra space, not some padding that could be provided by the
    allocator.

    btrfs-progs store the arguments on stack, but kernel does own copy of
    the ioctl buffer and the off-by-one overwrite does not affect userspace,
    but the ending 0 might be lost.

    Kernel ioctl buffer is allocated dynamically so we're overwriting
    somebody else's memory, and the ioctl is privileged if args.objectid is
    not 256. Which is in most cases, but resolving a subvolume stored in
    another directory will trigger that path.

    Before this patch the buffer was one byte larger, but then the -1 was
    not added.

    Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ added implications ]
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
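
The off-by-one can be seen in a scaled-down sketch. A 16-byte buffer stands in for the 4080-byte one, and the path is built right to left from the end of the buffer; `ex_build_path()` and `EX_NAME_MAX` are hypothetical names, and the exact shape of the kernel function is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_NAME_MAX 16 /* hypothetical stand-in for the 4080-byte buffer */

/*
 * Prepend "component/" pieces, leaf first, working backwards from the end
 * of name[]. Returns the start of the string, or NULL if it does not fit.
 */
static char *ex_build_path(char name[EX_NAME_MAX],
			   const char *const *comps, size_t n)
{
	char *ptr = &name[EX_NAME_MAX - 1]; /* last valid index, not name[EX_NAME_MAX] */

	assert(name != NULL);
	*ptr = '\0';
	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(comps[i]);

		if ((size_t)(ptr - name) < len + 1)
			return NULL; /* not enough room left in front of ptr */
		ptr -= len + 1;
		memcpy(ptr, comps[i], len);
		ptr[len] = '/';
	}
	return ptr;
}
```

Starting at `name[EX_NAME_MAX]` instead would place the terminator one byte past the array, which is the overwrite the commit message analyses.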
     

22 Feb, 2018

11 commits

  • commit c0eb027e5aef70b71e5a38ee3e264dc0b497f343 upstream.

    Normal pathname lookup doesn't allow empty pathnames, but using
    AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
    can trigger an empty pathname lookup.

    And not only is the RCU lookup in that case entirely unnecessary
    (because we'll obviously immediately finalize the end result), it is
    actively wrong.

    Why? An empty path is a special case that will return the original
    'dirfd' dentry - and that dentry may not actually be RCU-free'd,
    resulting in a potential use-after-free if we were to initialize the
    path lazily under the RCU read lock and depend on complete_walk()
    finalizing the dentry.

    Found by syzkaller and KASAN.

    Reported-by: Dmitry Vyukov
    Reported-by: Vegard Nossum
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Cc: Eric Biggers
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
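
For reference, this is how userspace reaches the empty-pathname lookup mentioned above. The sketch is Linux-specific; `ex_stat_by_fd()` is a hypothetical wrapper, while fstatat() and AT_EMPTY_PATH are the real interface.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes AT_EMPTY_PATH under this macro */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* Linux value, in case the headers hid it */
#endif

/* Stat the object that fd itself refers to by passing an empty pathname. */
static int ex_stat_by_fd(int fd, struct stat *st)
{
	assert(st != NULL);
	return fstatat(fd, "", st, AT_EMPTY_PATH);
}
```

Without AT_EMPTY_PATH the same call fails with ENOENT, because an ordinary lookup rejects an empty pathname.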
     
  • commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

    If we can't get the inode lock immediately in
    ocfs2_inode_lock_with_page() when reading a page, we should not return
    directly, since this leads to a softlockup when the kernel is built
    without CONFIG_PREEMPT. Instead, take a blocking lock and unlock it
    immediately before returning. This avoids wasting CPU on repeated
    retries, improves fairness in acquiring the lock among multiple nodes,
    and increases efficiency when the same file is modified frequently
    from multiple nodes.

    The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
    looks like:

    Kernel panic - not syncing: softlockup: hung tasks
    CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:

    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0

    RIP: 0010:unlock_page+0x17/0x30
    RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
    RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
    RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
    RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
    R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
    R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
    RIP: 0033:0x7fa75ded638e
    RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
    RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
    RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
    RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
    R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
    R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

    As for the performance improvement, the test run time is reduced and
    CPU utilization decreases; the detailed data follows. I ran the
    multi_mmap test case from the ocfs2-test package in a three-node cluster.

    Before applying this patch:
     PID USER     PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
    2754 ocfs2te+ 20   0 170248   6980  4856 D 80.73 0.341 0:18.71 multi_mmap
    1505 root     rt   0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
       5 root     20   0      0      0     0 S 1.329 0.000 0:00.19 kworker/u8:0
      95 root     20   0      0      0     0 S 1.329 0.000 0:00.25 kworker/u8:1
    2728 root     20   0      0      0     0 S 0.997 0.000 0:00.24 jbd2/sda1-33
    2721 root     20   0      0      0     0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
    2750 ocfs2te+ 20   0 142976   4652  3532 S 0.664 0.227 0:00.28 mpirun

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 14:44:52 CST 2017
    multi_mmap..................................................Passed.
    Runtime 783 seconds.

    After applying this patch:

     PID USER     PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
    2508 ocfs2te+ 20   0 170248   6804  4680 R 54.00 0.333 0:55.37 multi_mmap
     155 root     20   0      0      0     0 S 2.667 0.000 0:01.20 kworker/u8:3
      95 root     20   0      0      0     0 S 2.000 0.000 0:01.58 kworker/u8:1
    2504 ocfs2te+ 20   0 142976   4604  3480 R 1.667 0.225 0:01.65 mpirun
       5 root     20   0      0      0     0 S 1.000 0.000 0:01.36 kworker/u8:0
    2482 root     20   0      0      0     0 S 1.000 0.000 0:00.86 jbd2/sda1-33
     299 root      0 -20      0      0     0 S 0.333 0.000 0:00.13 kworker/2:1H
     335 root      0 -20      0      0     0 S 0.333 0.000 0:00.17 kworker/1:1H
     535 root     20   0  12140   7268  1456 S 0.333 0.355 0:00.34 haveged
    1282 root     rt   0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 15:04:12 CST 2017
    multi_mmap..................................................Passed.
    Runtime 487 seconds.

    Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
    Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
    Signed-off-by: Gang He
    Reviewed-by: Eric Ren
    Acked-by: alex chen
    Acked-by: piaojun
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang He
     
  • commit 900c9981680067573671ecc5cbfa7c5770be3a40 upstream.

    The highest objectid, which is assigned to new inode, is decided at
    the time of initializing fs roots. However, in cases where log replay
    gets processed, the btree which fs root owns might be changed, so we
    have to search it again for the highest objectid, otherwise creating
    new inode would end up with -EEXIST.

    cc: v4.4-rc6+
    Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream.

    This regression is introduced in
    commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction").

    There are two problems,

    a) it is ->destroy_inode() that does the final free on inode, not
    ->evict_inode(),
    b) clear_inode() must be called before ->evict_inode() returns.

    This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
    in evict() because I_CLEAR is set in clear_inode().

    Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction")
    Cc: # v4.7-rc6+
    Signed-off-by: Liu Bo
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream.

    It's possible that btrfs_sync_log() bails out after one of the two
    btrfs_write_marked_extents() which convert extent state's state bit into
    EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
    and EXTENT_NEW are searched by free_log_tree() so that those extent states
    with EXTENT_NEED_WAIT lead to memory leak.

    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream.

    In cases that the whole fs flips into readonly status due to failures in
    critical sections, then log tree's blocks are still dirty, and this leads
    to a crash during umount time, the crash is about use-after-free,

    umount
    -> close_ctree
    -> stop workers
    -> iput(btree_inode)
    -> iput_final
    -> write_inode_now
    -> ...
    -> queue job on stop'd workers

    cc: v3.12+
    Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error")
    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit e89166990f11c3f21e1649d760dd35f9e410321c upstream.

    @cur_offset is not set back to what it should be (@cow_start) if
    btrfs_next_leaf() returns something wrong, and the range [cow_start,
    cur_offset) remains locked forever.

    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit 06f29cc81f0350261f59643a505010531130eea0 upstream.

    In the function __ext4_grp_locked_error(), __save_error_info()
    is called to save error info in the superblock, but that information
    is not synced to disk to inform the subsequent fsck after reboot.

    This patch writes the error information to disk. After this patch,
    I think there are no obvious ext4 error-handling branches that lead to
    "Remounting filesystem read-only" and still leave the disk partition
    without the subsequent fsck.

    Signed-off-by: Zhouyi Zhou
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Zhouyi Zhou
     
  • commit abbc3f9395c76d554a9ed27d4b1ebfb5d9b0e4ca upstream.

    This patch fixes a race between the shutdown path and bio completion
    handling. In the ext4 direct io path with async io, after submitting a
    bio to the block layer, if journal starting fails,
    ext4_direct_IO_write() would bail out pretending that the IO
    failed. The caller would have had no way of knowing whether or not the
    IO was successfully submitted. So instead, we return -EIOCBQUEUED in
    this case. Now, the caller knows that the IO was submitted. The bio
    completion handler takes care of the error.

    Tested: Ran the shutdown xfstest test 461 in loop for over 2 hours across
    4 machines resulting in over 400 runs. Verified that the race didn't
    occur. Usually the race was seen in about 20-30 iterations.

    Signed-off-by: Harshad Shirwadkar
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Harshad Shirwadkar
     
  • commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream.

    Sphinx emits various (26) warnings when building make target 'htmldocs'.
    Currently struct definitions contain duplicate documentation, some as
    kernel-docs and some as standard c89 comments. We can reduce
    duplication while cleaning up the kernel docs.

    Move all kernel-docs to right above each struct member. Use the set of
    all existing comments (kernel-doc and c89). Add documentation for
    missing struct members and function arguments.

    Signed-off-by: Tobin C. Harding
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Tobin C. Harding
     
  • commit 3876bbe27d04b848750d5310a37d6b76b593f648 upstream.

    KMSAN reported use of uninitialized |entry->e_referenced| in a condition
    in mb_cache_shrink():

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287
    CPU: 2 PID: 816 Comm: kswapd1 Not tainted 4.11.0-rc5+ #2877
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x172/0x1c0 lib/dump_stack.c:52
    kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927
    __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469
    mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287
    mb_cache_scan+0x67/0x80 fs/mbcache.c:321
    do_shrink_slab mm/vmscan.c:397 [inline]
    shrink_slab+0xc3d/0x12d0 mm/vmscan.c:500
    shrink_node+0x208f/0x2fd0 mm/vmscan.c:2603
    kswapd_shrink_node mm/vmscan.c:3172 [inline]
    balance_pgdat mm/vmscan.c:3289 [inline]
    kswapd+0x160f/0x2850 mm/vmscan.c:3478
    kthread+0x46c/0x5f0 kernel/kthread.c:230
    ret_from_fork+0x29/0x40 arch/x86/entry/entry_64.S:430
    chained origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline]
    kmsan_save_stack mm/kmsan/kmsan.c:317 [inline]
    kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547
    __msan_store_shadow_origin_1+0xac/0x110 mm/kmsan/kmsan_instr.c:257
    mb_cache_entry_create+0x3b3/0xc60 fs/mbcache.c:95
    ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline]
    ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022
    ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252
    ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306
    ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36
    __vfs_setxattr+0x703/0x790 fs/xattr.c:149
    __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180
    vfs_setxattr fs/xattr.c:223 [inline]
    setxattr+0x6ae/0x790 fs/xattr.c:449
    path_setxattr+0x1eb/0x380 fs/xattr.c:468
    SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490
    SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486
    entry_SYSCALL_64_fastpath+0x13/0x94
    origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline]
    kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198
    kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337
    kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766
    mb_cache_entry_create+0x283/0xc60 fs/mbcache.c:86
    ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline]
    ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022
    ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252
    ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306
    ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36
    __vfs_setxattr+0x703/0x790 fs/xattr.c:149
    __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180
    vfs_setxattr fs/xattr.c:223 [inline]
    setxattr+0x6ae/0x790 fs/xattr.c:449
    path_setxattr+0x1eb/0x380 fs/xattr.c:468
    SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490
    SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486
    entry_SYSCALL_64_fastpath+0x13/0x94
    ==================================================================

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # v4.6
    Signed-off-by: Greg Kroah-Hartman

    Alexander Potapenko
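
The report above concerns a flag that is read before anything has written it. The commit text shown here does not spell out the remedy, so the sketch below only assumes the natural one: set the flag when the entry is created. `ex_cache_entry` and both functions are hypothetical, with malloc() standing in for the slab allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct ex_cache_entry {
	unsigned int key;
	unsigned long value;
	unsigned int referenced : 1; /* consulted later by a shrinker-like scan */
};

/* Allocate an entry and set every field that a later scan will read. */
static struct ex_cache_entry *ex_entry_create(unsigned int key, unsigned long value)
{
	struct ex_cache_entry *e = malloc(sizeof(*e)); /* malloc does not zero memory */

	if (!e)
		return NULL;
	e->key = key;
	e->value = value;
	e->referenced = 0; /* explicit: otherwise this bit is indeterminate */
	return e;
}

/* A recently referenced entry gets one more round; the flag is cleared on the way. */
static int ex_entry_reclaimable(struct ex_cache_entry *e)
{
	assert(e != NULL);
	if (e->referenced) {
		e->referenced = 0;
		return 0;
	}
	return 1;
}
```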
     

17 Feb, 2018

16 commits

  • commit d796e77f1dd541fe34481af2eee6454688d13982 upstream.

    As a writable mount, it is not expected for overlayfs to return
    EINVAL/EROFS for fsync, even if dir/file is not changed.

    This commit fixes the case of fsync of directory, which is easier to
    address, because overlayfs already implements fsync file operation for
    directories.

    The problem reported by Raphael is that the new PostgreSQL 10.0, with a
    database in overlayfs whose lower layer is squashfs, fails to start.
    The failure is due to an fsync error: PostgreSQL does fsync on all
    existing db directories on startup, and a specific directory exists
    in the lower layer with no changes.

    Reported-by: Raphael Hertzog
    Signed-off-by: Amir Goldstein
    Tested-by: Raphaël Hertzog
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream.

    This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers
    to deal with pages that have been improperly dirtied") and it didn't do
    any error handling then. The function can well fail under ENOMEM, yet
    the failure is not handled, which could lead to an inconsistent state.
    So let's handle the failure by setting the mapping error bit.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 9903a91c763ecdae333a04a9d89d79d2b8966503 upstream.

    With pipe-user-pages-hard set to 'N', users were actually only allowed up
    to 'N - 1' buffers; and likewise for pipe-user-pages-soft.

    Fix this to allow up to 'N' buffers, as would be expected.

    Link: http://lkml.kernel.org/r/20180111052902.14409-5-ebiggers3@gmail.com
    Fixes: b0b91d18e2e9 ("pipe: fix limit checking in pipe_set_size()")
    Signed-off-by: Eric Biggers
    Acked-by: Willy Tarreau
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
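
The N versus N - 1 difference comes down to one comparison operator. The sketch below assumes the count being tested already includes the buffers being requested and that a limit of 0 means "no limit"; the `ex_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Off-by-one form: refuses once the total reaches the limit, so only N - 1 fit. */
static inline bool ex_too_many_off_by_one(unsigned long user_bufs, unsigned long limit)
{
	return limit && user_bufs >= limit;
}

/* Intended form: a total of exactly N is still allowed. */
static inline bool ex_too_many(unsigned long user_bufs, unsigned long limit)
{
	return limit && user_bufs > limit;
}

/* Charge 'nr' buffers to the user; undo the charge and refuse if over the limit. */
static inline bool ex_charge(unsigned long *user_bufs, unsigned long nr,
			     unsigned long limit)
{
	assert(user_bufs != NULL);
	*user_bufs += nr;
	if (ex_too_many(*user_bufs, limit)) {
		*user_bufs -= nr;
		return false;
	}
	return true;
}
```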
     
  • commit 85c2dd5473b2718b4b63e74bfeb1ca876868e11f upstream.

    pipe-user-pages-hard and pipe-user-pages-soft are only supposed to apply
    to unprivileged users, as documented in both Documentation/sysctl/fs.txt
    and the pipe(7) man page.

    However, the capabilities are actually only checked when increasing a
    pipe's size using F_SETPIPE_SZ, not when creating a new pipe. Therefore,
    if pipe-user-pages-hard has been set, the root user can run into it and be
    unable to create pipes. Similarly, if pipe-user-pages-soft has been set,
    the root user can run into it and have their pipes limited to 1 page each.

    Fix this by allowing the privileged override in both cases.

    Link: http://lkml.kernel.org/r/20180111052902.14409-4-ebiggers3@gmail.com
    Fixes: 759c01142a5d ("pipe: limit the per-user amount of pages allocated in pipes")
    Signed-off-by: Eric Biggers
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit d0290bc20d4739b7a900ae37eb5d4cc3be2b393f upstream.

    Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext
    data") added a bounce buffer to avoid hardened usercopy checks. Copying
    to the bounce buffer was implemented with a simple memcpy() assuming
    that it is always valid to read from kernel memory iff the
    kern_addr_valid() check passed.

    A simple, but pointless, test case like "dd if=/proc/kcore of=/dev/null"
    now can easily crash the kernel, since the former exception handling on
    invalid kernel addresses now doesn't work anymore.

    Also adding a kern_addr_valid() implementation wouldn't help here. Most
    architectures simply return 1 here, while a couple implemented a page
    table walk to figure out if something is mapped at the address in
    question.

    With DEBUG_PAGEALLOC active mappings are established and removed all the
    time, so that relying on the result of kern_addr_valid() before
    executing the memcpy() also doesn't work.

    Therefore simply use probe_kernel_read() to copy to the bounce buffer.
    This also allows read_kcore() to be simplified.

    At least on s390 this fixes the observed crashes and doesn't introduce
    warnings that were removed with df04abfd181a ("fs/proc/kcore.c: Add
    bounce buffer for ktext data"), even though the generic
    probe_kernel_read() implementation uses uaccess functions.

    While looking into this I'm also wondering if kern_addr_valid() could be
    completely removed...(?)

    Link: http://lkml.kernel.org/r/20171202132739.99971-1-heiko.carstens@de.ibm.com
    Fixes: df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Heiko Carstens
    Acked-by: Kees Cook
    Cc: Jiri Olsa
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     
  • commit 073c516ff73557a8f7315066856c04b50383ac34 upstream.

    Andrey reported a use-after-free in __ns_get_path():

    spin_lock include/linux/spinlock.h:299 [inline]
    lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
    __ns_get_path+0x197/0x860 fs/nsfs.c:66
    open_related_ns+0xda/0x200 fs/nsfs.c:143
    sock_ioctl+0x39d/0x440 net/socket.c:1001
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

    We are under rcu read lock protection at that point:

    rcu_read_lock();
    d = atomic_long_read(&ns->stashed);
    if (!d)
    goto slow;
    dentry = (struct dentry *)d;
    if (!lockref_get_not_dead(&dentry->d_lockref))
    goto slow;
    rcu_read_unlock();

    but don't use a proper RCU API on the free path, therefore a parallel
    __d_free() could free it at the same time. We need to mark the stashed
    dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
    readers leave RCU.

    Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
    Cc: Alexander Viro
    Cc: Andrew Morton
    Reported-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: Linus Torvalds
    Cc: Eric Biggers
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • commit ba87977a49913129962af8ac35b0e13e0fa4382d upstream.

    Commit b7ce40cff0b9 ("kernfs: cache atomic_write_len in
    kernfs_open_file") changed the type of the local variable 'len' from
    ssize_t to size_t. As a result, the *ppos value is also updated
    when the preceding write callback has failed.

    Mentioned snippet:
    ...
    len = ops->write(...);
    ...
    if (len > 0)
    ...
    Signed-off-by: Ivan Vecera
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Ivan Vecera
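
The signedness problem described in the commit above can be reproduced in user space. The sketch below is not the kernfs code; the two helper names are hypothetical and only show why a "> 0" check stops filtering out errors once the result is stored in an unsigned type.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical helpers: 'ret' plays the role of a write callback's return
 * value, which is negative on failure. Each helper reports whether the
 * file position would be advanced. */

/* Buggy variant: a negative value stored in size_t wraps around to a huge
 * positive number, so the check passes even though the write failed. */
static bool pos_updated_with_size_t(ssize_t ret)
{
	size_t len = (size_t)ret;

	return len > 0;
}

/* Signed variant: the error stays negative and the check rejects it. */
static bool pos_updated_with_ssize_t(ssize_t ret)
{
	ssize_t len = ret;

	return len > 0;
}
```

For a failing call such as ret = -5, only the size_t variant reports that the position would be advanced; for a successful write both variants agree.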
     
  • commit e231c6879cfd44e4fffd384bb6dd7d313249a523 upstream.

    When locking the file in order to do O_DIRECT on it, we must unmap
    any mmapped ranges on the pagecache so that we can flush out the
    dirty data.

    Fixes: a5864c999de67 ("NFS: Do not serialise O_DIRECT reads and writes")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 49686cbbb3ebafe42e63868222f269d8053ead00 upstream.

    nfs_idmap_legacy_upcall() is supposed to be called with 'aux' pointing
    to a 'struct idmap', via the call to request_key_with_auxdata() in
    nfs_idmap_request_key().

    However it can also be reached via the request_key() system call in
    which case 'aux' will be NULL, causing a NULL pointer dereference in
    nfs_idmap_prepare_pipe_upcall(), assuming that the key description is
    valid enough to get that far.

    Fix this by making nfs_idmap_legacy_upcall() negate the key if no
    auxdata is provided.

    As usual, this bug was found by syzkaller. A simple reproducer using
    the command-line keyctl program is:

    keyctl request2 id_legacy uid:0 '' @s

    Fixes: 57e62324e469 ("NFS: Store the legacy idmapper result in the keyring")
    Reported-by: syzbot+5dfdbcf7b3eb5912abbb@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
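
The guard described in the commit above amounts to rejecting a missing auxdata pointer before it is dereferenced. A minimal user-space sketch follows; toy_idmap and toy_legacy_upcall are hypothetical stand-ins, and returning -ENOKEY is an assumption made for illustration, not a statement about what the kernel function returns.

```c
#include <errno.h>
#include <stddef.h>

#ifndef ENOKEY
#define ENOKEY 126  /* Linux value; fallback for platforms that lack it */
#endif

/* Hypothetical stand-in for the auxiliary data the upcall expects. */
struct toy_idmap {
	int pending;
};

/* Hypothetical upcall: 'aux' is NULL when the caller supplied no auxdata. */
static int toy_legacy_upcall(void *aux)
{
	struct toy_idmap *idmap = aux;

	if (idmap == NULL)
		return -ENOKEY;  /* assumed error code: refuse instead of crashing */

	idmap->pending = 1;      /* safe: idmap is known to be non-NULL here */
	return 0;
}
```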
     
  • commit 1b8d97b0a837beaf48a8449955b52c650a7114b4 upstream.

    If some of the WRITE calls making up an O_DIRECT write syscall fail,
    we neglect to commit, even if some of the WRITEs succeed.

    We also depend on the commit code to free the reference count on the
    nfs_page taken in the "if (request_commit)" case at the end of
    nfs_direct_write_completion(). The problem was originally noticed
    because ENOSPC's encountered partway through a write would result in a
    closed file being sillyrenamed when it should have been unlinked.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    J. Bruce Fields
     
  • commit 7f1bda447c9bd48b415acedba6b830f61591601f upstream.

    The commit list can get very large, and so we need a cond_resched()
    in nfs_commit_release_pages() in order to ensure we don't hog the CPU
    for excessive periods of time.

    Reported-by: Mike Galbraith
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit ba4a76f703ab7eb72941fdaac848502073d6e9ee upstream.

    Currently when falling back to doing I/O through the MDS (via
    pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
    without releasing the reference taken on the dreq
    via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
    nfs_direct_pgio_init. It then takes another reference on the dreq via
    nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
    as a result the requester will become stuck in inode_dio_wait. Once
    that happens, other processes accessing the inode will become stuck as
    well.

    Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
    up correctly by calling hdr->completion_ops->completion() instead of
    calling hdr->release() directly.

    This can be reproduced (sometimes) by performing "storage failover
    takeover" commands on a NetApp filer while doing direct I/O from a
    client.

    This can also be reproduced using SystemTap to simulate a failure while
    doing direct I/O from a client (from Dave Wysochanski):

    stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'

    Suggested-by: Trond Myklebust
    Signed-off-by: Scott Mayhew
    Fixes: 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs fails")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Scott Mayhew
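
The reference leak described in the commit above can be modelled with a plain counter. The sketch below is a toy: toy_dreq, toy_hdr and the toy_* functions are hypothetical, and the real completion path does considerably more than drop one reference.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_dreq { int refs; };            /* outstanding I/O references */
struct toy_hdr  { struct toy_dreq *dreq; };

/* Initialising a header takes a reference on the direct request. */
static void toy_hdr_init(struct toy_hdr *hdr, struct toy_dreq *dreq)
{
	hdr->dreq = dreq;
	dreq->refs++;
}

/* release(): discards the header only; the reference is NOT dropped. */
static void toy_hdr_release(struct toy_hdr *hdr)
{
	hdr->dreq = NULL;
}

/* completion(): drops the reference, then releases the header. */
static void toy_hdr_completion(struct toy_hdr *hdr)
{
	hdr->dreq->refs--;
	toy_hdr_release(hdr);
}

/* Fallback path: tear down the first header, then redo the I/O with a
 * fresh header that completes normally. Returns the leftover references;
 * anything other than 0 means a waiter would never be woken. */
static int toy_fallback(bool use_completion)
{
	struct toy_dreq dreq = { 0 };
	struct toy_hdr first, retry;

	toy_hdr_init(&first, &dreq);
	if (use_completion)
		toy_hdr_completion(&first);
	else
		toy_hdr_release(&first);

	toy_hdr_init(&retry, &dreq);
	toy_hdr_completion(&retry);
	return dreq.refs;
}
```

Calling toy_fallback(false) leaves one reference behind, which in this model corresponds to the stuck waiter; toy_fallback(true) returns to zero.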
     
  • commit d8db5b1ca9d4c57e49893d0f78e6d5ce81450cc8 upstream.

    The inode is not locked in init_xattrs when creating a new inode.

    Without this patch, an assertion fails when booting or creating a new
    file if the kernel config option CONFIG_SECURITY_SMACK is enabled.

    The log looks like:

    UBIFS assert failed in ubifs_xattr_set at 298 (pid 1156)
    CPU: 1 PID: 1156 Comm: ldconfig Tainted: G S 4.12.0-rc1-207440-g1e70b02 #2
    Hardware name: MediaTek MT2712 evaluation board (DT)
    Call trace:
    dump_backtrace+0x0/0x238
    show_stack+0x14/0x20
    dump_stack+0x9c/0xc0
    ubifs_xattr_set+0x374/0x5e0
    init_xattrs+0x5c/0xb8
    security_inode_init_security+0x110/0x190
    ubifs_init_security+0x30/0x68
    ubifs_mkdir+0x100/0x200
    vfs_mkdir+0x11c/0x1b8
    SyS_mkdirat+0x74/0xd0
    __sys_trace_return+0x0/0x4

    Signed-off-by: Xiaolei Li
    Signed-off-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    (julia: massaged to apply to 4.9.y, which doesn't contain fscrypto support)
    Signed-off-by: Julia Cartwright
    Signed-off-by: Greg Kroah-Hartman

    Xiaolei Li
     
  • commit 97f4b7276b829a8927ac903a119bef2f963ccc58 upstream.

    Also replaces memset()+kfree() with kzfree().

    Signed-off-by: Aurelien Aptel
    Signed-off-by: Steve French
    Reviewed-by: Pavel Shilovsky
    Signed-off-by: Greg Kroah-Hartman

    Aurelien Aptel
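
The zero-then-free idiom mentioned in the commit above has a simple user-space analogue. The sketch below is not kzfree(): toy_wipe and toy_zfree are hypothetical names, and unlike the kernel helper this version needs the buffer length passed in explicitly. The volatile pointer is used so that the compiler is less likely to drop the stores as dead writes before free().

```c
#include <stddef.h>
#include <stdlib.h>

/* Overwrite 'len' bytes at 'p' with zeros through a volatile pointer. */
static void toy_wipe(void *p, size_t len)
{
	volatile unsigned char *b = p;

	while (len--)
		*b++ = 0;
}

/* Wipe a heap buffer before releasing it; NULL is accepted like free(). */
static void toy_zfree(void *p, size_t len)
{
	if (p == NULL)
		return;
	toy_wipe(p, len);
	free(p);
}
```

Folding both steps into one helper keeps call sites from forgetting the wipe when a buffer holds key material or other secrets.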
     
  • commit 9aca7e454415f7878b28524e76bebe1170911a88 upstream.

    Autonegotiation gives a security settings mismatch error if the SMB
    server selects an SMBv3 dialect that isn't SMB3.02. The exact error is
    "protocol revalidation - security settings mismatch".
    This can be tested using Samba v4.2 or by setting the global Samba
    setting max protocol = SMB3_00.

    The check that fails in smb3_validate_negotiate is the dialect
    verification of the negotiate info response. This is because it tries
    to verify against the protocol_id in the global smbdefault_values. The
    protocol_id in smbdefault_values is SMB3.02.
    SMB2_negotiate does not update the protocol_id in smbdefault_values
    (it is global, so it probably shouldn't), but it does update
    server->dialect.

    This patch changes the check in smb3_validate_negotiate to use
    server->dialect instead of server->vals->protocol_id. The patch works
    with autonegotiate and when using a specific version in the vers mount
    option.

    Signed-off-by: Daniel N Pettersson
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Daniel N Pettersson
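
The mismatch described in the commit above can be sketched as two comparisons against different fields. The structures and function names below are hypothetical simplifications, not the CIFS client's types; only the dialect revision numbers 0x0300 (SMB 3.0) and 0x0302 (SMB 3.02) are the protocol's own.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SMB30_ID  0x0300  /* SMB 3.0 dialect revision */
#define TOY_SMB302_ID 0x0302  /* SMB 3.02 dialect revision */

/* Hypothetical, simplified stand-ins for the client's structures. */
struct toy_values {
	uint16_t protocol_id;         /* shared default, never updated */
};

struct toy_server {
	const struct toy_values *vals;
	uint16_t dialect;             /* dialect the server actually selected */
};

static const struct toy_values toy_default_values = { TOY_SMB302_ID };

/* Old check: compares the response against the shared default. */
static bool toy_validate_old(const struct toy_server *s, uint16_t rsp_dialect)
{
	return rsp_dialect == s->vals->protocol_id;
}

/* New check: compares the response against the negotiated dialect. */
static bool toy_validate_new(const struct toy_server *s, uint16_t rsp_dialect)
{
	return rsp_dialect == s->dialect;
}
```

With a server that negotiated SMB 3.0, the old comparison rejects a perfectly consistent response, while the new one accepts it.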
     
  • commit f04a703c3d613845ae3141bfaf223489de8ab3eb upstream.

    If cifs_zap_mapping() returned an error, we would return without putting
    the xid that we got earlier. Restructure cifs_file_strict_mmap() and
    cifs_file_mmap() to be more similar to each other and have a single
    point of return that always puts the xid.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox
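
The single-point-of-return structure described in the commit above is the usual C cleanup pattern with goto. A minimal sketch, assuming hypothetical toy_* helpers that merely count outstanding ids rather than calling any CIFS function:

```c
#include <errno.h>

static int toy_xids_in_use;   /* ids handed out and not yet returned */

static int toy_get_xid(void)
{
	return ++toy_xids_in_use;
}

static void toy_free_xid(int xid)
{
	(void)xid;
	toy_xids_in_use--;
}

/* Hypothetical steps that may fail. */
static int toy_zap_mapping(int fail)
{
	return fail ? -EIO : 0;
}

static int toy_generic_mmap(void)
{
	return 0;
}

/* Every path, success or failure, leaves through 'out', so the id taken
 * at the top is always returned. */
static int toy_mmap(int zap_fails)
{
	int xid = toy_get_xid();
	int rc;

	rc = toy_zap_mapping(zap_fails);
	if (rc)
		goto out;

	rc = toy_generic_mmap();
out:
	toy_free_xid(xid);
	return rc;
}
```

An early "return rc;" in place of "goto out;" would reintroduce the leak, because the id would never be handed back on the error path.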