30 Jul, 2018

3 commits

  • Pull ext4 fixes from Ted Ts'o:
    "Some miscellaneous ext4 fixes for 4.18; one fix is for a regression
    introduced in 4.18-rc4.

    Sorry for the late-breaking pull. I was originally going to wait for
    the next merge window, but Eric Whitney found a regression introduced
    in 4.18-rc4, so I decided to push out the regression fix plus the other
    fixes now. (The other commits have been baking in linux-next since
    early July)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix check to prevent initializing reserved inodes
    ext4: check for allocation block validity with block group locked
    ext4: fix inline data updates with checksums enabled
    ext4: clear mmp sequence number when remounting read-only
    ext4: fix false negatives *and* false positives in ext4_check_descriptors()

    Linus Torvalds
     
  • Anatoly Trosinenko reports that a corrupted squashfs image can cause a
    kernel oops. It turns out that squashfs can end up being confused about
    negative fragment lengths.

    The regular squashfs_read_data() does check for negative lengths, but
    squashfs_read_metadata() did not, and the fragment size code just
    blindly trusted the on-disk value. Fix both the fragment parsing and
    the metadata reading code.

    Reported-by: Anatoly Trosinenko
    Cc: Al Viro
    Cc: Phillip Lougher
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
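
    The kind of check described above can be sketched in userspace C. This is
    only an illustration, not the squashfs code: sketch_check_length() and
    SKETCH_MAX_LENGTH are hypothetical names, and the upper bound is an
    assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical upper bound standing in for the largest block a reader accepts. */
#define SKETCH_MAX_LENGTH (1 << 20)

/*
 * Validate a length field read from an untrusted image before it is used
 * as a size. Returns 0 when the value is usable and -1 when it has to be
 * treated as corruption.
 */
static int sketch_check_length(int32_t on_disk_length)
{
	if (on_disk_length < 0)
		return -1;	/* negative: would turn into a huge unsigned size */
	if (on_disk_length > SKETCH_MAX_LENGTH)
		return -1;	/* larger than any buffer the reader would use */
	return 0;
}
```

    The point is that every path consuming an on-disk length runs the same
    validation, so no caller trusts the raw value.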
     
  • Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is
    valid" will complain if block group zero does not have the
    EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct,
    since a freshly created file system has this flag cleared. It gets set
    almost immediately after the file system is mounted read-write --- but
    the following somewhat unlikely sequence will end up triggering a
    false positive report of a corrupted file system:

    mkfs.ext4 /dev/vdc
    mount -o ro /dev/vdc /vdc
    mount -o remount,rw /dev/vdc

    Instead, when initializing the inode table for block group zero, test
    to make sure that itable_unused count is not too large, since that is
    the case that will result in some or all of the reserved inodes
    getting cleared.

    This fixes the failures reported by Eric Whitney when running
    generic/230 and generic/231 in the nojournal test case.

    Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid")
    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
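
    One way to read the itable_unused test described above, as a userspace
    sketch. The function name and parameters are hypothetical, and the exact
    comparison is an assumption drawn from the description rather than a copy
    of the ext4 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the unused-inode count of block group zero is small
 * enough that the reserved inodes (numbers below first_ino) lie outside
 * the part of the inode table that would be zeroed.
 */
static bool sketch_group0_itable_ok(uint32_t inodes_per_group,
				    uint32_t itable_unused,
				    uint32_t first_ino)
{
	uint32_t used;

	if (itable_unused > inodes_per_group)
		return false;	/* the count itself is nonsensical */
	/* Inodes treated as in use at the front of the table. */
	used = inodes_per_group - itable_unused;
	return used >= first_ino;
}
```

    With 8192 inodes per group and first_ino of 11, an unused count of 8181
    passes, while 8192 (every inode unused) is rejected.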
     

28 Jul, 2018

3 commits

  • Pull block fixes from Jens Axboe:
    "Bigger than usual at this time, mostly due to the O_DIRECT corruption
    issue and the fact that I was on vacation last week. This contains:

    - NVMe pull request with two fixes for the FC code, and two target
    fixes (Christoph)

    - a DIF bio reset iteration fix (Greg Edwards)

    - two nbd reply and requeue fixes (Josef)

    - SCSI timeout fixup (Keith)

    - a small series that fixes an issue with bio_iov_iter_get_pages(),
    which ended up causing corruption for larger sized O_DIRECT writes
    that ended up racing with buffered writes (Martin Wilck)"

    * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
    block: reset bi_iter.bi_done after splitting bio
    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
    blkdev: __blkdev_direct_IO_simple: fix leak in error case
    block: bio_iov_iter_get_pages: fix size of last iovec
    nvmet: only check for filebacking on -ENOTBLK
    nvmet: fixup crash on NULL device path
    scsi: set timed out out mq requests to complete
    blk-mq: export setting request completion state
    nvme: if_ready checks to fail io to deleting controller
    nvmet-fc: fix target sgl list on large transfers
    nbd: handle unexpected replies better
    nbd: don't requeue the same request twice.

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton:
    kvm, mm: account shadow page tables to kmemcg
    zswap: re-check zswap_is_full() after do zswap_shrink()
    include/linux/eventfd.h: include linux/errno.h
    mm: fix vma_is_anonymous() false-positives
    mm: use vma_init() to initialize VMAs on stack and data segments
    mm: introduce vma_init()
    mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
    ipc/sem.c: prevent queue.status tearing in semop
    mm: disallow mappings that conflict for devm_memremap_pages()
    kasan: only select SLUB_DEBUG with SYSFS=y
    delayacct: fix crash in delayacct_blkio_end() after delayacct init failure

    Linus Torvalds
     
  • Pull xfs fixes from Darrick Wong:

    - Fix some uninitialized variable errors

    - Fix an incorrect check in metadata verifiers

    * tag 'xfs-4.18-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: properly handle free inodes in extent hint validators
    xfs: Initialize variables in xfs_alloc_get_rec before using them

    Linus Torvalds
     

27 Jul, 2018

3 commits

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE _IO('c', 100)
    #define KCOV_DISABLE _IO('c', 101)
    #define COVER_SIZE (1024<<< 20);
    return 0;
    }

    This can be fixed by assigning anonymous VMAs their own vm_ops and not
    relying on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
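
    The idea in the description can be modelled in a few lines of userspace
    C. All names here are hypothetical and the structs are stand-ins; this
    is a sketch of the approach as worded above, not the kernel
    implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_vm_ops { const char *name; };

/* Two distinct operation tables, neither of which is ever NULL. */
static const struct sketch_vm_ops sketch_anon_ops  = { "anon" };
static const struct sketch_vm_ops sketch_dummy_ops = { "dummy" };

struct sketch_vma { const struct sketch_vm_ops *ops; };

/* Anonymous mappings are tagged explicitly instead of being inferred. */
static void sketch_set_anonymous(struct sketch_vma *vma)
{
	vma->ops = &sketch_anon_ops;
}

/* Runs after a driver's mmap hook: supply a default if it set nothing. */
static void sketch_finish_mmap(struct sketch_vma *vma)
{
	if (vma->ops == NULL)
		vma->ops = &sketch_dummy_ops;
}

static bool sketch_is_anonymous(const struct sketch_vma *vma)
{
	return vma->ops == &sketch_anon_ops;
}
```

    A file-backed mapping whose hook left the pointer unset now ends up with
    the dummy table and is no longer mistaken for an anonymous one.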
     
  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe

    Martin Wilck
     

26 Jul, 2018

1 commit

  • …git/dhowells/linux-fs

    Pull fscache/cachefiles fixes from David Howells:

    - Allow cancelled operations to be queued so they can be cleaned up.

    - Fix a refcounting bug in the monitoring of reads on backend files
    whereby a race can occur between monitor objects being listed for
    work, the work processing being queued and the work processor running
    and destroying the monitor objects.

    - Fix a ref overput in object attachment, whereby a tentatively
    considered object is put in error handling without first being 'got'.

    - Fix a missing clear of the CACHEFILES_OBJECT_ACTIVE flag whereby an
    assertion occurs when we retry because it seems the object is now
    active.

    - Wait rather than BUG'ing on an object collision in the depths of
    cachefiles as the active object should be being cleaned up - also
    depends on the one above.

    * tag 'fscache-fixes-20180725' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
    cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
    fscache: Fix reference overput in fscache_attach_object() error handling
    cachefiles: Fix refcounting bug in backing-file read monitoring
    fscache: Allow cancelled operations to be enqueued

    Linus Torvalds
     

25 Jul, 2018

6 commits

  • If we meet a conflicting object that is marked FSCACHE_OBJECT_IS_LIVE in
    the active object tree, we have been emitting a BUG after logging
    information about it and the new object.

    Instead, we should wait for the CACHEFILES_OBJECT_ACTIVE flag to be cleared
    on the old object (or return an error). The ACTIVE flag should be cleared
    after it has been removed from the active object tree. A timeout of 60s is
    used in the wait, so we shouldn't be able to get stuck there.

    Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
    Signed-off-by: Kiran Kumar Modukuri
    Signed-off-by: David Howells

    Kiran Kumar Modukuri
     
  • In cachefiles_mark_object_active(), the new object is marked active and
    then we try to add it to the active object tree. If a conflicting object
    is already present, we want to wait for that to go away. After the wait,
    we go round again and try to re-mark the object as being active - but it's
    already marked active from the first time we went through and a BUG is
    issued.

    Fix this by clearing the CACHEFILES_OBJECT_ACTIVE flag before we try again.

    Analysis from Kiran Kumar Modukuri:

    [Impact]
    Oops during heavy NFS + FSCache + Cachefiles

    CacheFiles: Error: Overlong wait for old active object to go away.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000002

    CacheFiles: Error: Object already active kernel BUG at
    fs/cachefiles/namei.c:163!

    [Cause]
    In a heavily loaded system with big files being read and truncated, an
    fscache object for a cookie is being dropped and a new object being
    looked up. The new object being looked up has to wait for the old object
    to go away before the new object is moved to active state.

    [Fix]
    Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when
    retrying the object lookup.

    [Testcase]
    Have run ~100 hours of NFS stress tests and have not seen this bug recur.

    [Regression Potential]
    - Limited to fscache/cachefiles.

    Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
    Signed-off-by: Kiran Kumar Modukuri
    Signed-off-by: David Howells

    Kiran Kumar Modukuri
     
  • When a cookie is allocated that causes fscache_object structs to be
    allocated, those objects are initialised with the cookie pointer, but
    aren't blessed with a ref on that cookie unless the attachment is
    successfully completed in fscache_attach_object().

    If attachment fails because the parent object was dying or there was a
    collision, fscache_attach_object() returns without incrementing the cookie
    counter - but upon failure of this function, the object is released which
    then puts the cookie, whether or not a ref was taken on the cookie.

    Fix this by taking a ref on the cookie when it is assigned in
    fscache_object_init(), even when we're creating a root object.

    Analysis from Kiran Kumar:

    This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel

    BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277

    fscache cookie ref count updated incorrectly during fscache object
    allocation resulting in following Oops.

    kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
    kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

    [Cause]
    Two threads are trying to operate on a cookie and two objects.

    (1) One thread tries to unmount the filesystem and in the process goes
    over a huge list of objects marking them dead and deleting the objects.
    cookie->usage is also decremented in the following path:

        nfs_fscache_release_super_cookie
          -> __fscache_relinquish_cookie
            -> __fscache_cookie_put
            -> BUG_ON(atomic_read(&cookie->usage) [...]

    [...]
          -> fscache_object_init
            -> assign cookie, but usage not bumped.
    2) fscache_attach_object -> fails in cant_attach_object because the
       cookie's backing object or cookie's->parent object are going away
    3) fscache_put_object
         -> cachefiles_put_object
           -> fscache_object_destroy
             -> fscache_cookie_put
               -> BUG_ON(atomic_read(&cookie->usage) [...]

    Signed-off-by: David Howells

    Kiran Kumar Modukuri
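
    The reference-counting rule described in this entry (take the cookie ref
    where the cookie is assigned, so the put on release is always balanced)
    can be sketched as follows. The types and functions are hypothetical
    stand-ins, not the fscache structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_cookie { int usage; };
struct sketch_object { struct sketch_cookie *cookie; };

/* Assigning the cookie and taking the reference happen together. */
static void sketch_object_init(struct sketch_object *obj,
			       struct sketch_cookie *cookie)
{
	cookie->usage++;
	obj->cookie = cookie;
}

/* Attachment may fail; it neither takes nor drops a cookie reference. */
static int sketch_attach_object(struct sketch_object *obj, bool parent_dying)
{
	(void)obj;
	return parent_dying ? -1 : 0;
}

/* Release drops exactly the reference taken in sketch_object_init(). */
static void sketch_object_destroy(struct sketch_object *obj)
{
	assert(obj->cookie->usage > 0);
	obj->cookie->usage--;
	obj->cookie = NULL;
}
```

    Whether or not attachment succeeds, init followed by destroy leaves the
    cookie's count where it started.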
     
  • cachefiles_read_waiter() has the right to access a 'monitor' object by
    virtue of being called under the waitqueue lock for one of the pages in its
    purview. However, it has no ref on that monitor object or on the
    associated operation.

    What it is allowed to do is to move the monitor object to the operation's
    to_do list, but once it drops the work_lock, it's actually no longer
    permitted to access that object. However, it is trying to enqueue the
    retrieval operation for processing - but it can only do this via a pointer
    in the monitor object, something it shouldn't be doing.

    If it doesn't enqueue the operation, the operation may not get processed.
    If the order is flipped so that the enqueue is first, then it's possible
    for the work processor to look at the to_do list before the monitor is
    enqueued upon it.

    Fix this by getting a ref on the operation so that we can trust that it
    will still be there once we've added the monitor to the to_do list and
    dropped the work_lock. The op can then be enqueued after the lock is
    dropped.

    The bug can manifest in one of a couple of ways. The first manifestation
    looks like:

    FS-Cache:
    FS-Cache: Assertion failed
    FS-Cache: 6 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:494!
    RIP: 0010:fscache_put_operation+0x1e3/0x1f0
    ...
    fscache_op_work_func+0x26/0x50
    process_one_work+0x131/0x290
    worker_thread+0x45/0x360
    kthread+0xf8/0x130
    ? create_worker+0x190/0x190
    ? kthread_cancel_work_sync+0x10/0x10
    ret_from_fork+0x1f/0x30

    This is due to the operation being in the DEAD state (6) rather than
    INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
    fscache_put_operation().

    The bug can also manifest like the following:

    kernel BUG at fs/fscache/operation.c:69!
    ...
    [exception RIP: fscache_enqueue_operation+246]
    ...
    #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
    #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
    #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028

    I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
    entirely clear which assertion failed.

    Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
    Reported-by: Lei Xue
    Reported-by: Vegard Nossum
    Reported-by: Anthony DeRobertis
    Reported-by: NeilBrown
    Reported-by: Daniel Axtens
    Reported-by: Kiran Kumar Modukuri
    Signed-off-by: David Howells
    Reviewed-by: Daniel Axtens

    Kiran Kumar Modukuri
     
  • Alter the state-check assertion in fscache_enqueue_operation() to allow
    cancelled operations to be given processing time so they can be cleaned up.

    Also fix a debugging statement that was requiring such operations to have
    an object assigned.

    Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
    Reported-by: Kiran Kumar Modukuri
    Signed-off-by: David Howells

    Kiran Kumar Modukuri
     
  • When inodes are freed in xfs_ifree(), di_flags is cleared (so extent size
    hints are removed) but the actual extent size fields are left intact.
    This causes the extent hint validators to fail on freed inodes which once
    had extent size hints.

    This can be observed (for example) by running xfs/229 twice on a
    non-crc xfs filesystem, or presumably on V5 with ikeep.

    Fixes: 7d71a67 ("xfs: verify extent size hint is valid in inode verifier")
    Fixes: 02a0fda ("xfs: verify COW extent size hint is valid in inode verifier")
    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

23 Jul, 2018

1 commit

  • Pull vfs fixes from Al Viro:
    "Fix several places that screw up cleanups after failures halfway
    through opening a file (one open-coding filp_clone_open() and getting
    it wrong, two misusing alloc_file()). That part is -stable fodder from
    the 'work.open' branch.

    And Christoph's regression fix for uapi breakage in aio series;
    include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
    definition of sigset_t, the reason for doing so in the first place had
    been bogus - there's no need to expose struct __aio_sigset in
    aio_abi.h at all"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: don't expose __aio_sigset in uapi
    ocxlflash_getfile(): fix double-iput() on alloc_file() failures
    cxl_getfile(): fix double-iput() on alloc_file() failures
    drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

    Linus Torvalds
     

22 Jul, 2018

4 commits

  • Pull btrfs fix from David Sterba:
    "A fix of a corruption regarding fsync and clone, under some very
    specific conditions explained in the patch.

    The fix is marked for stable 3.16+ so I'd like to get it merged now
    given the impact"

    * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    Btrfs: fix file data corruption after cloning a range and fsync

    Linus Torvalds
     
  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
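
    A userspace model of the initializer described above, assuming simplified
    stand-in types (sketch_vma, sketch_mm and sketch_list_head are
    hypothetical). It sets only the two fields the description names and
    leaves the rest to the caller.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_list_head { struct sketch_list_head *next, *prev; };
struct sketch_mm { int unused; };

struct sketch_vma {
	struct sketch_mm *mm;
	struct sketch_list_head anon_chain;
	unsigned long start, end;	/* left for the caller to fill in */
};

/* Set up the fields every user needs: owner pointer and an empty list. */
static void sketch_vma_init(struct sketch_vma *vma, struct sketch_mm *mm)
{
	vma->mm = mm;
	vma->anon_chain.next = &vma->anon_chain;
	vma->anon_chain.prev = &vma->anon_chain;
}
```

    An empty circular list points at itself in both directions, which is what
    the two assignments establish.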
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
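
    The mechanical conversion listed above amounts to three thin wrappers. A
    userspace analogue, with calloc/malloc/free standing in for the
    kmem_cache calls and a hypothetical sketch_vma type, might look like
    this:

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_vma { unsigned long start, end; };

/* New zeroed object, standing in for the zeroing cache allocation. */
static struct sketch_vma *sketch_vma_alloc(void)
{
	return calloc(1, sizeof(struct sketch_vma));
}

/*
 * Allocate room for a copy. As in the description, the original is not
 * used yet: callers still initialize the new object by hand.
 */
static struct sketch_vma *sketch_vma_dup(const struct sketch_vma *orig)
{
	(void)orig;
	return malloc(sizeof(struct sketch_vma));
}

static void sketch_vma_free(struct sketch_vma *vma)
{
	free(vma);
}
```

    Keeping the wrappers trivial at first makes the conversion easy to review;
    shared initialization can move into them later.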
     
  • In parse_options(), if match_strdup() fails, opts->iocharset is left in
    an unexpected state (i.e. still pointing to the freed string), which
    can cause a double free.

    To fix this, always reinitialize opts->iocharset when freeing.

    Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp
    Signed-off-by: OGAWA Hirofumi
    Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
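
    The pattern in this fix, resetting the pointer in the same place it is
    freed, can be shown in a small userspace sketch. The names are
    hypothetical and malloc/free stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical static default that must never be passed to free(). */
static char sketch_default_iocharset[] = "default";

struct sketch_opts { char *iocharset; };

/* Free a heap-allocated name and always leave the field at the default. */
static void sketch_reset_iocharset(struct sketch_opts *opts)
{
	if (opts->iocharset != sketch_default_iocharset)
		free(opts->iocharset);
	opts->iocharset = sketch_default_iocharset;
}

/* Install a copy of name; on allocation failure the default stays in place. */
static int sketch_set_iocharset(struct sketch_opts *opts, const char *name)
{
	size_t len = strlen(name) + 1;
	char *copy;

	sketch_reset_iocharset(opts);
	copy = malloc(len);
	if (copy == NULL)
		return -1;
	memcpy(copy, name, len);
	opts->iocharset = copy;
	return 0;
}
```

    Because the field never holds a stale pointer, calling the reset helper
    twice in a row is harmless.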
     

19 Jul, 2018

2 commits

  • When we clone a range into a file we can end up dropping existing
    extent maps (or trimming them) and replacing them with new ones if the
    range to be cloned overlaps with a range in the destination inode.
    When that happens we add the new extent maps to the list of modified
    extents in the inode's extent map tree, so that a "fast" fsync (the flag
    BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
    and log corresponding extent items. However, at the end of range cloning
    operation we do truncate all the pages in the affected range (in order to
    ensure future reads will not get stale data). Sometimes this truncation
    will release the corresponding extent maps besides the pages from the page
    cache. If this happens, then a "fast" fsync operation will miss logging
    some extent items, because it relies exclusively on the extent maps being
    present in the inode's extent tree, leading to data loss/corruption if
    the fsync ends up using the same transaction used by the clone operation
    (that transaction was not committed in the meanwhile). An extent map is
    released through the callback btrfs_invalidatepage(), which gets called by
    truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
    latter ends up calling try_release_extent_mapping() which will release the
    extent map if some conditions are met, like the file size being greater
    than 16Mb, the gfp flags allowing blocking, the range not being locked
    (which is the case during the clone operation) and the extent map not
    being flagged as pinned (also the case for cloning).

    The following example, turned into a test for fstests, reproduces the
    issue:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
    $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

    $ xfs_io -c "fsync" /mnt/bar
    # reflink destination offset corresponds to the size of file bar,
    # 2728Kb minus 4Kb.
    $ xfs_io -c "reflink /mnt/foo 0 2724K 15908K" /mnt/bar
    $ xfs_io -c "fsync" /mnt/bar

    $ md5sum /mnt/bar
    95a95813a8c2abc9aa75a6c2914a077e /mnt/bar

    # power failure happens here

    $ mount /dev/sdb /mnt
    $ md5sum /mnt/bar
    207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
    # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
    # power failure

    In the above example, the destination offset of the clone operation
    corresponds to the size of the "bar" file minus 4Kb. So during the clone
    operation, the extent map covering the range from 2572Kb to 2728Kb gets
    trimmed so that it ends at offset 2724Kb, and a new extent map covering
    the range from 2724Kb to 11724Kb is created. So at the end of the clone
    operation when we ask to truncate the pages in the range from 2724Kb to
    2724Kb + 15908Kb, the page invalidation callback ends up removing the new
    extent map (through try_release_extent_mapping()) when the page at offset
    2724Kb is passed to that callback.

    Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
    map is removed at try_release_extent_mapping(), forcing the next fsync to
    search for modified extents in the fs/subvolume tree instead of relying on
    the presence of extent maps in memory. This way we can continue doing a
    "fast" fsync if the destination range of a clone operation does not
    overlap with an existing range or if any of the criteria necessary to
    remove an extent map at try_release_extent_mapping() is not met (file
    size not bigger than 16Mb or gfp flags do not allow blocking).

    CC: stable@vger.kernel.org # 3.16+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
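
    The fix can be pictured as a flag that is raised whenever an in-memory
    extent map is dropped. The following userspace model uses hypothetical
    names, with a plain boolean standing in for
    BTRFS_INODE_NEEDS_FULL_SYNC; it is a sketch of the idea only.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_inode {
	int cached_extent_maps;	/* extent maps currently held in memory */
	bool needs_full_sync;	/* stands in for BTRFS_INODE_NEEDS_FULL_SYNC */
};

/* Dropping an in-memory extent map forces the next fsync onto the slow path. */
static bool sketch_release_extent_map(struct sketch_inode *inode)
{
	if (inode->cached_extent_maps == 0)
		return false;
	inode->cached_extent_maps--;
	inode->needs_full_sync = true;
	return true;
}

/* A "fast" fsync is only safe while no extent map has been released. */
static bool sketch_can_fast_fsync(const struct sketch_inode *inode)
{
	return !inode->needs_full_sync;
}
```

    As long as nothing is released, the fast path stays available, which
    matches the cases the description lists as unaffected.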
     
  • Pull btrfs fixes from David Sterba:
    "Three regression fixes. They're few-liners and fixing some corner
    cases missed in the original patches"

    * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
    btrfs: fix use-after-free of cmp workspace pages
    btrfs: restore uuid_mutex in btrfs_open_devices

    Linus Torvalds
     

18 Jul, 2018

1 commit

  • glibc uses a different definition of sigset_t than the kernel does,
    and the current version would pull in both. To fix this just do not
    expose the type at all - this somewhat mirrors pselect() where we
    do not even have a type for the magic sigmask argument, but just
    use pointer arithmetic.

    Fixes: 7a074e96 ("aio: implement io_pgetevents")
    Signed-off-by: Christoph Hellwig
    Reported-by: Adrian Reber
    Signed-off-by: Al Viro

    Christoph Hellwig
     

17 Jul, 2018

1 commit

  • In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
    replace") we removed the branch of copy_nocow_pages() to avoid
    corruption for compressed nodatasum extents.

    However, the above commit only solves the problem in scrub_extent(). If
    we fail to read some pages during scrub_pages(),
    sctx->no_io_error_seen will be non-zero and we go to the fixup function
    scrub_handle_errored_block().

    In scrub_handle_errored_block(), for sctx without csum (no matter if
    we're doing replace or scrub) we go to the scrub_fixup_nodatasum()
    routine, which does much the same thing as copy_nocow_pages(), but
    without the extra check that copy_nocow_pages() performs.

    So for test cases like btrfs/100, where we emulate read errors during
    replace/scrub, we could corrupt compressed extent data again.

    This patch fixes it by avoiding any "optimization" for nodatasum and
    falling back to the normal fixup routine, which tries to read from any
    good copy.

    This also solves the WARN_ON() or deadlock caused by the lame backref
    iteration in the scrub_fixup_nodatasum() routine.

    The deadlock or WARN_ON() could not be triggered before commit
    ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
    since copy_nocow_pages() has better locking and an extra check for data
    extents, and it already does the fixup work by trying to read data from
    any good copy, so we never went to scrub_fixup_nodatasum() anyway.

    This patch disables the faulty code, which will be removed completely in
    a followup patch.

    Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

15 Jul, 2018

4 commits

  • ReiserFS prepares log messages into a 1024-byte buffer with no bounds
    checks. Long messages, such as the "unknown mount option" warning when
    userspace passes a crafted mount options string, overflow this buffer.
    This causes KASAN to report a global-out-of-bounds write.

    Fix it by truncating messages to the buffer size.

    Link: http://lkml.kernel.org/r/20180707203621.30922-1-ebiggers3@gmail.com
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: syzbot+b890b3335a4d8c608963@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
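
    Truncating a formatted message to a fixed buffer is what vsnprintf() does
    in standard C. A minimal userspace sketch, with a hypothetical buffer and
    function name and the 1024-byte size taken from the description:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char sketch_msg_buf[1024];

/* Format into the fixed buffer; returns the length actually stored. */
static size_t sketch_prepare_message(const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	/* Writes at most sizeof(sketch_msg_buf) bytes, terminator included. */
	vsnprintf(sketch_msg_buf, sizeof(sketch_msg_buf), fmt, args);
	va_end(args);
	return strlen(sketch_msg_buf);
}
```

    A 4000-character mount option is cut to 1023 characters plus the
    terminator instead of running past the end of the buffer.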
     
  • The current code does not make sure to page align bss before calling
    vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
    the requested length not being correctly aligned.

    Let us make sure to align it properly.

    Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
    for libc5.

    Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
    Tested-by: Tetsuo Handa
    Acked-by: Kees Cook
    Cc: Michal Hocko
    Cc: Nicolas Pitre
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
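
    The alignment arithmetic can be sketched in userspace C. The page size,
    the function names and the way the range is computed are assumptions for
    illustration; the point is only that aligning both ends first yields a
    length that is a whole number of pages.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the example */

/* Round addr up to the next page boundary. */
static unsigned long sketch_page_align(unsigned long addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Compute the bss range to map. Both ends are page aligned before the
 * subtraction. Stores the start in *start and returns the length.
 */
static unsigned long sketch_bss_range(unsigned long file_end,
				      unsigned long mem_end,
				      unsigned long *start)
{
	unsigned long aligned_start = sketch_page_align(file_end);
	unsigned long aligned_end = sketch_page_align(mem_end);

	*start = aligned_start;
	return aligned_end > aligned_start ? aligned_end - aligned_start : 0;
}
```

    For a file end of 0x1234 and a memory end of 0x5678 this gives a start of
    0x2000 and a length of 0x4000.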
     
  • The autofs subsystem does not check that the "path" parameter is present
    for all cases where it is required when it is passed in via the "param"
    struct.

    In particular it isn't checked for the AUTOFS_DEV_IOCTL_OPENMOUNT_CMD
    ioctl command.

    To solve it, modify the validate_dev_ioctl() function to check that a path has
    been provided for ioctl commands that require it.

    Link: http://lkml.kernel.org/r/153060031527.26631.18306637892746301555.stgit@pluto.themaw.net
    Signed-off-by: Tomas Bortoli
    Signed-off-by: Ian Kent
    Reported-by: syzbot+60c837b428dc84e83a93@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Bortoli
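
    A rough model of the added validation: a command that needs a path is
    rejected when the request carries nothing beyond the fixed header. The
    command enum, header size and function names are hypothetical and do not
    reflect the real autofs values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command set and header size; not the real autofs values. */
enum sketch_cmd { SKETCH_CMD_VERSION, SKETCH_CMD_OPENMOUNT };
#define SKETCH_HEADER_SIZE 24u

static bool sketch_cmd_needs_path(enum sketch_cmd cmd)
{
	return cmd == SKETCH_CMD_OPENMOUNT;
}

/* A path is present only if the request is larger than the fixed header. */
static int sketch_validate_request(enum sketch_cmd cmd, size_t total_size)
{
	bool has_path = total_size > SKETCH_HEADER_SIZE;

	if (total_size < SKETCH_HEADER_SIZE)
		return -EINVAL;	/* truncated header */
	if (!has_path && sketch_cmd_needs_path(cmd))
		return -EINVAL;	/* required path missing */
	return 0;
}
```

    Doing the check once in the common validation step means individual
    command handlers can assume the path is there.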
     
  • Thomas reports:
    "While looking around in /proc on my v4.14.52 system I noticed that all
    processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
    memory than a regular user can usually lock with mlock().

    Commit 493b0e9d945f (in v4.14-rc1) seems to have changed the behavior
    of "Locked".

    Before that commit the code was like this. Notice the VM_LOCKED check.

    (vma->vm_flags & VM_LOCKED) ?
    (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

    After that commit Locked is now the same as Pss:

    (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));

    This looks like a mistake."

    Indeed, the commit has added mss->pss_locked with the correct value that
    depends on VM_LOCKED, but forgot to actually use it. Fix it.

    Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Vlastimil Babka
    Reported-by: Thomas Lindroth
    Cc: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
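
    The accounting difference between "Pss" and "Locked" can be modelled as
    two counters, one of which only grows for locked mappings. The struct,
    the flag value and the helpers are hypothetical, and the PSS_SHIFT
    scaling from the quoted snippet is left out for brevity.

```c
#include <assert.h>

#define SKETCH_VM_LOCKED 0x2000UL	/* hypothetical flag bit for the sketch */

struct sketch_mss {
	unsigned long pss;		/* bytes, all mappings */
	unsigned long pss_locked;	/* bytes, locked mappings only */
};

/* Add one mapping's share; the locked total depends on the flag. */
static void sketch_account(struct sketch_mss *mss, unsigned long vm_flags,
			   unsigned long pss_bytes)
{
	mss->pss += pss_bytes;
	if (vm_flags & SKETCH_VM_LOCKED)
		mss->pss_locked += pss_bytes;
}

static unsigned long sketch_pss_kb(const struct sketch_mss *mss)
{
	return mss->pss >> 10;
}

/* The reported value must come from pss_locked, not from pss. */
static unsigned long sketch_locked_kb(const struct sketch_mss *mss)
{
	return mss->pss_locked >> 10;
}
```

    With one 8 KiB unlocked mapping and one 4 KiB locked mapping, the two
    reported values differ, which is the behaviour the fix restores.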
     

13 Jul, 2018

3 commits

  • btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves
    their page address intact. Now, if you hit "goto again" in
    btrfs_extent_same_range() and hit some error in
    btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages.

    This is a simple fix to reset the address to avoid use-after-free.

    Fixes: 67b07bd4bec5 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
    Signed-off-by: Naohiro Aota
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Naohiro Aota
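
    Resetting each stored address as it is released is what makes a second
    cleanup pass safe. A userspace sketch with hypothetical names, using
    malloc/free in place of page references:

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NUM_PAGES 2

/* Callers are expected to start with every slot set to NULL. */
struct sketch_cmp { void *pages[SKETCH_NUM_PAGES]; };

/* Fill the workspace; returns 0 on success, -1 if an allocation fails. */
static int sketch_cmp_prepare(struct sketch_cmp *cmp)
{
	int i;

	for (i = 0; i < SKETCH_NUM_PAGES; i++) {
		cmp->pages[i] = malloc(64);
		if (cmp->pages[i] == NULL)
			return -1;
	}
	return 0;
}

/* Release every page and forget its address so a repeat call is harmless. */
static void sketch_cmp_free(struct sketch_cmp *cmp)
{
	int i;

	for (i = 0; i < SKETCH_NUM_PAGES; i++) {
		free(cmp->pages[i]);	/* free(NULL) is a no-op */
		cmp->pages[i] = NULL;
	}
}
```

    If a retry fails partway through preparation, the error path can call the
    free helper again without touching pages that were already released.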
     
  • Commit 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by
    device_list_mutex in btrfs_open_devices") switched to device_list_mutex
    as we need that for the device list traversal, but we also need
    uuid_mutex to protect access to fs_devices::opened to be consistent with
    other users of that.

    Fixes: 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices")
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • With commit 044e6e3d74a3: "ext4: don't update checksum of new
    initialized bitmaps" the buffer valid bit will get set without
    actually setting up the checksum for the allocation bitmap, since the
    checksum will get calculated once we actually allocate an inode or
    block.

    If we are doing this, then we need to (re-)check the verified bit
    after we take the block group lock. Otherwise, we could race with
    another process reading and verifying the bitmap, which would then
    complain about the checksum being invalid.

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Theodore Ts'o
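
    The re-check described in the ext4 bitmap commit above follows a
    familiar shape: test a flag without the lock, take the lock, then test
    the flag again before doing the expensive check. The sketch below is a
    userspace analogy only. The struct, toy_csum() and validate_bitmap()
    are hypothetical, a pthread mutex stands in for the block group lock,
    and an atomic flag stands in for the buffer's verified bit.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a bitmap buffer and its block group lock. */
struct bitmap_buf {
    pthread_mutex_t group_lock;  /* stands in for the block group lock */
    atomic_bool verified;        /* stands in for the verified bit */
    uint32_t data;
    uint32_t stored_csum;
};

/* Toy checksum, purely illustrative. */
static uint32_t toy_csum(uint32_t data)
{
    return data ^ 0xA5A5A5A5u;
}

/* Returns 0 if the bitmap is usable, -1 on a checksum mismatch. */
static int validate_bitmap(struct bitmap_buf *b)
{
    int ret = 0;

    if (atomic_load(&b->verified))   /* lockless fast path */
        return 0;

    pthread_mutex_lock(&b->group_lock);
    /*
     * Re-check under the lock: the bit may have been set while we were
     * waiting, in which case the stored checksum may not be set up yet
     * and must not be compared.
     */
    if (atomic_load(&b->verified))
        goto out;
    if (toy_csum(b->data) != b->stored_csum) {
        ret = -1;
        goto out;
    }
    atomic_store(&b->verified, true);
out:
    pthread_mutex_unlock(&b->group_lock);
    return ret;
}
```

    A buffer that is already marked verified is accepted even if its stored
    checksum is stale, which mirrors the freshly initialized bitmap case in
    the commit message.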
     

11 Jul, 2018

1 commit


10 Jul, 2018

1 commit

  • The inline data code was updating the raw inode directly; this is
    problematic since if metadata checksums are enabled,
    ext4_mark_inode_dirty() must be called to update the inode's checksum.
    In addition, the jbd2 layer requires that get_write_access() be called
    before the metadata buffer is modified. Fix both of these problems.

    https://bugzilla.kernel.org/show_bug.cgi?id=200443

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

09 Jul, 2018

3 commits

  • Previously, when an MMP-protected file system was remounted read-only,
    the kmmpd thread would exit the next time it woke up (a few seconds
    later), without resetting the MMP sequence number back to
    EXT4_MMP_SEQ_CLEAN.

    Fix this by explicitly killing the MMP thread when the file system is
    remounted read-only.

    Signed-off-by: Theodore Ts'o
    Cc: Andreas Dilger

    Theodore Ts'o
     
  • ext4_check_descriptors() was getting called before s_gdb_count was
    initialized. So for file systems without the meta_bg feature,
    allocation bitmaps could overlap the block group descriptors and ext4
    wouldn't notice.

    For file systems with the meta_bg feature enabled, there was a
    fencepost error which would cause ext4_check_descriptors() to
    incorrectly believe that the block allocation bitmap overlaps with the
    block group descriptor blocks, and it would reject the mount.

    Fix both of these problems.

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • Pull ext4 bugfixes from Ted Ts'o:
    "Bug fixes for ext4; most of which relate to vulnerabilities where a
    maliciously crafted file system image can result in a kernel OOPS or
    hang.

    At least one fix addresses an inline data bug that could be triggered
    by userspace without the need of a crafted file system (although it
    does require that the inline data feature be enabled)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: check superblock mapped prior to committing
    ext4: add more mount time checks of the superblock
    ext4: add more inode number paranoia checks
    ext4: avoid running out of journal credits when appending to an inline file
    jbd2: don't mark block as modified if the handle is out of credits
    ext4: never move the system.data xattr out of the inode body
    ext4: clear i_data in ext4_inode_info when removing inline data
    ext4: include the illegal physical block in the bad map ext4_error msg
    ext4: verify the depth of extent tree in ext4_find_extent()
    ext4: only look at the bg_flags field if it is valid
    ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
    ext4: always check block group bounds in ext4_init_block_bitmap()
    ext4: always verify the magic number in xattr blocks
    ext4: add corruption check in ext4_xattr_set_entry()
    ext4: add warn_on_error mount option

    Linus Torvalds
     

08 Jul, 2018

1 commit

  • Pull cifs fixes from Steve French:
    "Five smb3/cifs fixes for stable (including for some leaks and memory
    overwrites) and also a few fixes for recent regressions in packet
    signing.

    Additional testing at the recent SMB3 test event, and some good work
    by Paulo and others spotted the issues fixed here. In addition to my
    xfstest runs on these, Aurelien and Stefano did additional test runs
    to verify this set"

    * tag '4.18-rc3-smb3fixes' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Fix stack out-of-bounds in smb{2,3}_create_lease_buf()
    cifs: Fix infinite loop when using hard mount option
    cifs: Fix slab-out-of-bounds in send_set_info() on SMB2 ACE setting
    cifs: Fix memory leak in smb2_set_ea()
    cifs: fix SMB1 breakage
    cifs: Fix validation of signed data in smb2
    cifs: Fix validation of signed data in smb3+
    cifs: Fix use after free of a mid_q_entry

    Linus Torvalds
     

06 Jul, 2018

2 commits

  • sgid directories have special semantics, making newly created files in
    the directory belong to the group of the directory, and newly created
    subdirectories will also become sgid. This is historically used for
    group-shared directories.

    But group directories writable by non-group members should not imply
    that such non-group members can magically join the group, so make sure
    to clear the sgid bit on non-directories for non-members (but remember
    that sgid without group execute means "mandatory locking", just to
    confuse things even more).

    Reported-by: Jann Horn
    Cc: Andy Lutomirski
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • smb{2,3}_create_lease_buf() store a lease key in the lease
    context for later usage on a lease break.

    In most paths, the key is currently sourced from data that
    happens to be on the stack near local variables for oplock in
    SMB2_open() callers, e.g. from open_shroot(), whereas
    smb2_open_file() properly allocates space on its stack for it.

    The address of those local variables holding the oplock is then
    passed to create_lease_buf handlers via SMB2_open(), and 16
    bytes near oplock are used. This causes a stack out-of-bounds
    access as reported by KASAN on SMB2.1 and SMB3 mounts (first
    out-of-bounds access is shown here):

    [ 111.528823] BUG: KASAN: stack-out-of-bounds in smb3_create_lease_buf+0x399/0x3b0 [cifs]
    [ 111.530815] Read of size 8 at addr ffff88010829f249 by task mount.cifs/985
    [ 111.532838] CPU: 3 PID: 985 Comm: mount.cifs Not tainted 4.18.0-rc3+ #91
    [ 111.534656] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 111.536838] Call Trace:
    [ 111.537528] dump_stack+0xc2/0x16b
    [ 111.540890] print_address_description+0x6a/0x270
    [ 111.542185] kasan_report+0x258/0x380
    [ 111.544701] smb3_create_lease_buf+0x399/0x3b0 [cifs]
    [ 111.546134] SMB2_open+0x1ef8/0x4b70 [cifs]
    [ 111.575883] open_shroot+0x339/0x550 [cifs]
    [ 111.591969] smb3_qfs_tcon+0x32c/0x1e60 [cifs]
    [ 111.617405] cifs_mount+0x4f3/0x2fc0 [cifs]
    [ 111.674332] cifs_smb3_do_mount+0x263/0xf10 [cifs]
    [ 111.677915] mount_fs+0x55/0x2b0
    [ 111.679504] vfs_kern_mount.part.22+0xaa/0x430
    [ 111.684511] do_mount+0xc40/0x2660
    [ 111.698301] ksys_mount+0x80/0xd0
    [ 111.701541] do_syscall_64+0x14e/0x4b0
    [ 111.711807] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 111.713665] RIP: 0033:0x7f372385b5fa
    [ 111.715311] Code: 48 8b 0d 99 78 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 66 78 2c 00 f7 d8 64 89 01 48
    [ 111.720330] RSP: 002b:00007ffff27049d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
    [ 111.722601] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f372385b5fa
    [ 111.724842] RDX: 000055c2ecdc73b2 RSI: 000055c2ecdc73f9 RDI: 00007ffff270580f
    [ 111.727083] RBP: 00007ffff2705804 R08: 000055c2ee976060 R09: 0000000000001000
    [ 111.729319] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f3723f4d000
    [ 111.731615] R13: 000055c2ee976060 R14: 00007f3723f4f90f R15: 0000000000000000

    [ 111.735448] The buggy address belongs to the page:
    [ 111.737420] page:ffffea000420a7c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
    [ 111.739890] flags: 0x17ffffc0000000()
    [ 111.741750] raw: 0017ffffc0000000 0000000000000000 dead000000000200 0000000000000000
    [ 111.744216] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    [ 111.746679] page dumped because: kasan: bad access detected

    [ 111.750482] Memory state around the buggy address:
    [ 111.752562] ffff88010829f100: 00 f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00
    [ 111.754991] ffff88010829f180: 00 00 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
    [ 111.757401] >ffff88010829f200: 00 00 00 00 00 f1 f1 f1 f1 01 f2 f2 f2 f2 f2 f2
    [ 111.759801] ^
    [ 111.762034] ffff88010829f280: f2 02 f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00
    [ 111.764486] ffff88010829f300: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 111.766913] ==================================================================

    Lease keys are, however, already generated and stored in fid data
    on open and create paths: pass them down to the lease context
    creation handlers and use them.

    Suggested-by: Aurélien Aptel
    Reviewed-by: Aurelien Aptel
    Fixes: b8c32dbb0deb ("CIFS: Request SMB2.1 leases")
    Signed-off-by: Stefano Brivio
    Signed-off-by: Steve French

    Stefano Brivio