07 Jul, 2017

11 commits


01 Jul, 2017

1 commit


30 Jun, 2017

1 commit

  • Pull block fixes from Jens Axboe:
    "Two fixes that should go into this release.

    One is an nvme regression fix from Keith, fixing a missing queue
    freeze if the controller is being reset. This causes the reset to
    hang.

    The other is a fix for a leak of the bio protection info, if smaller
    sized O_DIRECT is used. This fix should be more involved as we have
    other problematic paths in the kernel, but given as this isn't a
    regression in this series, we'll tackle those for 4.13"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: provide bio_uninit() free freeing integrity/task associations
    nvme/pci: Fix stuck nvme reset

    Linus Torvalds
     

29 Jun, 2017

2 commits

  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme device + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    code doesn't free memory allocated by bio_integrity_alloc. The patch
    fixes the issue. HTX has been run with +60 hours without failure."
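
    The figures quoted in the report can be cross-checked from the first
    and last SUnreclaim samples. A quick sketch, assuming those two samples
    span the one hour mentioned in the report:

```python
# First and last SUnreclaim samples quoted in the report, in kB.
first_kb = 6752128
last_kb = 22720256

# Assumption: the two samples are one hour apart, as the report suggests.
minutes = 60

growth_kb = last_kb - first_kb
growth_gib = growth_kb / (1024 * 1024)
rate_mib_per_min = growth_kb / 1024 / minutes

print(f"growth: {growth_kb} kB (~{growth_gib:.1f} GiB)")
print(f"rate:   ~{rate_mib_per_min:.0f} MiB per minute")
```

    The growth matches the 15968128 kB figure in the report, and the rate
    comes out near 260 MiB per minute, in line with the rough "256 MB /
    minute" estimate.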

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bio's that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.
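
    A toy model of the leak and of the minimal fix described above. This is
    not kernel code: the class and function names only mirror the message,
    and the counter is a hypothetical stand-in for the integrity
    allocations.

```python
# Toy model only; every name here is a hypothetical stand-in.
outstanding_integrity = 0  # live "integrity payload" allocations


class Bio:
    def __init__(self):
        self.integrity = None

    def attach_integrity(self):
        """Model of the payload a DIF/DIX device attaches to the bio."""
        global outstanding_integrity
        self.integrity = object()
        outstanding_integrity += 1


def bio_uninit(bio):
    """Release ancillary data hanging off the bio, not the bio itself."""
    global outstanding_integrity
    if bio.integrity is not None:
        bio.integrity = None
        outstanding_integrity -= 1


def direct_io_simple(call_uninit):
    bio = Bio()             # stands in for the on-stack bio
    bio.attach_integrity()  # ancillary data allocated separately
    # ... the I/O itself would happen here ...
    if call_uninit:
        bio_uninit(bio)
    # The bio now goes out of scope without the regular bio free path.


for _ in range(1000):
    direct_io_simple(call_uninit=False)
leaked_without_fix = outstanding_integrity

outstanding_integrity = 0
for _ in range(1000):
    direct_io_simple(call_uninit=True)
leaked_with_fix = outstanding_integrity

print("leaked without bio_uninit():", leaked_without_fix)
print("leaked with bio_uninit():   ", leaked_with_fix)
```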

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Bugfixes include:

    - stable fix for exclusive create if the server supports the umask
    attribute

    - trunking detection should handle ERESTARTSYS/EINTR

    - stable fix for a race in the LAYOUTGET function

    - stable fix to revert "nfs_rename() handle -ERESTARTSYS dentry left
    behind"

    - nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()"

    * tag 'nfs-for-4.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4.1: nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()
    Revert "NFS: nfs_rename() handle -ERESTARTSYS dentry left behind"
    NFSv4.1: Fix a race in nfs4_proc_layoutget
    NFS: Trunking detection should handle ERESTARTSYS/EINTR
    NFSv4.2: Don't send mode again in post-EXCLUSIVE4_1 SETATTR with umask

    Linus Torvalds
     

28 Jun, 2017

6 commits

  • When copying up a file that has multiple hard links we need to break any
    association with the origin file. This makes copy-up be essentially an
    atomic replace.

    The new file has nothing to do with the old one (except having the same
    data and metadata initially), so don't set the overlay.origin attribute.

    We can relax this in the future when we are able to index upper object by
    origin.

    Signed-off-by: Miklos Szeredi
    Fixes: 3a1e819b4e80 ("ovl: store file handle of lower inode on copy up")

    Miklos Szeredi
     
  • Nothing prevents mischief on upper layer while we are busy copying up the
    data.

    Move the lookup right before the looked up dentry is actually used.

    Signed-off-by: Miklos Szeredi
    Fixes: 01ad3eb8a073 ("ovl: concurrent copy up of regular files")
    Cc: # v4.11

    Miklos Szeredi
     
  • The current code works only for the case where we have exactly one slot,
    which is no longer true.
    nfs4_free_slot() will automatically declare the callback channel to be
    drained when all slots have been returned.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This reverts commit 920b4530fb80430ff30ef83efe21ba1fa5623731 which could
    call d_move() without holding the directory's i_mutex, and reverts commit
    d4ea7e3c5c0e341c15b073016dbf3ab6c65f12f3 "NFS: Fix old dentry rehash after
    move", which was a follow-up fix.

    Signed-off-by: Benjamin Coddington
    Fixes: 920b4530fb80 ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
    Cc: stable@vger.kernel.org # v4.10+
    Reviewed-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Benjamin Coddington
     
  • If the task calling layoutget is signalled, then it is possible for the
    calls to nfs4_sequence_free_slot() and nfs4_layoutget_prepare() to race,
    in which case we leak a slot.
    The fix is to move the call to nfs4_sequence_free_slot() into the
    nfs4_layoutget_release() so that it gets called at task teardown time.

    Fixes: 2e80dbe7ac51 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Currently, trunking detection will return EIO in those cases (ERESTARTSYS/EINTR).

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

24 Jun, 2017

6 commits

  • Merge misc fixes from Andrew Morton:
    "8 fixes"

    * emailed patches from Andrew Morton:
    fs/exec.c: account for argv/envp pointers
    ocfs2: fix deadlock caused by recursive locking in xattr
    slub: make sysfs file removal asynchronous
    lib/cmdline.c: fix get_options() overflow while parsing ranges
    fs/dax.c: fix inefficiency in dax_writeback_mapping_range()
    autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL
    mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings
    mm, thp: remove cond_resched from __collapse_huge_page_copy

    Linus Torvalds
     
  • When limiting the argv/envp strings during exec to 1/4 of the stack limit,
    the storage of the pointers to the strings was not included. This means
    that an exec with huge numbers of tiny strings could eat 1/4 of the stack
    limit in strings and then additional space would be later used by the
    pointers to the strings.

    For example, on 32-bit with an 8MB stack rlimit, an exec with 1677721
    single-byte strings would consume less than 2MB of stack, the max (8MB /
    4) amount allowed, but the pointers to the strings would consume the
    remaining additional stack space (1677721 * 4 == 6710884).

    The result (1677721 + 6710884 == 8388605) would exhaust stack space
    entirely. Controlling this stack exhaustion could result in
    pathological behavior in setuid binaries (CVE-2017-1000365).
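
    The arithmetic in the example above can be replayed directly. The
    sketch below uses only the figures from the message; the helper
    within_limit() and its count_pointers flag are hypothetical and only
    illustrate the idea of charging the pointers against the same 1/4
    limit, not the actual fs/exec.c change.

```python
STACK_RLIMIT = 8 * 1024 * 1024   # 8MB stack rlimit
POINTER_SIZE = 4                 # bytes per pointer on 32-bit
N_STRINGS = 1677721              # single-byte strings, as in the example

STRING_LIMIT = STACK_RLIMIT // 4  # strings may use 1/4 of the stack limit


def within_limit(n_strings, bytes_per_string, count_pointers):
    """Hypothetical check: does this exec fit in 1/4 of the stack rlimit?"""
    used = n_strings * bytes_per_string
    if count_pointers:
        used += n_strings * POINTER_SIZE
    return used <= STRING_LIMIT


string_bytes = N_STRINGS * 1
pointer_bytes = N_STRINGS * POINTER_SIZE
total = string_bytes + pointer_bytes

print("strings: ", string_bytes, "of", STRING_LIMIT, "allowed")
print("pointers:", pointer_bytes)
print("total:   ", total, "of", STACK_RLIMIT)
print("passes without pointer accounting:", within_limit(N_STRINGS, 1, False))
print("passes with pointer accounting:   ", within_limit(N_STRINGS, 1, True))
```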

    [akpm@linux-foundation.org: additional commenting from Kees]
    Fixes: b6a2fea39318 ("mm: variable length argument support")
    Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Alexander Viro
    Cc: Qualys Security Advisory
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Another deadlock path caused by recursive locking is reported. This
    kind of issue was introduced since commit 743b5f1434f5 ("ocfs2: take
    inode lock in ocfs2_iop_set/get_acl()"). Two deadlock paths have been
    fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
    inode lock at vfs entry points"). Yes, we intend to fix this kind of
    case in an incremental way, because it's hard to find all possible
    paths at once.

    This one can be reproduced like this. On node1, cp a large file from
    home directory to ocfs2 mountpoint. While on node2, run
    setfacl/getfacl. Both nodes will hang up there. The backtraces:

    On node1:
    __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
    ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
    ocfs2_write_begin+0x43/0x1a0 [ocfs2]
    generic_perform_write+0xa9/0x180
    __generic_file_write_iter+0x1aa/0x1d0
    ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
    __vfs_write+0xc3/0x130
    vfs_write+0xb1/0x1a0
    SyS_write+0x46/0xa0

    On node2:
    __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
    ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
    ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
    ocfs2_set_acl+0x22d/0x260 [ocfs2]
    ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
    set_posix_acl+0x75/0xb0
    posix_acl_xattr_set+0x49/0xa0
    __vfs_setxattr+0x69/0x80
    __vfs_setxattr_noperm+0x72/0x1a0
    vfs_setxattr+0xa7/0xb0
    setxattr+0x12d/0x190
    path_setxattr+0x9f/0xb0
    SyS_setxattr+0x14/0x20

    Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
    exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
    to avoid recursive cluster lock").

    Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
    Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
    Signed-off-by: Eric Ren
    Reported-by: Thomas Voegtle
    Tested-by: Thomas Voegtle
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Ren
     
  • dax_writeback_mapping_range() fails to update iteration index when
    searching radix tree for entries needing cache flushing. Thus each
    pagevec worth of entries is searched starting from the start which is
    inefficient and prone to livelocks. Update index properly.
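
    A toy model of the batched search, assuming a sorted list of entry
    indices in place of the radix tree. PAGEVEC_SIZE, lookup() and
    writeback_range() are hypothetical names for illustration; the point is
    only the line that advances the index past the last entry of each
    batch.

```python
from bisect import bisect_left

PAGEVEC_SIZE = 3  # hypothetical batch size for the toy model


def lookup(entries, start, end, nr):
    """Return up to nr entries with start <= index <= end (entries sorted)."""
    lo = bisect_left(entries, start)
    return [i for i in entries[lo:lo + nr] if i <= end]


def writeback_range(entries, start, end):
    """Visit every entry in [start, end] one batch at a time."""
    flushed = []
    index = start
    while index <= end:
        batch = lookup(entries, index, end, PAGEVEC_SIZE)
        if not batch:
            break
        flushed.extend(batch)
        # Resume after the last entry seen. Without this update every
        # lookup starts from the same index and returns the same batch.
        index = batch[-1] + 1
    return flushed


entries = [1, 2, 5, 8, 9, 12, 20]
print(writeback_range(entries, 0, 15))

# Two lookups from an unchanged index return the same batch again.
print(lookup(entries, 0, 15, PAGEVEC_SIZE) == lookup(entries, 0, 15, PAGEVEC_SIZE))
```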

    Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
    Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • If a positive status is passed with the AUTOFS_DEV_IOCTL_FAIL ioctl,
    autofs4_d_automount() will return

    ERR_PTR(status)

    with that status to follow_automount(), which will then dereference an
    invalid pointer.

    So treat a positive status the same as zero, and map to ENOENT.
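
    Why a positive status is dangerous can be seen in a small model of the
    error-pointer convention. This is a sketch that assumes 64-bit pointers
    and a MAX_ERRNO of 4095; err_ptr(), is_err() and sanitize_status() are
    Python stand-ins written for illustration, not the kernel helpers
    themselves.

```python
MAX_ERRNO = 4095
WORD_MASK = (1 << 64) - 1  # assumption: 64-bit pointers
ENOENT = 2


def err_ptr(error):
    """Model of ERR_PTR(): reinterpret a signed error value as a pointer."""
    return error & WORD_MASK


def is_err(ptr):
    """Model of IS_ERR(): true only for the top MAX_ERRNO addresses."""
    return ptr >= ((-MAX_ERRNO) & WORD_MASK)


def sanitize_status(status):
    """Treat a positive status like zero and map both to -ENOENT."""
    return status if status < 0 else -ENOENT


for status in (-ENOENT, 0, 5):
    raw = err_ptr(status)
    fixed = err_ptr(sanitize_status(status))
    print(f"status {status:3d}: raw looks like error={is_err(raw)}, "
          f"sanitized looks like error={is_err(fixed)}")
```

    With a positive status the raw value is a small, non-error "pointer"
    that a caller would go on to dereference. After sanitizing, it is
    recognized as an error.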

    See comment in systemd src/core/automount.c::automount_send_ready().

    Link: http://lkml.kernel.org/r/871sqwczx5.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Cc: Ian Kent
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Pull xfs fixes from Darrick Wong:
    "I have one more bugfix for you for 4.12-rc7 to fix a disk corruption
    problem:

    - don't allow swapon on files on the realtime device, because the
    swap code will swap pages out to blocks on the data device, thereby
    corrupting the filesystem"

    * tag 'xfs-4.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: don't allow bmap on rt files

    Linus Torvalds
     

23 Jun, 2017

1 commit

  • Pull cifs fixes from Steve French:
    "Various small fixes for stable"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    CIFS: Fix some return values in case of error in 'crypt_message'
    cifs: remove redundant return in cifs_creation_time_get
    CIFS: Improve readdir verbosity
    CIFS: check if pages is null rather than bv for a failed allocation
    CIFS: Set ->should_dirty in cifs_user_readv()

    Linus Torvalds
     

22 Jun, 2017

2 commits

  • bmap returns a dumb LBA address but not the block device that goes with
    that LBA. Swapfiles don't care about this and will blindly assume that
    the data volume is the correct blockdev, which is totally bogus for
    files on the rt subvolume. This results in the swap code doing IOs to
    arbitrary locations on the data device(!) if the passed in mapping is a
    realtime file, so just turn off bmap for rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Pull more ufs fixes from Al Viro:
    "More UFS fixes, unfortunately including build regression fix for the
    64-bit s_dsize commit. Fixed in this pile:

    - trivial bug in signedness of 32bit timestamps on ufs1

    - ESTALE instead of ufs_error() when doing open-by-fhandle on
    something deleted

    - build regression on 32bit in ufs_new_fragments() - calculating that
    many percents of u64 pulls libgcc stuff on some of those. Mea
    culpa.

    - fix hysteresis loop broken by typo in 2.4.14.7 (right next to the
    location of previous bug).

    - fix the insane limits of said hysteresis loop on filesystems with
    very low percentage of reserved blocks. If it's 5% or less, just
    use the OPTSPACE policy.

    - calculate those limits once at mount time.

    This tree does pass xfstests clean (both ufs1 and ufs2) and it _does_
    survive cross-builds.

    Again, my apologies for missing that, especially since I have noticed
    a related percentage-of-64bit issue in earlier patches (when dealing
    with amount of reserved blocks). Self-LART applied..."

    * 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ufs: fix the logics for tail relocation
    ufs_iget(): fail with -ESTALE on deleted inode
    fix signedness of timestamps on ufs1

    Linus Torvalds
     

21 Jun, 2017

5 commits


19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).
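
    How a gap given in page units translates into bytes can be sketched as
    below. The default of 256 pages is an assumption derived from "1MB on
    systems with 4k pages"; gap_bytes() is a hypothetical helper, not a
    kernel function.

```python
DEFAULT_GAP_PAGES = 256  # assumption: 1MB / 4k pages


def gap_bytes(page_size, gap_pages=DEFAULT_GAP_PAGES):
    """Size of the stack guard gap in bytes for a given base page size."""
    return gap_pages * page_size


for page_size in (4096, 16384, 65536):
    mib = gap_bytes(page_size) // (1024 * 1024)
    print(f"{page_size // 1024:2d}k pages -> {mib} MiB gap")

# A gap passed in page units, e.g. 512 pages on a 4k-page system.
print(gap_bytes(4096, gap_pages=512))
```

    Because the gap scales with the page size, systems with larger base
    pages get a proportionally larger gap from the same page count.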

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
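
    The idea of keeping the gap between vmas rather than inside the stack
    vma can be sketched with a small model. The Vma class, the grows_down
    field and fits_below() are hypothetical simplifications; only the name
    vm_start_gap() comes from the message, and the 1MB gap assumes 4k
    pages.

```python
from dataclasses import dataclass

STACK_GUARD_GAP = 256 * 4096  # assumption: 1MB gap with 4k pages


@dataclass
class Vma:
    vm_start: int
    vm_end: int
    grows_down: bool = False  # stands in for a grows-down stack flag


def vm_start_gap(vma):
    """Start of the vma, lowered by the guard gap for a grows-down stack."""
    if not vma.grows_down:
        return vma.vm_start
    return max(vma.vm_start - STACK_GUARD_GAP, 0)  # clamp at address 0


def fits_below(candidate_end, next_vma):
    """True if a mapping ending at candidate_end stays out of the gap."""
    return candidate_end <= vm_start_gap(next_vma)


stack = Vma(vm_start=0x7FFF_0000_0000, vm_end=0x7FFF_0002_0000, grows_down=True)
heap_like = Vma(vm_start=0x1000_0000, vm_end=0x1001_0000)

print(hex(vm_start_gap(stack)))      # stack start minus the gap
print(hex(vm_start_gap(heap_like)))  # unchanged for ordinary mappings
print(fits_below(stack.vm_start - 4096, stack))             # inside the gap
print(fits_below(stack.vm_start - STACK_GUARD_GAP, stack))  # just below it
```

    The stack vma itself keeps its real size, so the gap is never charged
    to limits such as RLIMIT_AS. Only placement decisions consult the
    adjusted start.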

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Jun, 2017

4 commits