07 Aug, 2017

11 commits

  • [ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ]

    Currently btrfs dio does not handle split dio writes well: if a dio
    write is split into several segments due to the lack of contiguous
    space, a large dio write like 'dd bs=1G count=1' can end up with an
    incorrect outstanding_extents counter, and endio will complain
    loudly with an assertion.

    This fixes the problem by compensating the outstanding_extents
    counter in the inode if a large dio write gets split.

    Reported-by: Anand Jain
    Tested-by: Anand Jain
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 781feef7e6befafd4d9787d1f7ada1f9ccd504e4 ]

    While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
    different inode's log_mutex while holding the current inode's log_mutex, and
    lockdep has complained about this with a possible deadlock warning.

    Fix this by using mutex_lock_nested() when processing the other inode's
    log_mutex.

    Reviewed-by: Filipe Manana
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit e321f8a801d7b4c40da8005257b05b9c2b51b072 ]

    If @block_group is not @used_bg, it'll try to get @used_bg's lock
    without dropping @block_group's lock, and lockdep has thrown a scary
    deadlock warning about it.
    Fix it by using down_read_nested().

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit e9a330c4289f2ba1ca4bf98c2b430ab165a8931b upstream.

    The per-prz spinlock should be using the dynamic initializer so that
    lockdep can correctly track it. Without this, under lockdep, we get a
    warning at boot that the lock is in non-static memory.

    Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
    Fixes: 76d5692a5803 ("pstore: Correctly initialize spinlock and flags")
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit 76d5692a58031696e282384cbd893832bc92bd76 upstream.

    The ram backend wasn't always initializing its spinlock correctly. Since
    it was coming from kzalloc memory, though, it was harmless on
    architectures that initialize unlocked spinlocks to 0 (at least x86 and
    ARM). This also fixes a possibly ignored flag setting too.

    When running under CONFIG_DEBUG_SPINLOCK, the following Oops was visible:

    [ 0.760836] persistent_ram: found existing buffer, size 29988, start 29988
    [ 0.765112] persistent_ram: found existing buffer, size 30105, start 30105
    [ 0.769435] persistent_ram: found existing buffer, size 118542, start 118542
    [ 0.785960] persistent_ram: found existing buffer, size 0, start 0
    [ 0.786098] persistent_ram: found existing buffer, size 0, start 0
    [ 0.786131] pstore: using zlib compression
    [ 0.790716] BUG: spinlock bad magic on CPU#0, swapper/0/1
    [ 0.790729] lock: 0xffffffc0d1ca9bb0, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    [ 0.790742] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc2+ #913
    [ 0.790747] Hardware name: Google Kevin (DT)
    [ 0.790750] Call trace:
    [ 0.790768] [] dump_backtrace+0x0/0x2bc
    [ 0.790780] [] show_stack+0x20/0x28
    [ 0.790794] [] dump_stack+0xa4/0xcc
    [ 0.790809] [] spin_dump+0xe0/0xf0
    [ 0.790821] [] spin_bug+0x30/0x3c
    [ 0.790834] [] do_raw_spin_lock+0x50/0x1b8
    [ 0.790846] [] _raw_spin_lock_irqsave+0x54/0x6c
    [ 0.790862] [] buffer_size_add+0x48/0xcc
    [ 0.790875] [] persistent_ram_write+0x60/0x11c
    [ 0.790888] [] ramoops_pstore_write_buf+0xd4/0x2a4
    [ 0.790900] [] pstore_console_write+0xf0/0x134
    [ 0.790912] [] console_unlock+0x48c/0x5e8
    [ 0.790923] [] register_console+0x3b0/0x4d4
    [ 0.790935] [] pstore_register+0x1a8/0x234
    [ 0.790947] [] ramoops_probe+0x6b8/0x7d4
    [ 0.790961] [] platform_drv_probe+0x7c/0xd0
    [ 0.790972] [] driver_probe_device+0x1b4/0x3bc
    [ 0.790982] [] __device_attach_driver+0xc8/0xf4
    [ 0.790996] [] bus_for_each_drv+0xb4/0xe4
    [ 0.791006] [] __device_attach+0xd0/0x158
    [ 0.791016] [] device_initial_probe+0x24/0x30
    [ 0.791026] [] bus_probe_device+0x50/0xe4
    [ 0.791038] [] device_add+0x3a4/0x76c
    [ 0.791051] [] of_device_add+0x74/0x84
    [ 0.791062] [] of_platform_device_create_pdata+0xc0/0x100
    [ 0.791073] [] of_platform_device_create+0x34/0x40
    [ 0.791086] [] of_platform_default_populate_init+0x58/0x78
    [ 0.791097] [] do_one_initcall+0x88/0x160
    [ 0.791109] [] kernel_init_freeable+0x264/0x31c
    [ 0.791123] [] kernel_init+0x18/0x11c
    [ 0.791133] [] ret_from_fork+0x10/0x50
    [ 0.793717] console [pstore-1] enabled
    [ 0.797845] pstore: Registered ramoops as persistent store backend
    [ 0.804647] ramoops: attached 0x100000@0xf7edc000, ecc: 0/0

    Fixes: 663deb47880f ("pstore: Allow prz to control need for locking")
    Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
    Reported-by: Brian Norris
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit 663deb47880f2283809669563c5a52ac7c6aef1a upstream.

    In preparation of not locking at all for certain buffers depending on if
    there's contention, make locking optional depending on the initialization
    of the prz.

    Signed-off-by: Joel Fernandes
    [kees: moved locking flag into prz instead of via caller arguments]
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     
  • commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream.

    take_dentry_name_snapshot() takes a safe snapshot of dentry name;
    if the name is a short one, it gets copied into caller-supplied
    structure, otherwise an extra reference to external name is grabbed
    (those are never modified). In either case the pointer to stable
    string is stored into the same structure.

    dentry must be held by the caller of take_dentry_name_snapshot(),
    but may be freely dropped afterwards - the snapshot will stay
    until destroyed by release_dentry_name_snapshot().

    Intended use:

        struct name_snapshot s;

        take_dentry_name_snapshot(&s, dentry);
        ...
        access s.name
        ...
        release_dentry_name_snapshot(&s);

    Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
    to pass down with event.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit b7dbcc0e433f0f61acb89ed9861ec996be4f2b38 upstream.

    nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the
    same region protected by the wait_queue's lock after checking for a
    notification from CB_NOTIFY_LOCK callback. However, after releasing that
    lock, a wakeup for that task may race in before the call to
    freezable_schedule_timeout_interruptible() and set TASK_WAKING, then
    freezable_schedule_timeout_interruptible() will set the state back to
    TASK_INTERRUPTIBLE before the task sleeps. The result is that the
    task will sleep for the entire duration of the timeout.

    Since we've already set TASK_INTERRUPTIBLE in the locked section, just use
    freezable_schedule_timeout() instead.

    Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks")
    Signed-off-by: Benjamin Coddington
    Reviewed-by: Jeff Layton
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Coddington
     
  • commit 442ce0499c0535f8972b68fa1fda357357a5c953 upstream.

    Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open
    for writing"), NFS would revalidate, or invalidate, the file size when
    taking a lock. Since that commit it only invalidates the file content.

    If the file size is changed on the server while waiting for the lock,
    the client will have an incorrect understanding of the file size and
    could corrupt data. This particularly happens when writing beyond the
    (supposed) end of file and can easily be demonstrated with
    posix_fallocate().

    If an application opens an empty file, waits for a write lock, and then
    calls posix_fallocate(), glibc will determine that the underlying
    filesystem doesn't support fallocate (assuming version 4.1 or earlier)
    and will write out a '0' byte at the end of each 4K page in the region
    being fallocated that is after the end of the file.
    NFS will (usually) detect that these writes are beyond EOF and will
    expand them to cover the whole page, and then will merge the pages.
    Consequently, NFS will write out large blocks of zeroes beyond where it
    thought EOF was. If EOF had moved, the pre-existing part of the file
    will be over-written. Locking should have protected against this,
    but it doesn't.

    This patch restores the use of nfs_zap_caches() which invalidated the
    cached attributes. When posix_fallocate() asks for the file size, the
    request will go to the server and get a correct answer.

    Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing")
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 9bcf66c72d726322441ec82962994e69157613e4 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __jfs_set_acl() into jfs_set_acl(). That way the function will not be
    called when inheriting ACLs which is what we want as it prevents SGID
    bit clearing and the mode has been properly set by posix_acl_create()
    anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: jfs-discussion@lists.sourceforge.net
    Signed-off-by: Jan Kara
    Signed-off-by: Dave Kleikamp
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 109704492ef637956265ec2eb72ae7b3b39eb6f4 upstream.

    Currently pstore has a global spinlock for all zones. Since the zones
    are independent and modify different areas of memory, there's no need
    to have a global lock, so we should use a per-zone lock as introduced
    here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
    introduced later, which splits the ftrace memory area into a single zone
    per CPU, it will eliminate the need for locking. In preparation for this,
    make the locking optional.

    Signed-off-by: Joel Fernandes
    [kees: updated commit message]
    Signed-off-by: Kees Cook
    Cc: Leo Yan
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     

28 Jul, 2017

13 commits

  • commit 6883cd7f68245e43e91e5ee583b7550abf14523f upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __reiserfs_set_acl() into reiserfs_set_acl(). That way the function will
    not be called when inheriting ACLs which is what we want as it prevents
    SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: reiserfs-devel@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 8fc646b44385ff0a9853f6590497e43049eeb311 upstream.

    On failure to prepare_creds(), mount fails with a random
    return value, as err was last set to an integer cast of
    a valid lower mnt pointer, or set to 0 if the inodes index feature
    is enabled.

    Reported-by: Dan Carpenter
    Fixes: 3fe6e52f0626 ("ovl: override creds with the ones from ...")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 84969465ddc4f8aeb3b993123b571aa01c5f2683 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by creating __hfsplus_set_posix_acl() function that does
    not call posix_acl_update_mode() and use it when inheriting ACLs. That
    prevents SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 84583cfb973c4313955c6231cc9cb3772d280b15 upstream.

    For a large directory, a program needs to issue multiple readdir
    syscalls to get all dentries. When multiple programs read the
    directory concurrently, the following sequence of events can
    happen.

    - program calls readdir with pos = 2. ceph sends readdir request
    to mds. The reply contains N1 entries. ceph adds these N1 entries
    to readdir cache.
    - program calls readdir with pos = N1+2. The readdir is satisfied
    by the readdir cache, N2 entries are returned. (Other program
    calls readdir in the middle, which fills the cache)
    - program calls readdir with pos = N1+N2+2. ceph sends readdir
    request to mds. The reply contains N3 entries and it reaches
    directory end. ceph adds these N3 entries to the readdir cache
    and marks directory complete.

    The second readdir call does not update fi->readdir_cache_idx.
    ceph adds the last N3 entries to the wrong places.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Yan, Zheng
     
  • commit f2e95355891153f66d4156bf3a142c6489cd78c6 upstream.

    udf_setsize() called truncate_setsize() with i_data_sem held. Thus
    truncate_pagecache() called from truncate_setsize() could lock a page
    under i_data_sem which can deadlock as page lock ranks below
    i_data_sem - e.g. writeback can hold the page lock and try to acquire
    i_data_sem to map a block.

    Fix the problem by moving truncate_setsize() calls from under
    i_data_sem. It is safe for us to change i_size without holding
    i_data_sem as all the places that depend on i_size being stable already
    hold inode_lock.

    Fixes: 7e49b6f2480cb9a9e7322a91592e56a5c85361f5
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit cc89684c9a265828ce061037f1f79f4a68ccd3f7 upstream.

    Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate")
    in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
    to be invalidated even if it has filesystems mounted on it or on a
    descendant. The mounted filesystem is unmounted.

    This means we need to be careful not to return 0 unless the directory
    referred to truly is invalid. So -ESTALE or -ENOENT should invalidate
    the directory. Other errors such as -EPERM or -ERESTARTSYS should be
    returned from ->d_revalidate() so they are propagated to the caller.

    A particular problem can be demonstrated by:

    1/ mount an NFS filesystem using NFSv3 on /mnt
    2/ mount any other filesystem on /mnt/foo
    3/ ls /mnt/foo
    4/ turn off network, or otherwise make the server unable to respond
    5/ ls /mnt/foo &
    6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
    7/ kill -9 $! # this results in -ERESTARTSYS being returned
    8/ observe that /mnt/foo has been unmounted.

    This patch changes nfs_lookup_revalidate() to only treat
    -ESTALE from nfs_lookup_verify_inode() and
    -ESTALE or -ENOENT from ->lookup()
    as indicating an invalid inode. Other errors are returned.

    Also nfs_check_inode_attributes() is changed to return -ESTALE rather
    than -EIO. This is consistent with the error returned in similar
    circumstances from nfs_update_inode().

    As this bug allows any user to unmount a filesystem mounted on an NFS
    filesystem, this fix is suitable for stable kernels.

    Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 4acadda74ff8b949c448c0282765ae747e088c87 upstream.

    When UBIFS prepares data structures which will be written to the MTD
    it ensures that their lengths are a multiple of 8. Since it uses
    kmalloc() the padded bytes are left uninitialized and we leak a few
    bytes of kernel memory to the MTD.
    To make sure that all bytes are initialized, let's switch to kzalloc().
    kzalloc() is fine in this case because the buffers are not huge, and in
    the IO path the performance bottleneck is the MTD anyway.

    Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
    Signed-off-by: Richard Weinberger
    Reviewed-by: Boris Brezillon
    Signed-off-by: Richard Weinberger
    Signed-off-by: Greg Kroah-Hartman

    Richard Weinberger
     
  • commit 51f8f3c4e22535933ef9aecc00e9a6069e051b57 upstream.

    If overlay was mounted by root then quota set for the upper layer does
    not work, because overlay now always uses the mounter's credentials
    for operations. Overlay might also deplete reserved space and inodes
    in ext4.

    This patch drops the capability SYS_RESOURCE from the saved credentials.
    This affects creation of new files, whiteouts, and copy-up operations.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 1175b6b8d963 ("ovl: do operations on underlying file system in mounter's context")
    Cc: Vivek Goyal
    Signed-off-by: Miklos Szeredi
    Cc: Amir Goldstein
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit c925dc162f770578ff4a65ec9b08270382dba9e6 upstream.

    This patch copies commit b7f8a09f80:
    "btrfs: Don't clear SGID when inheriting ACLs" written by Jan.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    Signed-off-by: Jan Kara
    Reviewed-by: Chao Yu
    Reviewed-by: Jan Kara
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Greg Kroah-Hartman

    Jaegeuk Kim
     
  • commit 21d3f8e1c3b7996ce239ab6fa82e9f7a8c47d84d upstream.

    Make sure the number of entries doesn't exceed the max journal size.

    Signed-off-by: Jin Qian
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Greg Kroah-Hartman

    Jin Qian
     
  • commit 8ba358756aa08414fa9e65a1a41d28304ed6fd7f upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by calling __xfs_set_acl() instead of xfs_set_acl() when
    setting up inode in xfs_generic_create(). That prevents SGID bit
    clearing and mode is properly set by posix_acl_create() anyway. We also
    reorder arguments of __xfs_set_acl() to match the ordering of
    xfs_set_acl() to make things consistent.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: Darrick J. Wong
    CC: linux-xfs@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit a992f2d38e4ce17b8c7d1f7f67b2de0eebdea069 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by creating __ext2_set_acl() function that does not call
    posix_acl_update_mode() and use it when inheriting ACLs. That prevents
    SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
    called when inheriting ACLs which is what we want as it prevents SGID
    bit clearing and the mode has been properly set by posix_acl_create()
    anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: linux-btrfs@vger.kernel.org
    CC: David Sterba
    Signed-off-by: Jan Kara
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

21 Jul, 2017

6 commits

  • commit 296990deb389c7da21c78030376ba244dc1badf5 upstream.

    Andrei Vagin pointed out that time to executue propagate_umount can go
    non-linear (and take a ludicrious amount of time) when the mount
    propogation trees of the mounts to be unmunted by a lazy unmount
    overlap.

    Make the walk of the mount propagation trees nearly linear by
    remembering which mounts have already been visited, allowing
    subsequent walks to detect when walking a mount propagation tree or a
    subtree of a mount propagation tree would be duplicate work and to skip
    them entirely.

    Walk the list of mounts whose propagation trees need to be traversed
    from the mount highest in the mount tree to mounts lower in the mount
    tree so that odds are higher that the code will walk the largest trees
    first, allowing later tree walks to be skipped entirely.

    Add cleanup_umount_visitation to remove the code's memory of which
    mounts have been visited.

    Add the functions last_slave and skip_propagation_subtree to allow
    skipping appropriate parts of the mount propagation tree without
    needing to change the logic of the rest of the code.

    A script to generate overlapping mount propagation trees:

    $ cat run.sh
    set -e
    mount -t tmpfs zdtm /mnt
    mkdir -p /mnt/1 /mnt/2
    mount -t tmpfs zdtm /mnt/1
    mount --make-shared /mnt/1
    mkdir /mnt/1/1

    iteration=10
    if [ -n "$1" ] ; then
    iteration=$1
    fi

    for i in $(seq $iteration); do
    mount --bind /mnt/1/1 /mnt/1/1
    done

    mount --rbind /mnt/1 /mnt/2

    TIMEFORMAT='%Rs'
    nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
    echo -n "umount -l /mnt/1 -> $nr "
    time umount -l /mnt/1

    nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
    time umount -l /mnt/2

    $ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done

    Here are the performance numbers with and without the patch:

         mhash |    8192 |   8192 | 1048576 | 1048576
        mounts |  before |  after |  before |   after
    -------------------------------------------------
          1025 |  0.040s | 0.016s |  0.038s |  0.019s
          2049 |  0.094s | 0.017s |  0.080s |  0.018s
          4097 |  0.243s | 0.019s |  0.206s |  0.023s
          8193 |  1.202s | 0.028s |  1.562s |  0.032s
         16385 |  9.635s | 0.036s |  9.952s |  0.041s
         32769 | 60.928s | 0.063s | 44.321s |  0.064s
         65537 |         | 0.097s |         |  0.097s
        131073 |         | 0.233s |         |  0.176s
        262145 |         | 0.653s |         |  0.344s
        524289 |         | 2.305s |         |  0.735s
       1048577 |         | 7.107s |         |  2.603s

    Andrei Vagin reports that fixing the performance problem is part of
    the work to fix CVE-2016-6213.

    Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
    Reported-by: Andrei Vagin
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 99b19d16471e9c3faa85cad38abc9cbbe04c6d55 upstream.

    While investigating some poor umount performance I realized that in
    the case of overlapping mount trees where some of the mounts are locked
    the code has been failing to unmount all of the mounts it should
    have been unmounting.

    This failure to unmount all of the necessary
    mounts can be reproduced with:

    $ cat locked_mounts_test.sh

    mount -t tmpfs test-base /mnt
    mount --make-shared /mnt
    mkdir -p /mnt/b

    mount -t tmpfs test1 /mnt/b
    mount --make-shared /mnt/b
    mkdir -p /mnt/b/10

    mount -t tmpfs test2 /mnt/b/10
    mount --make-shared /mnt/b/10
    mkdir -p /mnt/b/10/20

    mount --rbind /mnt/b /mnt/b/10/20

    unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' &
    sleep 1
    umount -l /mnt/b
    wait %%

    $ unshare -Urm ./locked_mounts_test.sh

    This failure is corrected by removing the prepass that marks mounts
    that may be umounted.

    A first pass is added that umounts mounts if possible and, if not,
    sets the mount mark if they could be unmounted were they not locked,
    and adds them to a list of umount possibilities. This first pass
    reconsiders a mount's parent if it is on the list of umount
    possibilities, ensuring that information about umountability will
    pass from child to parent.

    A second pass then walks through all mounts that are umounted and processes
    their children unmounting them or marking them for reparenting.

    A last pass cleans up the state on the mounts that could not be umounted
    and if applicable reparents them to their first parent that remained
    mounted.

    While a bit longer than the old code, this code is much more robust,
    as it allows information to flow up from the leaves and down from the
    trunk, making the order in which mounts are encountered in the umount
    propagation tree irrelevant.

    Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 570487d3faf2a1d8a220e6ee10f472163123d7da upstream.

    It was observed that in some pathological cases the current code
    does not unmount everything it should. After investigation it
    was determined that the issue is that mnt_change_mountpoint can
    change which mounts are available to be unmounted during mount
    propagation, which is wrong.

    The trivial reproducer is:
    $ cat ./pathological.sh

    mount -t tmpfs test-base /mnt
    cd /mnt
    mkdir 1 2 1/1
    mount --bind 1 1
    mount --make-shared 1
    mount --bind 1 2
    mount --bind 1/1 1/1
    mount --bind 1/1 1/1
    echo
    grep test-base /proc/self/mountinfo
    umount 1/1
    echo
    grep test-base /proc/self/mountinfo

    $ unshare -Urm ./pathological.sh

    The expected output looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    The output without the fix looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    That last mount in the output was in the propagation tree to be
    unmounted, but was missed because mnt_change_mountpoint changed its
    parent before the walk through the mount propagation tree observed it.

    Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
    Acked-by: Andrei Vagin
    Reviewed-by: Ram Pai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit da029c11e6b12f321f36dac8771e833b65cec962 upstream.

    To avoid pathological stack usage or the need to special-case setuid
    execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit eab09532d40090698b05a07c1c87f39fdbc5fab5 upstream.

    The ELF_ET_DYN_BASE position was originally intended to keep loaders
    away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
    /bin/cat" might cause the subsequent load of /bin/cat into where the
    loader had been loaded.)

    With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
    ELF_ET_DYN_BASE continued to be used since the kernel was only looking
    at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the
    top 1/3rd of the TASK_SIZE, a substantial portion of the address space
    is unused.

    For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
    loaded above the mmap region. This means they can be made to collide
    (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
    pathological stack regions.

    Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
    region in all cases, and will now additionally avoid programs falling
    back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
    if it would have collided with the stack, now it will fail to load
    instead of falling back to the mmap region).

    To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
    are loaded into the mmap region, leaving space available for either an
    ET_EXEC binary with a fixed location or PIE being loaded into mmap by
    the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
    which means architectures can now safely lower their values without risk
    of loaders colliding with their subsequently loaded programs.

    For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
    the entire 32-bit address space for 32-bit pointers.

    Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
    suggestions on how to implement this solution.

    Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
    Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Pratyush Anand
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit b17c070fb624cf10162cf92ea5e1ec25cd8ac176 upstream.

    __list_lru_walk_one() holds the nlru spin lock (nlru->lock) for as long as
    it takes to walk the lru list, and under the current code it can hold it
    for up to UINT_MAX entries at a time. So when the lru list contains many
    items, "BUG: spinlock lockup suspected" is observed in the below path:

    spin_bug+0x90
    do_raw_spin_lock+0xfc
    _raw_spin_lock+0x28
    list_lru_add+0x28
    dput+0x1c8
    path_put+0x20
    terminate_walk+0x3c
    path_lookupat+0x100
    filename_lookup+0x6c
    user_path_at_empty+0x54
    SyS_faccessat+0xd0
    el0_svc_naked+0x24

    This nlru->lock is acquired by another CPU in this path -

    d_lru_shrink_move+0x34
    dentry_lru_isolate_shrink+0x48
    __list_lru_walk_one.isra.10+0x94
    list_lru_walk_node+0x40
    shrink_dcache_sb+0x60
    do_remount_sb+0xbc
    do_emergency_remount+0xb0
    process_one_work+0x228
    worker_thread+0x2e0
    kthread+0xf4
    ret_from_fork+0x10

    Fix this lockup by reducing the number of entries shrunk from the lru list
    to 1024 at once. Also, add cond_resched() before processing the lru list
    again.

    Link: http://marc.info/?t=149722864900001&r=1&w=2
    Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Suggested-by: Jan Kara
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Alexander Polakov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sahitya Tummala
     

15 Jul, 2017

1 commit

  • commit 1ea1516fbbab2b30bf98c534ecaacba579a35208 upstream.

    kstrtoull returns 0 on success; however, reserved_clusters_store returned
    -EINVAL when kstrtoull returned 0, so updating the reserved_clusters value
    through sysfs always failed.

    Fixes: 76d33bca5581b1dd5c3157fa168db849a784ada4
    Signed-off-by: Chao Yu
    Signed-off-by: Miao Xie
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     

12 Jul, 2017

4 commits

  • commit 961ae1d83d055a4b9ebbfb4cc8ca62ec1a7a3b74 upstream.

    Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
    glocks were freed via call_rcu to allow reading the glock hashtable
    locklessly using rcu. That commit changed this to free glocks immediately,
    which made reading the glock hashtable unsafe. Bring back the original
    code for freeing glocks via call_rcu.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     
  • commit b50c2de51e611da90cf3cf04c058f7e9bbe79e93 upstream.

    The dirfragtree is lazily updated, so it's not always accurate. An
    infinite loop happens in the following circumstance:

    - client sends a request to read frag A
    - frag A has been fragmented into frags B and C, so the mds fills the
    reply with the contents of frag B
    - client wants to read the next frag, C, but ceph_choose_frag(frag value
    of C) returns frag A

    The fix is to use the previous readdir reply to calculate the next readdir
    frag when possible.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Yan, Zheng
     
  • commit 629e014bb8349fcf7c1e4df19a842652ece1c945 upstream.

    Currently we just stash anything we got into file->f_flags, and then
    report it in fcntl(F_GETFL). This patch clears out all unknown flags so
    that we neither pass them to the fs nor report them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 80f18379a7c350c011d30332658aa15fe49a8fa5 upstream.

    Add a central define for all valid open flags, and use it in the uniqueness
    check.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

05 Jul, 2017

5 commits

  • commit 33496c3c3d7b88dcbe5e55aa01288b05646c6aca upstream.

    Configfs is the interface through which ocfs2-tools passes configuration
    to the kernel, and $configfs_dir/cluster/$clustername/heartbeat/dead_threshold
    is the attribute used to configure the heartbeat dead threshold. The
    kernel has a default value for it, but users can set
    O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb to override it.

    Commit 45b997737a80 ("ocfs2/cluster: use per-attribute show and store
    methods") changed the heartbeat dead threshold attribute name while
    ocfs2-tools did not, so ocfs2-tools can no longer set this configurable
    and the default value is always used. So revert it.

    Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store methods")
    Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     
  • [ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ]

    If the last section of a core file ends with an unmapped or zero page,
    the size of the file does not correspond with the last dump_skip() call.
    gdb complains that the file is truncated and can be confusing to users.

    After all of the vma sections are written, make sure that the file size
    is no smaller than the current file position.

    This problem can be demonstrated with gdb's bigcore testcase on the
    sparc architecture.

    Signed-off-by: Dave Kleikamp
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • [ Upstream commit a12f1ae61c489076a9aeb90bddca7722bf330df3 ]

    lockdep reports a warning. file_start_write()/file_end_write() only
    acquire/release the lock for regular files, so check the file type on the
    aio side too.

    [ 453.532141] ------------[ cut here ]------------
    [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
    [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
    [ 453.533011] [] dump_stack+0x67/0x9c
    [ 453.533011] [] __warn+0x111/0x130
    [ 453.533011] [] warn_slowpath_fmt+0x97/0xb0
    [ 453.533011] [] ? __warn+0x130/0x130
    [ 453.533011] [] ? blk_finish_plug+0x29/0x60
    [ 453.533011] [] lock_release+0x434/0x670
    [ 453.533011] [] ? import_single_range+0xd4/0x110
    [ 453.533011] [] ? rw_verify_area+0x65/0x140
    [ 453.533011] [] ? aio_write+0x1f6/0x280
    [ 453.533011] [] aio_write+0x229/0x280
    [ 453.533011] [] ? aio_complete+0x640/0x640
    [ 453.533011] [] ? debug_check_no_locks_freed+0x1a0/0x1a0
    [ 453.533011] [] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
    [ 453.533011] [] ? debug_lockdep_rcu_enabled+0x35/0x40
    [ 453.533011] [] ? __might_fault+0x7e/0xf0
    [ 453.533011] [] do_io_submit+0x94c/0xb10
    [ 453.533011] [] ? do_io_submit+0x23e/0xb10
    [ 453.533011] [] ? SyS_io_destroy+0x270/0x270
    [ 453.533011] [] ? mark_held_locks+0x23/0xc0
    [ 453.533011] [] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [ 453.533011] [] SyS_io_submit+0x10/0x20
    [ 453.533011] [] entry_SYSCALL_64_fastpath+0x18/0xad
    [ 453.533011] [] ? trace_hardirqs_off_caller+0xc0/0x110
    [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]---

    Cc: Dmitry Monakhov
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • [ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ]

    For such a file mapping,

    [0-4k][hole][8k-12k]

    In NO_HOLES mode, we don't have the [hole] extent any more.
    Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
    fixed disk isize not being updated in NO_HOLES mode when data is not flushed.

    However, even if the data has been flushed, we can still end up with a
    wrong disk isize, since we updated it to the 'start' offset of the last
    evicted extent.

    Reviewed-by: Chris Mason
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 97dcdea076ecef41ea4aaa23d4397c2f622e4265 ]

    The following deadlock is seen when executing generic/113 test,

    ---------------------------------------------------------+-----------------------------------------------------
     Direct I/O task                                           Fast fsync task
    ---------------------------------------------------------+-----------------------------------------------------
    btrfs_direct_IO
      __blockdev_direct_IO
        do_blockdev_direct_IO
          do_direct_IO
            btrfs_get_blocks_direct
            while (blocks need to be written)
              get_more_blocks (first iteration)
                btrfs_get_blocks_direct
                  btrfs_create_dio_extent
                    down_read(&BTRFS_I(inode)->dio_sem)
                    Create and add extent map and ordered extent
                    up_read(&BTRFS_I(inode)->dio_sem)
                                                               btrfs_sync_file
                                                                 btrfs_log_dentry_safe
                                                                   btrfs_log_inode_parent
                                                                     btrfs_log_inode
                                                                       btrfs_log_changed_extents
                                                                         down_write(&BTRFS_I(inode)->dio_sem)
                                                                         Collect new extent maps and ordered extents
                                                                         wait for ordered extent completion
              get_more_blocks (second iteration)
                btrfs_get_blocks_direct
                  btrfs_create_dio_extent
                    down_read(&BTRFS_I(inode)->dio_sem)
    --------------------------------------------------------------------------------------------------------------

    In the above description, the Btrfs direct I/O code path has not yet
    started submitting bios for the file range covered by the initial ordered
    extent. Meanwhile, the fast fsync task obtains the write semaphore and
    waits for I/O on the ordered extent to complete. However, the direct I/O
    task is now blocked on obtaining the read semaphore.

    To resolve the deadlock, this commit modifies the Direct I/O code path
    to obtain the read semaphore before invoking
    __blockdev_direct_IO(). The semaphore is then given up after
    __blockdev_direct_IO() returns. This allows the Direct I/O code to
    complete I/O on all the ordered extents it creates.

    Signed-off-by: Chandan Rajendra
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chandan Rajendra