28 Oct, 2016

1 commit

  • When kmem accounting switched from account by default to only account if
    flagged by __GFP_ACCOUNT, IPC mqueue and messages were left out.

    The production use case at hand is that mqueues should be customizable
    via sysctls in Docker containers in a Kubernetes cluster. This can only
    be safely allowed to the users of the cluster (without the risk that
    they can cause resource shortage on a node, influencing other users'
    containers) if all resources they control are bounded, i.e. accounted
    for.
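
A minimal userspace sketch of opt-in accounting, to illustrate why an allocation that omits the flag escapes the limit. `ACCOUNT_FLAG`, `struct budget` and `budget_alloc()` are hypothetical names invented for this illustration; they are not kernel APIs and only model the idea behind __GFP_ACCOUNT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for __GFP_ACCOUNT; the value is illustrative. */
#define ACCOUNT_FLAG 0x1u

/* Hypothetical per-container budget, loosely modelled on a memory limit. */
struct budget {
    size_t charged;
    size_t limit;
};

/*
 * Allocate size bytes. The allocation is charged to the budget only when
 * ACCOUNT_FLAG is set; unflagged allocations bypass the limit entirely.
 * Returns NULL if the charge would exceed the limit or malloc() fails.
 */
void *budget_alloc(struct budget *b, size_t size, unsigned int flags)
{
    assert(b != NULL);

    bool accounted = (flags & ACCOUNT_FLAG) != 0;
    if (accounted && size > b->limit - b->charged)
        return NULL;

    void *p = malloc(size);
    if (p != NULL && accounted)
        b->charged += size;
    return p;
}
```

With a limit of 100 bytes, a second flagged 80-byte request is refused, while an unflagged one of the same size still succeeds. That unbounded case is what the commit closes for mqueue and message allocations.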

    Link: http://lkml.kernel.org/r/1476806075-1210-1-git-send-email-arozansk@redhat.com
    Signed-off-by: Aristeu Rozanski
    Reported-by: Stefan Schimanski
    Acked-by: Davidlohr Bueso
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Stefan Schimanski
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski
     

12 Oct, 2016

6 commits

  • In a CONFIG_PREEMPT=n kernel, a softlockup was observed in the for loop in
    exit_sem. Apparently it's possible for the loop to take quite a long time
    and it doesn't have a scheduling point in it. Since the code is
    executing under an rcu read section this may also cause rcu stalls, which
    in turn block synchronize_rcu operations, which more or less de-stabilises
    the whole system.

    Fix this by introducing a cond_resched() at the beginning of the loop.

    So this patch fixes the following:

    NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
    CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
    Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
    task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
    RIP: 0010:[] [] _raw_spin_lock+0x17/0x30
    RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
    RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
    RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
    R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
    R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
    FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
    Call Trace:
    ? exit_sem+0x7c/0x280
    do_exit+0x338/0xb40
    do_group_exit+0x43/0xd0
    SyS_exit_group+0x14/0x20
    entry_SYSCALL_64_fastpath+0x16/0x6e
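
The fix described above amounts to adding a scheduling point to a long-running loop. The following is a userspace analogy only: `sched_yield()` stands in for cond_resched(), and `drain_entries()` with its per-entry "work" is a hypothetical example, not the exit_sem code.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Walk a (possibly very long) array of entries and offer the CPU to other
 * tasks at the top of every iteration. Returns the number of non-zero
 * entries, which stands in for the real per-entry work.
 */
size_t drain_entries(const int *entries, size_t n)
{
    assert(entries != NULL || n == 0);

    size_t handled = 0;
    for (size_t i = 0; i < n; i++) {
        sched_yield();          /* scheduling point at the start of the loop */
        if (entries[i] != 0)    /* hypothetical per-entry work */
            handled++;
    }
    return handled;
}
```

Unlike this sketch, cond_resched() only reschedules when the scheduler has asked for it, so it is cheap enough to call on every iteration.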

    Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
    Signed-off-by: Nikolay Borisov
    Cc: Herton R. Krzesinski
    Cc: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     
  • Blocked tasks queued in q_senders waiting for their message to fit in the
    queue are blindly awoken every time we think there's a remote chance this
    might happen. This could cause numerous (and expensive -- thundering
    herd-ish) bogus wakeups if the queue is still really full. Adding to the
    scheduling cost/overhead, there's also the fact that we need to take the
    ipc object lock and requeue ourselves in the q_senders list.

    By keeping track of the blocked sender's message size, we can know
    in advance whether the wakeup ought to occur. Otherwise, to maintain
    the current wakeup order we just move it to the tail. This is exactly
    what occurs right now if the sender needs to go back to sleep.
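
The selection logic can be sketched in userspace as follows. This is a simplified model under stated assumptions: `struct sender` and `wake_fitting_senders()` are hypothetical, the fit test looks at bytes only, and an array replaces the kernel's linked list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a blocked sender. */
struct sender {
    int    id;      /* illustrative task identifier */
    size_t msgsz;   /* size of the message it is waiting to enqueue */
};

/*
 * Scan senders[0..n) in queue order. Senders whose message fits into
 * free_bytes are copied to woken[]; the rest are compacted to the front of
 * senders[] in their original relative order (the net effect of moving each
 * one to the tail during a full pass). Returns the number of woken senders.
 * No space is reserved per sender: woken tasks still compete for the room.
 */
size_t wake_fitting_senders(struct sender *senders, size_t n,
                            size_t free_bytes, struct sender *woken)
{
    assert(senders != NULL && woken != NULL);

    size_t n_woken = 0, n_blocked = 0;
    for (size_t i = 0; i < n; i++) {
        if (senders[i].msgsz <= free_bytes)
            woken[n_woken++] = senders[i];
        else
            senders[n_blocked++] = senders[i];
    }
    return n_woken;
}
```

For senders with message sizes 100, 10, 500 and 20 and 50 free bytes, only the second and fourth are woken; the other two stay queued in their original order.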

    The case of EIDRM is left completely untouched, as we need to wake up all
    the tasks, and shouldn't be playing games in the first place.

    This patch was seen to save ~15% in context switches on the 'msgctl10' ltp
    testcase (avg out of ten runs). Although these tests are really about
    functionality (as opposed to performance), it does show the direct
    benefits of the optimization.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Manfred Spraul
    Cc: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... 'tis annoying.

    Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Manfred Spraul
    Cc: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Currently the use of wake_qs in sysv msg queues is only for the receiver
    tasks that are blocked on the queue. But blocked sender tasks (due to
    queue size constraints) are still awoken with the ipc object lock held,
    which can be a problem particularly for small sized queues and far from
    graceful for -rt (just like it was for the receiver side).

    The paths that actually wake up a sender are obviously related to when we
    are either getting rid of the queue or after (some) space is freed-up
    after a receiver takes the msg (msgrcv). Furthermore, with the exception
    of msgrcv, we can always piggy-back on expunge_all that has its own tasks
    lined-up for waking. Finally, upon unlinking the message, it should be no
    problem delaying the wakeups a bit until after we've released the lock.

    Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Manfred Spraul
    Cc: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This patch moves the wake_up_process() invocation so it is not done under
    the ipc global lock by making use of a lockless wake_q. With this change,
    the waiter is woken up once the message has been assigned and it does not
    need to loop on SMP if the message points to NULL. In the signal case we
    still need to check the pointer under the lock to verify the state.

    This change should also avoid the introduction of preempt_disable() in -RT
    which avoids a busy-loop which polls for the NULL -> !NULL change if the
    waiter has a higher priority compared to the waker.

    By making use of wake_qs, the logic of sysv msg queues is greatly
    simplified (and very well suited as we can batch lockless wakeups),
    particularly around the lockless receive algorithm.
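
The pattern of collecting wakeups under the lock and issuing them after it is released can be modelled in plain C. This is a single-threaded sketch under stated assumptions: `struct obj_lock`, `struct wake_q`, `wake_one()` and `deliver_and_wake()` are hypothetical stand-ins, and nothing here uses the kernel's wake_q API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WAKE_Q_MAX 16   /* illustrative capacity */

/* Hypothetical userspace stand-ins for the ipc object lock and a wake_q. */
struct obj_lock { bool held; };
struct wake_q   { int ids[WAKE_Q_MAX]; size_t n; };

static int wakeups_under_lock;  /* counts wakeups issued with the lock held */

static void wake_one(const struct obj_lock *lock, int id)
{
    (void)id;                   /* a real implementation would wake task id */
    if (lock->held)
        wakeups_under_lock++;
}

/*
 * Queue the waiters while holding the lock, drop the lock, then issue the
 * wakeups. Returns how many tasks were woken.
 */
size_t deliver_and_wake(struct obj_lock *lock, const int *waiters, size_t n)
{
    assert(lock != NULL && n <= WAKE_Q_MAX);

    struct wake_q q = { .n = 0 };

    lock->held = true;                  /* critical section: bookkeeping only */
    for (size_t i = 0; i < n; i++)
        q.ids[q.n++] = waiters[i];
    lock->held = false;

    for (size_t i = 0; i < q.n; i++)    /* wakeups happen after the unlock */
        wake_one(lock, q.ids[i]);
    return q.n;
}
```

Because the wakeups are batched and issued after the unlock, a woken task never has to contend with the waker for the object lock.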

    This has been tested with Manfred's pmsg-shared tool on an "AMD A10-7800
    Radeon R7, 12 Compute Cores 4C+8G":

    test             | before     | after      | diff
    -----------------|------------|------------|----------
    pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
    pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
    pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 %
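
The "diff" column is the relative change between the two measurements. A small helper (hypothetical, for illustration only) makes the arithmetic explicit:

```c
#include <assert.h>

/* Relative change from before to after, in percent. */
double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For the first row, percent_gain(19347422, 30442191) evaluates to roughly 57.34, matching the table.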

    Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
    race:

    sem_lock has a fast path that allows parallel simple operations.
    There are two reasons why a simple operation cannot run in parallel:
    - a non-simple operation is ongoing (sma->sem_perm.lock held)
    - a complex operation is sleeping (sma->complex_count != 0)

    As both facts are stored independently, a thread can bypass the current
    checks by sleeping in the right positions. See below for more details
    (or kernel bugzilla 105651).

    The patch fixes that by creating one variable (complex_mode)
    that tracks both reasons why parallel operations are not possible.

    The patch also updates stale documentation regarding the locking.

    With regards to stable kernels:
    The patch is required for all kernels that include the
    commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

    The alternative is to revert the patch that introduced the race.

    The patch is safe for backporting, i.e. it makes no assumptions
    about memory barriers in spin_unlock_wait().

    Background:
    Here is the race of the current implementation:

    Thread A: (simple op)
    - does the first "sma->complex_count == 0" test

    Thread B: (complex op)
    - does sem_lock(): This includes an array scan. But the scan can't
    find Thread A, because Thread A does not own sem->lock yet.
    - the thread does the operation, increases complex_count,
    drops sem_lock, sleeps

    Thread A:
    - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
    - sleeps before the complex_count test

    Thread C: (complex op)
    - does sem_lock (no array scan, complex_count==1)
    - wakes up Thread B.
    - decrements complex_count

    Thread A:
    - does the complex_count test

    Bug:
    Now both thread A and thread C operate on the same array, without
    any synchronization.

    Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
    Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
    Reported-by:
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
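
To illustrate what "granularity" means here, the sketch below rounds a timestamp down to a multiple of a filesystem's timestamp resolution. `trunc_to_granularity()` is a hypothetical userspace helper, not the kernel's current_time() implementation.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Round ts down to a multiple of gran_ns nanoseconds, the way a filesystem
 * with coarser timestamps would store it. gran_ns is assumed to be in the
 * range [1, NSEC_PER_SEC].
 */
struct timespec trunc_to_granularity(struct timespec ts, long gran_ns)
{
    assert(gran_ns >= 1 && gran_ns <= NSEC_PER_SEC);

    if (gran_ns == NSEC_PER_SEC)
        ts.tv_nsec = 0;                     /* whole-second resolution */
    else
        ts.tv_nsec -= ts.tv_nsec % gran_ns; /* drop the sub-granule part */
    return ts;
}
```

A filesystem with one-second timestamps would see tv_nsec forced to 0, while one with nanosecond timestamps keeps the value unchanged.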

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and currently there is no way
    to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use-case (which is usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determine
    relationships between namespaces.

    Eric suggested adding two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user
    namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.
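
A minimal userspace sketch of the call described in the man page excerpt above. It assumes a Linux system whose headers provide `<linux/nsfs.h>` with NS_GET_USERNS; the helper name `open_owning_userns()` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/nsfs.h>   /* NS_GET_USERNS, NS_GET_PARENT */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Open the namespace file at ns_path (for example "/proc/self/ns/pid") and
 * ask the kernel for the user namespace that owns it.
 * Returns a new file descriptor, or -1 with errno set.
 */
int open_owning_userns(const char *ns_path)
{
    assert(ns_path != NULL);

    int ns_fd = open(ns_path, O_RDONLY);
    if (ns_fd < 0)
        return -1;

    int owner_fd = ioctl(ns_fd, NS_GET_USERNS);
    int saved_errno = errno;    /* keep the ioctl error across close() */
    close(ns_fd);
    errno = saved_errno;
    return owner_fd;
}
```

The caller owns the returned descriptor and should close() it. As the excerpt notes, EPERM is expected when the owning namespace lies outside the caller's own namespace scope.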

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per user per user
    namespace limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it was made clear that those were
    the wrong error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2016

1 commit


03 Aug, 2016

2 commits

  • Write-only variable.

    Link: http://lkml.kernel.org/r/20160708214356.GA6785@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
    to receive rcu freeing function but used generic ipc_rcu_free() instead
    of msg_rcu_free() which does security cleaning.

    Running LTP msgsnd06 with kmemleak gives the following:

    cat /sys/kernel/debug/kmemleak

    unreferenced object 0xffff88003c0a11f8 (size 8):
    comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
    hex dump (first 8 bytes):
    1b 00 00 00 01 00 00 00 ........
    backtrace:
    kmemleak_alloc+0x23/0x40
    kmem_cache_alloc_trace+0xe1/0x180
    selinux_msg_queue_alloc_security+0x3f/0xd0
    security_msg_queue_alloc+0x2e/0x40
    newque+0x4e/0x150
    ipcget+0x159/0x1b0
    SyS_msgget+0x39/0x40
    entry_SYSCALL_64_fastpath+0x13/0x8f

    Manfred Spraul suggested fixing sem.c as well, and Davidlohr Bueso suggested
    using ipc_rcu_free only in case of security allocation failure in newary().

    Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
    Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounter's credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and put up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • We are going to need to call shmem_charge() under tree_lock to get
    accounting right on collapse of small tmpfs pages into a huge one.

    The problem is that tree_lock is irq-safe and lockdep is not happy that
    we take an irq-unsafe lock under an irq-safe one[1].

    Let's convert the lock to irq-safe.

    [1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc

    Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address that call chose; but when our huge stars are
    rightly aligned, yet that call did not provide a suitable address, go back
    to ask for a larger arena, within which to align the mapping suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Jun, 2016

4 commits

  • Introduce a function may_open_dev that tests MNT_NODEV and a new
    superblock flag SB_I_NODEV. Use this new function in all of the
    places where MNT_NODEV was previously tested.

    Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
    filesystems should never support device nodes, and a simple superblock
    flag makes that very hard to get wrong. With SB_I_NODEV set if any
    device nodes somehow manage to show up on a filesystem those
    device nodes will be unopenable.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
    do not result in executables on mqueuefs by accident.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Today what is normally called data (the mount options) is not passed
    to fill_super through mount_ns.

    Pass the mount options and the namespace separately to mount_ns so
    that filesystems such as proc that have mount options, can use
    mount_ns.

    Pass the user namespace to mount_ns so that the standard permission
    check that verifies the mounter has permissions over the namespace can
    be performed in mount_ns instead of in each filesystem's .mount method,
    thus removing the duplication between mqueuefs and proc in terms of
    permission checks. The extra permission check does not currently
    affect the rpc_pipefs filesystem and the nfsd filesystem as those
    filesystems do not currently allow unprivileged mounts. Without
    unprivileged mounts it is guaranteed that the caller has already passed
    capable(CAP_SYS_ADMIN) which guarantees the extra permission check will
    pass.

    Update rpc_pipefs and the nfsd filesystem to ensure that the network
    namespace reference is always taken in fill_super and always put in kill_sb
    so that the logic is simpler and so that errors originating inside of
    fill_super do not cause a network namespace leak.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Allow the ipc namespace initialization code to depend on ns->user_ns
    being set during initialization.

    In particular this allows mq_init_ns to use ns->user_ns for permission
    checks and initializing s_user_ns while the mq filesystem is
    being mounted.

    Acked-by: Seth Forshee
    Suggested-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

14 Jun, 2016

2 commits

  • With the modified semantics of spin_unlock_wait() a number of
    explicit barriers can be removed. Also update the comment for the
    do_exit() usecase, as that was somewhat stale/obscure.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce smp_acquire__after_ctrl_dep(), this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 May, 2016

1 commit

  • shmat and shmdt rely on mmap_sem for write. If the waiting task gets
    killed by the oom killer it would block oom_reaper from asynchronous
    address space reclaim and reduce the chances of timely OOM resolving.
    Wait for the lock in the killable mode and return with EINTR if the task
    got killed while waiting.

    Signed-off-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
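
The first two rewrites in the list above are safe because the shift distance is zero whenever the two constants are equal. A small sketch with illustrative, MODEL_-prefixed constants (assuming 4 KiB pages; the function names are hypothetical):

```c
#include <assert.h>

/* Illustrative values: 4 KiB pages, page-cache unit equal to a page. */
#define MODEL_PAGE_SHIFT        12
#define MODEL_PAGE_CACHE_SHIFT  MODEL_PAGE_SHIFT

/* The rewrite is only valid while both shifts are identical. */
static_assert(MODEL_PAGE_CACHE_SHIFT == MODEL_PAGE_SHIFT,
              "page cache unit must equal the page size");

/* Old style: convert a page offset to a page-cache index. */
unsigned long old_pgoff_to_index(unsigned long pgoff)
{
    return pgoff >> (MODEL_PAGE_CACHE_SHIFT - MODEL_PAGE_SHIFT);
}

/* New style: the conversion disappears. */
unsigned long new_pgoff_to_index(unsigned long pgoff)
{
    return pgoff;
}
```

Both functions return the same value for every input, which is why the expression can be dropped mechanically.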

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation
    will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Mar, 2016

1 commit

  • As indicated by bug#112271, Linux sets the sempid value upon semctl, and
    not only for semop calls. However, within semctl we only do this for
    SETVAL, leaving SETALL without updating the field, and therefore rather
    inconsistent behavior when compared to other Unices.

    There is really no documentation regarding this and therefore users
    should not make assumptions. With this patch, along with updating
    semctl.2 manpages, this scenario should become less ambiguous. As such,
    set sempid on SETALL cmd.
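
A userspace sketch of how the behaviour could be observed, assuming a Linux system with System V semaphores available. The helper `sempid_after_setall()` and the `MAX_SEMS` bound are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <unistd.h>     /* getpid(), for comparing against the result */

#define MAX_SEMS 8      /* illustrative upper bound for this sketch */

/* semctl(2) requires the caller to define this union. */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/*
 * Create a private set of nsems semaphores, initialise it with SETALL, and
 * return the sempid that GETPID reports for semaphore 0.
 * Returns -1 on any failure.
 */
long sempid_after_setall(int nsems)
{
    assert(nsems >= 1 && nsems <= MAX_SEMS);

    unsigned short values[MAX_SEMS] = { 0 };
    union semun arg = { .array = values };
    long pid = -1;

    int id = semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    if (semctl(id, 0, SETALL, arg) == 0)
        pid = semctl(id, 0, GETPID);

    semctl(id, 0, IPC_RMID);    /* always remove the set again */
    return pid;
}
```

On a kernel that includes this change, the returned value should equal getpid() of the caller.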

    Also update some in-code documentation, specifying where the sempid is
    set.

    Passes ltp and custom testcase where a child (fork) does SETALL to the
    set.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Philip Semanchuk
    Cc: Michael Kerrisk
    Cc: PrasannaKumar Muralidharan
    Cc: Manfred Spraul
    Cc: Herton R. Krzesinski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

19 Feb, 2016

1 commit

  • remap_file_pages(2) emulation can reach a file that represents a removed
    IPC ID as long as a memory segment is mapped. This breaks the
    expectations of the IPC subsystem.

    Test case (rewritten to be more human readable, originally autogenerated
    by syzkaller[1]):

    #define _GNU_SOURCE
    #include
    #include
    #include
    #include

    #define PAGE_SIZE 4096

    int main()
    {
            int id;
            void *p;

            id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0);
            p = shmat(id, NULL, 0);
            shmctl(id, IPC_RMID, NULL);
            remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0);

            return 0;
    }

    The patch changes shm_mmap() and code around shm_lock() to propagate
    locking error back to caller of shm_mmap().
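
    The idea of propagating the lookup failure can be sketched in userspace
    as follows. All names here (seg_lock, seg_open, seg_mmap, struct seg) are
    hypothetical stand-ins, and the choice of -EIDRM as the error code is an
    assumption made for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified stand-in for a shm segment. */
struct seg {
        int removed;    /* set once the ID has been removed */
        int attaches;   /* number of successful opens */
};

/* Stand-in for a locking lookup: fails once the ID is gone. */
static int seg_lock(const struct seg *s)
{
        return s->removed ? -EIDRM : 0;
}

/* Stand-in for the open hook: reports the lookup error instead of hiding it. */
static int seg_open(struct seg *s)
{
        int err = seg_lock(s);

        if (err)
                return err;
        s->attaches++;
        return 0;
}

/* Stand-in for the mmap path: the error travels back to the caller. */
static int seg_mmap(struct seg *s)
{
        int err = seg_open(s);

        if (err)
                return err;
        return 0;
}
```

    In this sketch a mapping attempt on a removed segment fails cleanly and
    leaves the attach count untouched.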

    [1] http://github.com/google/syzkaller

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
            vfree(ptr);
    else
            kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you find
    problems.
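
    A userspace sketch of the dispatch that kvfree() performs is shown below.
    The static arena standing in for the vmalloc address range, the bump
    allocator, and every *_sketch name are assumptions for illustration; the
    real helpers live in the kernel and work on kernel address ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical arena standing in for the vmalloc address range. */
static unsigned char arena[4096];
static size_t arena_used;
static int vfree_calls, kfree_calls;

/* Bump allocator from the arena; memory is never reused in this sketch. */
static void *vmalloc_sketch(size_t n)
{
        void *p;

        if (n > sizeof(arena) - arena_used)
                return NULL;
        p = arena + arena_used;
        arena_used += n;
        return p;
}

/* Address-range test, mirroring the role of is_vmalloc_addr(). */
static int is_vmalloc_addr_sketch(const void *p)
{
        uintptr_t a = (uintptr_t)p, lo = (uintptr_t)arena;

        return a >= lo && a < lo + sizeof(arena);
}

static void vfree_sketch(void *p) { (void)p; vfree_calls++; }
static void kfree_sketch(void *p) { free(p); kfree_calls++; }

/* One entry point: the caller no longer tracks which allocator was used. */
static void kvfree_sketch(void *p)
{
        if (is_vmalloc_addr_sketch(p))
                vfree_sketch(p);
        else
                kfree_sketch(p);
}
```

    Callers then free both kinds of pointer through kvfree_sketch() and can
    drop the flag that remembered the allocation path.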

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
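
    The shape of these wrappers can be sketched in userspace like this. The
    struct mutex below is a hypothetical single-threaded stand-in with no
    real blocking; only the one-to-one mapping from inode_foo(inode) to
    mutex_foo(&inode->i_mutex) is the point.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel mutex; no real blocking. */
struct mutex { int locked; };

static inline void mutex_lock(struct mutex *m)   { assert(!m->locked); m->locked = 1; }
static inline void mutex_unlock(struct mutex *m) { assert(m->locked); m->locked = 0; }
static inline int mutex_is_locked(struct mutex *m) { return m->locked; }

static inline int mutex_trylock(struct mutex *m)
{
        if (m->locked)
                return 0;
        m->locked = 1;
        return 1;
}

/* The subclass only matters to lock validation, so the stand-in ignores it. */
static inline void mutex_lock_nested(struct mutex *m, unsigned subclass)
{
        (void)subclass;
        mutex_lock(m);
}

struct inode { struct mutex i_mutex; };

/* Wrappers: callers stop touching ->i_mutex directly. */
static inline void inode_lock(struct inode *inode)      { mutex_lock(&inode->i_mutex); }
static inline void inode_unlock(struct inode *inode)    { mutex_unlock(&inode->i_mutex); }
static inline int inode_trylock(struct inode *inode)    { return mutex_trylock(&inode->i_mutex); }
static inline int inode_is_locked(struct inode *inode)  { return mutex_is_locked(&inode->i_mutex); }

static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
{
        mutex_lock_nested(&inode->i_mutex, subclass);
}
```

    Because callers only name the wrappers, the type behind them can later
    change without touching every call site.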

    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2016

1 commit

  • Make is_file_shm_hugepages() return bool to improve readability due to
    this particular function only using either one or zero as its return
    value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
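
    The opt-in model can be illustrated with a small userspace sketch: an
    allocation is charged against a group's limit only when the caller
    passes an accounting flag. The flag value, struct cgroup_sketch and
    alloc_sketch() are hypothetical; this is not the memcg implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bit; the kernel's actual __GFP_ACCOUNT value differs. */
#define GFP_ACCOUNT_SKETCH 0x1u

/* Hypothetical per-group counter with a hard limit, in bytes. */
struct cgroup_sketch {
        size_t charged;
        size_t limit;
};

/* Charge the group only for allocations explicitly marked as accounted. */
static void *alloc_sketch(struct cgroup_sketch *cg, size_t n, unsigned flags)
{
        if (flags & GFP_ACCOUNT_SKETCH) {
                if (n > cg->limit - cg->charged)
                        return NULL;    /* over the limit: refuse */
                cg->charged += n;
        }
        return malloc(n);
}
```

    Unflagged allocations bypass the limit entirely, which is why objects
    that userspace can trigger easily need the flag.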

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Nov, 2015

1 commit

  • d0edd8528362 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
    nil dst parameter check, originally being a full BUG_ON. However, this
    check seems quite unnecessary when the only purpose is for
    checkpoint/restore (MSG_COPY flag):

    o The copy variable is set initially to nil, apparently as a way of
    ensuring that prepare_copy is previously called. Which is in fact done,
    unconditionally at the beginning of do_msgrcv.

    o There is no concurrency with 'copy' (stack allocated in do_msgrcv).

    Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
    always be handled by the IS_ERR() family. Therefore remove this check altogether
    as it can never occur with the current users.
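
    For reference, the error-pointer convention referred to here can be
    sketched in userspace as below. ERR_PTR/PTR_ERR/IS_ERR are reimplemented
    locally for illustration, and struct msg_sketch, prepare_copy_sketch()
    and receive_sketch() are hypothetical stand-ins for the msgrcv path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
static inline void *ERR_PTR(long error)     { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }

static inline int IS_ERR(const void *ptr)
{
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical message object and a preparer that may fail. */
struct msg_sketch { long type; };
static struct msg_sketch slot;

static struct msg_sketch *prepare_copy_sketch(int fail)
{
        if (fail)
                return ERR_PTR(-ENOMEM);
        return &slot;
}

/* The caller checks IS_ERR() once; no separate NULL check is needed. */
static long receive_sketch(int fail)
{
        struct msg_sketch *copy = prepare_copy_sketch(fail);

        if (IS_ERR(copy))
                return PTR_ERR(copy);
        return 0;
}
```

    Since the preparer returns either a valid object or an encoded error, a
    separate nil check on the result adds nothing.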

    Signed-off-by: Davidlohr Bueso
    Cc: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

01 Oct, 2015

1 commit

  • As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
    having initialized the IPC object state. Yes, we initialize the IPC
    object in a locked state, but with all the lockless RCU lookup work,
    that IPC object lock no longer means that the state cannot be seen.

    We already did this for the IPC semaphore code (see commit e8577d1f0329:
    "ipc/sem.c: fully initialize sem_array before making it visible") but we
    clearly forgot about msg and shm.
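
    The ordering rule (initialize fully, then make visible) can be shown with
    a userspace analogy using C11 atomics. This is only a sketch under that
    assumption: struct obj_sketch, create_and_publish() and lookup() are
    hypothetical, and the kernel relies on its own ID allocation and RCU
    primitives rather than the calls used here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object that lockless readers may look up. */
struct obj_sketch {
        int key;
        int mode;
};

static struct obj_sketch storage;
static _Atomic(struct obj_sketch *) published;  /* NULL until published */

/* Write every field first; the release store makes the object visible last. */
static void create_and_publish(int key, int mode)
{
        storage.key = key;
        storage.mode = mode;
        atomic_store_explicit(&published, &storage, memory_order_release);
}

/* Lockless reader: sees either NULL or a fully initialized object. */
static struct obj_sketch *lookup(void)
{
        return atomic_load_explicit(&published, memory_order_acquire);
}
```

    Swapping the two steps in create_and_publish() would let a reader observe
    the pointer before the fields are set, which is the bug class described
    above.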

    Reported-by: Dmitry Vyukov
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • Considering Linus' past rants about the (ab)use of BUG in the kernel, I
    took a look at how we deal with such calls in ipc. Given that any errors
    or corruption in ipc code are most likely contained within the set of
    processes participating in the broken mechanisms, there aren't really many
    strong fatal system failure scenarios that would require a BUG call.
    Also, if something is seriously wrong, ipc might not be the place for such
    a BUG either.

    1. For example, recently, a customer hit one of these BUG_ONs in shm
    after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
    and WARN would have been better.

    2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
    I don't see how we can hit this anyway -- at least it should be IS_ERR.
    The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
    first and foremost. We could also probably drop this check altogether.
    Either way, it does not merit a BUG_ON.

    3. No ->fault() callback for the fs getting the corresponding page --
    seems selfish to make the system unusable.
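
    The difference in behavior can be sketched in userspace: a warning
    reports the broken invariant and lets the caller return an error instead
    of halting everything. WARN_ON_SKETCH and use_segment() are hypothetical
    names, and the choice of -EINVAL is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace WARN_ON: report, then hand back the condition. */
static int warn_on_sketch(int cond, const char *expr, const char *file, int line)
{
        if (cond)
                fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
        return cond;
}

#define WARN_ON_SKETCH(c) warn_on_sketch(!!(c), #c, __FILE__, __LINE__)

/* A busted argument yields a warning and an error, not a crash. */
static int use_segment(const void *seg)
{
        if (WARN_ON_SKETCH(seg == NULL))
                return -EINVAL;
        return 0;
}
```

    Only the processes that passed the bad object see a failure; everything
    else keeps running.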

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

15 Aug, 2015

3 commits

  • sem_lock() did not properly pair memory barriers:

    !spin_is_locked() and spin_unlock_wait() are both only control barriers.
    The code needs an acquire barrier, otherwise the cpu might perform read
    operations before the lock test.

    As no primitive exists inside and since it seems
    no one wants another primitive, the code creates a local primitive within
    ipc/sem.c.

    With regards to -stable:

    The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
    nop (just a preprocessor redefinition to improve the readability). The
    bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
    starting from 3.10).
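
    The pairing problem can be illustrated with C11 atomics in userspace.
    This is an analogy only: the flag, the data word and both function names
    are hypothetical, and the kernel code uses its own barrier primitives
    rather than atomic_thread_fence().

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int lock_held;    /* nonzero while the "lock" is held */
static int protected_data;      /* written only by the lock holder */

/* Holder side: publish the data, then drop the lock with release order. */
static void writer_unlock(int value)
{
        protected_data = value;
        atomic_store_explicit(&lock_held, 0, memory_order_release);
}

/*
 * Waiter side: spinning on a relaxed load only gives a control dependency.
 * The acquire fence after the loop is what orders the later read of
 * protected_data after the observation that the lock was free.
 */
static int wait_unlocked_then_read(void)
{
        while (atomic_load_explicit(&lock_held, memory_order_relaxed))
                ;
        atomic_thread_fence(memory_order_acquire);
        return protected_data;
}
```

    Without the fence, the read of protected_data could be performed before
    the lock test, which matches the failure described above.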

    Signed-off-by: Manfred Spraul
    Reported-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Kirill Tkhai
    Cc: Ingo Molnar
    Cc: Josh Poimboeuf
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • After we acquire the sma->sem_perm lock in exit_sem(), we are protected
    against a racing IPC_RMID operation. Also at that point, we are the last
    user of sem_undo_list. Therefore it isn't required that we acquire or use
    ulp->lock.

    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     
  • The current semaphore code allows a potential use after free: in
    exit_sem we may free the task's sem_undo_list while there is still
    another task looping through the same semaphore set and cleaning the
    sem_undo list at freeary function (the task called IPC_RMID for the same
    semaphore set).

    For example, with a test program [1] running which keeps forking a lot
    of processes (which then do a semop call with SEM_UNDO flag), and with
    the parent right after removing the semaphore set with IPC_RMID, and a
    kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
    CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
    in the kernel log:

    Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
    000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
    010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
    Prev obj: start=ffff88003b45c180, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
    Next obj: start=ffff88003b45c200, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).
    ...
    8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
    RIP [] spin_dump+0x53/0xc0
    RSP
    ---[ end trace 783ebb76612867a0 ]---
    NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
    Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
    CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    RIP: native_read_tsc+0x0/0x20
    Call Trace:
    ? delay_tsc+0x40/0x70
    __delay+0xf/0x20
    do_raw_spin_lock+0x96/0x140
    _raw_spin_lock+0xe/0x10
    sem_lock_and_putref+0x11/0x70
    SYSC_semtimedop+0x7bf/0x960
    ? handle_mm_fault+0xbf6/0x1880
    ? dequeue_task_fair+0x79/0x4a0
    ? __do_page_fault+0x19a/0x430
    ? kfree_debugcheck+0x16/0x40
    ? __do_page_fault+0x19a/0x430
    ? __audit_syscall_entry+0xa8/0x100
    ? do_audit_syscall_entry+0x66/0x70
    ? syscall_trace_enter_phase1+0x139/0x160
    SyS_semtimedop+0xe/0x10
    SyS_semop+0x10/0x20
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
    Kernel panic - not syncing: softlockup: hung tasks

    I wasn't able to trigger any badness on a recent kernel without the
    proper config debugs enabled, however I have softlockup reports on some
    kernel versions, in the semaphore code, which are similar to the above (the
    scenario is seen on some servers running IBM DB2 which uses semaphore
    syscalls).

    The patch here fixes the race against freeary, by acquiring or waiting
    on the sem_undo_list lock as necessary (exit_sem can race with freeary,
    while freeary sets un->semid to -1 and removes the same sem_undo from
    list_proc or when it removes the last sem_undo).

    After the patch I'm unable to reproduce the problem using the test case
    [1].

    [1] Test case used below:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define NSEM 1
    #define NSET 5

    int sid[NSET];

    void thread()
    {
            struct sembuf op;
            int s;
            uid_t pid = getuid();

            s = rand() % NSET;
            op.sem_num = pid % NSEM;
            op.sem_op = 1;
            op.sem_flg = SEM_UNDO;

            semop(sid[s], &op, 1);
            exit(EXIT_SUCCESS);
    }

    void create_set()
    {
            int i, j;
            pid_t p;
            union {
                    int val;
                    struct semid_ds *buf;
                    unsigned short int *array;
                    struct seminfo *__buf;
            } un;

            /* Create and initialize semaphore set */
            for (i = 0; i < NSET; i++) {
                    sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
                    if (sid[i] < 0) {
                            perror("semget");
                            exit(EXIT_FAILURE);
                    }
            }
            un.val = 0;
            for (i = 0; i < NSET; i++) {
                    for (j = 0; j < NSEM; j++) {
                            if (semctl(sid[i], j, SETVAL, un) < 0)
                                    perror("semctl");
                    }
            }

            /* Launch threads that operate on semaphore set */
            for (i = 0; i < NSEM * NSET * NSET; i++) {
                    p = fork();
                    if (p < 0)
                            perror("fork");
                    if (p == 0)
                            thread();
            }

            /* Free semaphore set */
            for (i = 0; i < NSET; i++) {
                    if (semctl(sid[i], NSEM, IPC_RMID))
                            perror("IPC_RMID");
            }

            /* Wait for forked processes to exit */
            while (wait(NULL)) {
                    if (errno == ECHILD)
                            break;
            };
    }

    int main(int argc, char **argv)
    {
            pid_t p;

            srand(time(NULL));

            while (1) {
                    p = fork();
                    if (p < 0) {
                            perror("fork");
                            exit(EXIT_FAILURE);
                    }
                    if (p == 0) {
                            create_set();
                            goto end;
                    }

                    /* Wait for forked processes to exit */
                    while (wait(NULL)) {
                            if (errno == ECHILD)
                                    break;
                    };
            }
    end:
            return 0;
    }

    [akpm@linux-foundation.org: use normal comment layout]
    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Cc:
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     

07 Aug, 2015

1 commit

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.
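
    The effect of the flag can be sketched in userspace as an early return
    ahead of the security hook. The flag value, struct inode_sketch and both
    function names are hypothetical stand-ins, and the hook here simply
    pretends that policy denies access.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bit marking an inode as filesystem-internal. */
#define S_PRIVATE_SKETCH 0x200u

struct inode_sketch { unsigned i_flags; };

static int lsm_checks_run;

/* Stand-in for a security module hook; pretend policy denies access. */
static int lsm_hook_sketch(const struct inode_sketch *inode)
{
        (void)inode;
        lsm_checks_run++;
        return -EACCES;
}

/* Private inodes short-circuit before any security module is consulted. */
static int inode_permission_sketch(const struct inode_sketch *inode)
{
        if (inode->i_flags & S_PRIVATE_SKETCH)
                return 0;
        return lsm_hook_sketch(inode);
}
```

    Because the hook is never entered for a private inode, it also cannot
    take any locks on that path.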

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley