21 Nov, 2018

3 commits

  • commit 9c8e0a1b683525464a2abe9fb4b54404a50ed2b4 upstream.

    Timothy Baldwin wrote:
    > As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
    >
    > Mounts that come as a single unit from more privileged mount are locked
    > together and may not be separated in a less privileged mount namespace.
    >
    > However they can:
    >
    > 1. Create a mount namespace.
    > 2. In the mount namespace open a file descriptor to the parent of a mount point.
    > 3. Destroy the mount namespace.
    > 4. Use the file descriptor to look under the mount point.
    >
    > I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
    >
    > The setup:
    >
    > $ sudo sysctl kernel.unprivileged_userns_clone=1
    > kernel.unprivileged_userns_clone = 1
    > $ mkdir -p A/B/Secret
    > $ sudo mount -t tmpfs hide A/B
    >
    >
    > "Secret" is indeed hidden as expected:
    >
    > $ ls -lR A
    > A:
    > total 0
    > drwxrwxrwt 2 root root 40 Feb 12 21:08 B
    >
    > A/B:
    > total 0
    >
    >
    > The attack revealing "Secret":
    >
    > $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit df7342b240185d58d3d9665c0bbf0a0f5570ec29 upstream.

    Jonathan Calmels from NVIDIA reported that he's able to bypass the
    mount visibility security check in place in the Linux kernel by using
    a combination of the unbindable property along with the private mount
    propagation option to allow a unprivileged user to see a path which
    was purposefully hidden by the root user.

    Reproducer:
    # Hide a path to all users using a tmpfs
    root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
    root@castiana:~#

    # As an unprivileged user, unshare user namespace and mount namespace
    stgraber@castiana:~$ unshare -U -m -r

    # Confirm the path is still not accessible
    root@castiana:~# ls /sys/devices/

    # Make /sys recursively unbindable and private
    root@castiana:~# mount --make-runbindable /sys
    root@castiana:~# mount --make-private /sys

    # Recursively bind-mount the rest of /sys over to /mnnt
    root@castiana:~# mount --rbind /sys/ /mnt

    # Access our hidden /sys/device as an unprivileged user
    root@castiana:~# ls /mnt/devices/
    breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
    LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
    tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual

    Solve this by teaching copy_tree to fail if a mount turns out to be
    both unbindable and locked.

    Cc: stable@vger.kernel.org
    Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
    Reported-by: Jonathan Calmels
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 25d202ed820ee347edec0bf3bf553544556bf64b upstream.

    It was recently pointed out that the one instance of testing MNT_LOCKED
    outside of the namespace_sem is in ksys_umount.

    Fix that by adding a test inside of do_umount with namespace_sem and
    the mount_lock held. As it helps to fail fails the existing test is
    maintained with an additional comment pointing out that it may be racy
    because the locks are not held.

    Cc: stable@vger.kernel.org
    Reported-by: Al Viro
    Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

20 Oct, 2018

1 commit


26 Sep, 2018

1 commit

  • [ Upstream commit a6795a585929d94ca3e931bc8518f8deb8bbe627 ]

    The underlying real file used by overlayfs still contains the overlay path.
    This results in mnt_want_write_file() calls by the filesystem getting
    freeze protection on the wrong inode (the overlayfs one instead of the real
    one).

    Fix by using file_inode(file)->i_sb instead of file->f_path.mnt->mnt_sb.

    Reported-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     

16 Aug, 2018

2 commits

  • commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream.

    __legitimize_mnt() has two problems - one is that in case of success
    the check of mount_lock is not ordered wrt preceding increment of
    refcount, making it possible to have successful __legitimize_mnt()
    on one CPU just before the otherwise final mntpu() on another,
    with __legitimize_mnt() not seeing mntput() taking the lock and
    mntput() not seeing the increment done by __legitimize_mnt().
    Solved by a pair of barriers.

    Another is that failure of __legitimize_mnt() on the second
    read_seqretry() leaves us with reference that'll need to be
    dropped by caller; however, if that races with final mntput()
    we can end up with caller dropping rcu_read_lock() and doing
    mntput() to release that reference - with the first mntput()
    having freed the damn thing just as rcu_read_lock() had been
    dropped. Solution: in "do mntput() yourself" failure case
    grab mount_lock, check if MNT_DOOMED has been set by racing
    final mntput() that has missed our increment and if it has -
    undo the increment and treat that as "failure, caller doesn't
    need to drop anything" case.

    It's not easy to hit - the final mntput() has to come right
    after the first read_seqretry() in __legitimize_mnt() *and*
    manage to miss the increment done by __legitimize_mnt() before
    the second read_seqretry() in there. The things that are almost
    impossible to hit on bare hardware are not impossible on SMP
    KVM, though...

    Reported-by: Oleg Nesterov
    Fixes: 48a066e72d97 ("RCU'd vsfmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream.

    mntput_no_expire() does the calculation of total refcount under mount_lock;
    unfortunately, the decrement (as well as all increments) are done outside
    of it, leading to false positives in the "are we dropping the last reference"
    test. Consider the following situation:
    * mnt is a lazy-umounted mount, kept alive by two opened files. One
    of those files gets closed. Total refcount of mnt is 2. On CPU 42
    mntput(mnt) (called from __fput()) drops one reference, decrementing component
    * After it has looked at component #0, the process on CPU 0 does
    mntget(), incrementing component #0, gets preempted and gets to run again -
    on CPU 69. There it does mntput(), which drops the reference (component #69)
    and proceeds to spin on mount_lock.
    * On CPU 42 our first mntput() finishes counting. It observes the
    decrement of component #69, but not the increment of component #0. As the
    result, the total it gets is not 1 as it should've been - it's 0. At which
    point we decide that vfsmount needs to be killed and proceed to free it and
    shut the filesystem down. However, there's still another opened file
    on that filesystem, with reference to (now freed) vfsmount, etc. and we are
    screwed.

    It's not a wide race, but it can be reproduced with artificial slowdown of
    the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.

    Fix consists of moving the refcount decrement under mount_lock; the tricky
    part is that we want (and can) keep the fast case (i.e. mount that still
    has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero
    mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
    before that mntput(). IOW, if mntput() observes (under rcu_read_lock())
    a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
    be dropped.

    Reported-by: Jann Horn
    Tested-by: Jann Horn
    Fixes: 48a066e72d97 ("RCU'd vsfmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

21 Jun, 2018

1 commit

  • [ Upstream commit a9e5b73288cf1595ac2e05cf1acd1924ceea05fa ]

    In do_mount() when the MS_* flags are being converted to MNT_* flags,
    MS_RDONLY got accidentally convered to SB_RDONLY.

    Undo this change.

    Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

24 Apr, 2018

1 commit

  • commit 16a34adb9392b2fe4195267475ab5b472e55292c upstream.

    We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
    their copies. As it is, creating a deep stack of bindings of /proc/*/ns/*
    somewhere in a new namespace and exiting yields a stack overflow.

    Cc: stable@kernel.org
    Reported-by: Alexander Aring
    Bisected-by: Kirill Tkhai
    Tested-by: Kirill Tkhai
    Tested-by: Alexander Aring
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

04 Feb, 2018

1 commit

  • commit d7ee946942bdd12394809305e3df05aa4c8b7b8f upstream.

    Since commit e462ec50cb5fa ("VFS: Differentiate mount flags (MS_*) from
    internal superblock flags") the lazytime mount option doesn't get passed
    on anymore.

    Fix the issue by handling the option in do_mount().

    Reviewed-by: Lukas Czerner
    Signed-off-by: Markus Trippelsdorf
    Signed-off-by: Al Viro
    Cc: Holger Hoffstätte
    Signed-off-by: Greg Kroah-Hartman

    Markus Trippelsdorf
     

17 Oct, 2017

1 commit


05 Oct, 2017

1 commit


15 Sep, 2017

2 commits

  • Pull misc leftovers from Al Viro.

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix the __user misannotations in asm-generic get_user/put_user
    fput: Don't reinvent the wheel but use existing llist API
    namespace.c: Don't reinvent the wheel but use existing llist API

    Linus Torvalds
     
  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

    sed -i -e 's/\/SB_RDONLY/g' \
    -e 's/\/SB_NOSUID/g' \
    -e 's/\/SB_NODEV/g' \
    -e 's/\/SB_NOEXEC/g' \
    -e 's/\/SB_SYNCHRONOUS/g' \
    -e 's/\/SB_MANDLOCK/g' \
    -e 's/\/SB_DIRSYNC/g' \
    -e 's/\/SB_NOATIME/g' \
    -e 's/\/SB_NODIRATIME/g' \
    -e 's/\/SB_SILENT/g' \
    -e 's/\/SB_POSIXACL/g' \
    -e 's/\/SB_KERNMOUNT/g' \
    -e 's/\/SB_I_VERSION/g' \
    -e 's/\/SB_LAZYTIME/g' \
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save a
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

05 Sep, 2017

1 commit

  • Problem with ioctl() is that it's a file operation, yet often used as an
    inode operation (i.e. modify the inode despite the file being opened for
    read-only).

    mnt_want_write_file() is used by filesystems in such cases to get write
    access on an arbitrary open file.

    Since overlayfs lets filesystems do all file operations, including ioctl,
    this can lead to mnt_want_write_file() returning OK for a lower file and
    modification of that lower file.

    This patch prevents modification by checking if the file is from an
    overlayfs lower layer and returning EPERM in that case.

    Need to introduce a mnt_want_write_file_path() variant that still does the
    old thing for inode operations that can do the copy up + modification
    correctly in such cases (fchown, fsetxattr, fremovexattr).

    This does not address the correctness of such ioctls on overlayfs (the
    correct way would be to copy up and attempt to perform ioctl on upper
    file).

    In theory this could be a regression. We very much hope that nobody is
    relying on such a hack in any sane setup.

    While this patch meddles in VFS code, it has no effect on non-overlayfs
    filesystems.

    Reported-by: "zhangyi (F)"
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

28 Aug, 2017

1 commit


17 Jul, 2017

2 commits

  • Differentiate the MS_* flags passed to mount(2) from the internal flags set
    in the super_block's s_flags. s_flags are now called SB_*, with the names
    and the values for the moment mirroring the MS_* flags that they're
    equivalent to.

    In this patch, just the headers are altered and some kernel code where
    blind automated conversion isn't necessarily correct.

    Note that this shows up some interesting issues:

    (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
    MNT_NODEV) without passing this on to the filesystem, but some
    filesystems set such flags anyway.

    (2) The ->remount_fs() methods of some filesystems adjust the *flags
    argument by setting MS_* flags in it, such as MS_NOATIME - but these
    flags are then scrubbed by do_remount_sb() (only the occupants of
    MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
    MS_I_VERSION and MS_LAZYTIME)

    I'm not sure what's the best way to solve all these cases.

    Suggested-by: Al Viro
    Signed-off-by: David Howells

    David Howells
     
  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

11 Jul, 2017

1 commit

  • Kill off s_options, save/replace_mount_options() and generic_show_options()
    as all filesystems now implement ->show_options() for themselves. This
    should make it easier to implement a context-based mount where the mount
    options can be passed individually over a file descriptor.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Jul, 2017

1 commit

  • Update dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove initialization after allocations. In case of
    places where HASH_EARLY was used such as in __pv_init_lock_hash the
    zeroed hash table was already assumed, because memblock zeroes the
    memory.

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

06 Jul, 2017

2 commits

  • Clean up line terminal whitespace in fs/namespace.c and fs/super.c.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Pull mnt namespace updates from Eric Biederman:
    "A big break-through came during this development cycle as a way was
    found to maintain the existing umount -l semantics while allowing for
    optimizations that improve the performance. That is represented by the
    first change in this series moving the reparenting of mounts into
    their own pass. This has allowed addressing the horrific performance
    of umount -l on a carefully crafted tree of mounts with locks held
    (0.06s vs 60s in my testing). What allowed this was not changing where
    umounts propagate to while propgating umounts.

    The next change fixes the case where the order of the mount whose
    umount are being progated visits a tree where the mounts are stacked
    upon each other in another order. This is weird but not hard to
    implement.

    The final change takes advantage of the unchanging mount propgation
    tree to skip parts of the mount propgation tree that have already been
    visited. Yielding a very nice speed up in the worst case.

    There remains one outstanding question about the semantics of umount -l
    that I am still discussiong with Ram Pai. In practice that area of the
    semantics was changed by 1064f874abc0 ("mnt: Tuck mounts under others
    instead of creating shadow/side mounts.") and no regressions have been
    reported. Still I intend to finish talking that out with him to ensure
    there is not something a more intense use of mount propagation in the
    future will not cause to become significant"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    mnt: Make propagate_umount less slow for overlapping mount propagation trees
    mnt: In propgate_umount handle visiting mounts in any order
    mnt: In umount propagation reparent in a separate pass

    Linus Torvalds
     

15 Jun, 2017

1 commit


23 May, 2017

2 commits

  • While investigating some poor umount performance I realized that in
    the case of overlapping mount trees where some of the mounts are locked
    the code has been failing to unmount all of the mounts it should
    have been unmounting.

    This failure to unmount all of the necessary
    mounts can be reproduced with:

    $ cat locked_mounts_test.sh

    mount -t tmpfs test-base /mnt
    mount --make-shared /mnt
    mkdir -p /mnt/b

    mount -t tmpfs test1 /mnt/b
    mount --make-shared /mnt/b
    mkdir -p /mnt/b/10

    mount -t tmpfs test2 /mnt/b/10
    mount --make-shared /mnt/b/10
    mkdir -p /mnt/b/10/20

    mount --rbind /mnt/b /mnt/b/10/20

    unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
    sleep 1
    umount -l /mnt/b
    wait %%

    $ unshare -Urm ./locked_mounts_test.sh

    This failure is corrected by removing the prepass that marks mounts
    that may be umounted.

    A first pass is added that umounts mounts if possible and if not sets
    mount mark if they could be unmounted if they weren't locked and adds
    them to a list to umount possibilities. This first pass reconsiders
    the mounts parent if it is on the list of umount possibilities, ensuring
    that information of umoutability will pass from child to mount parent.

    A second pass then walks through all mounts that are umounted and processes
    their children unmounting them or marking them for reparenting.

    A last pass cleans up the state on the mounts that could not be umounted
    and if applicable reparents them to their first parent that remained
    mounted.

    While a bit longer than the old code this code is much more robust
    as it allows information to flow up from the leaves and down
    from the trunk making the order in which mounts are encountered
    in the umount propgation tree irrelevant.

    Cc: stable@vger.kernel.org
    Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • It was observed that in some pathlogical cases that the current code
    does not unmount everything it should. After investigation it
    was determined that the issue is that mnt_change_mntpoint can
    can change which mounts are available to be unmounted during mount
    propagation which is wrong.

    The trivial reproducer is:
    $ cat ./pathological.sh

    mount -t tmpfs test-base /mnt
    cd /mnt
    mkdir 1 2 1/1
    mount --bind 1 1
    mount --make-shared 1
    mount --bind 1 2
    mount --bind 1/1 1/1
    mount --bind 1/1 1/1
    echo
    grep test-base /proc/self/mountinfo
    umount 1/1
    echo
    grep test-base /proc/self/mountinfo

    $ unshare -Urm ./pathological.sh

    The expected output looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    The output without the fix looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    That last mount in the output was in the propgation tree to be unmounted but
    was missed because the mnt_change_mountpoint changed it's parent before the walk
    through the mount propagation tree observed it.

    Cc: stable@vger.kernel.org
    Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
    Acked-by: Andrei Vagin
    Reviewed-by: Ram Pai
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 May, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Making sure that something like a referral point won't end up as pwd
    or root.

    The main part is the last commit (fixing mntns_install()); that one
    fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
    bit more robust - it should be impossible to get opened files (even
    O_PATH ones) for referral points in the first place, so the existing
    checks are OK, but checking the same thing as in chdir(2) is just as
    cheap.

    The path_init() commit removes a redundant check that shouldn't have
    been there in the first place"

    * 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    make sure that mntns_install() doesn't end up with referral for root
    path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
    make sure that fchdir() won't accept referral points, etc.

    Linus Torvalds
     

22 Apr, 2017

1 commit


10 Apr, 2017

2 commits

  • Currently we free fsnotify_mark_connector structure only when inode /
    vfsmount is getting freed. This can however impose noticeable memory
    overhead when marks get attached to inodes only temporarily. So free the
    connector structure once the last mark is detached from the object.
    Since notification infrastructure can be working with the connector
    under the protection of fsnotify_mark_srcu, we have to be careful and
    free the fsnotify_mark_connector only after SRCU period passes.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently notification marks are attached to object (inode or vfsmnt) by
    a hlist_head in the object. The list is also protected by a spinlock in
    the object. So while there is any mark attached to the list of marks,
    the object must be pinned in memory (and thus e.g. last iput() deleting
    inode cannot happen). Also for list iteration in fsnotify() to work, we
    must hold fsnotify_mark_srcu lock so that mark itself and
    mark->obj_list.next cannot get freed. Thus we are required to wait for
    response to fanotify events from userspace process with
    fsnotify_mark_srcu lock held. That causes issues when userspace process
    is buggy and does not reply to some event - basically the whole
    notification subsystem gets eventually stuck.

    So to be able to drop fsnotify_mark_srcu lock while waiting for
    response, we have to pin the mark in memory and make sure it stays in
    the object list (as removing the mark waiting for response could lead to
    lost notification events for groups later in the list). However we don't
    want inode reclaim to block on such mark as that would lead to system
    just locking up elsewhere.

    This commit is the first in the series that paves way towards solving
    these conflicting lifetime needs. Instead of anchoring the list of marks
    directly in the object, we anchor it in a dedicated structure
    (fsnotify_mark_connector) and just point to that structure from the
    object. The following commits will also add spinlock protecting the list
    and object pointer to the structure.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

02 Mar, 2017

2 commits


03 Feb, 2017

1 commit

  • Ever since mount propagation was introduced in cases where a mount in
    propagated to parent mount mountpoint pair that is already in use the
    code has placed the new mount behind the old mount in the mount hash
    table.

    This implementation detail is problematic as it allows creating
    arbitrary length mount hash chains.

    Furthermore it invalidates the constraint maintained elsewhere in the
    mount code that a parent mount and a mountpoint pair will have exactly
    one mount upon them. Making it hard to deal with and to talk about
    this special case in the mount code.

    Modify mount propagation to notice when there is already a mount at
    the parent mount and mountpoint where a new mount is propagating to
    and place that preexisting mount on top of the new mount.

    Modify unmount propagation to notice when a mount that is being
    unmounted has another mount on top of it (and no other children), and
    to replace the unmounted mount with the mount on top of it.

    Move the MNT_UMUONT test from __lookup_mnt_last into
    __propagate_umount as that is the only call of __lookup_mnt_last where
    MNT_UMOUNT may be set on any mount visible in the mount hash table.

    These modifications allow:
    - __lookup_mnt_last to be removed.
    - attach_shadows to be renamed __attach_mnt and its shadow
    handling to be removed.
    - commit_tree to be simplified
    - copy_tree to be simplified

    The result is an easier to understand tree of mounts that does not
    allow creation of arbitrary length hash chains in the mount hash table.

    The result is also a very slight userspace visible difference in semantics.
    The following two cases now behave identically, where before order
    mattered:

    case 1: (explicit user action)
    B is a slave of A
    mount something on A/a , it will propagate to B/a
    and than mount something on B/a

    case 2: (tucked mount)
    B is a slave of A
    mount something on B/a
    and than mount something on A/a

    Histroically umount A/a would fail in case 1 and succeed in case 2.
    Now umount A/a succeeds in both configurations.

    This very small change in semantics appears if anything to be a bug
    fix to me and my survey of userspace leads me to believe that no programs
    will notice or care of this subtle semantic change.

    v2: Updated to mnt_change_mountpoint to not call dput or mntput
    and instead to decrement the counts directly. It is guaranteed
    that there will be other references when mnt_change_mountpoint is
    called so this is safe.

    v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
    As the locking in fs/namespace.c changed between v2 and v3.

    v4: Reworked the logic in propagate_mount_busy and __propagate_umount
    that detects when a mount completely covers another mount.

    v5: Removed unnecessary tests whose result is alwasy true in
    find_topper and attach_recursive_mnt.

    v6: Document the user space visible semantic difference.

    Cc: stable@vger.kernel.org
    Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
    Tested-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Feb, 2017

1 commit

  • To support unprivileged users mounting filesystems two permission
    checks have to be performed: a test to see if the user allowed to
    create a mount in the mount namespace, and a test to see if
    the user is allowed to access the specified filesystem.

    The automount case is special in that mounting the original filesystem
    grants permission to mount the sub-filesystems, to any user who
    happens to stumble across the their mountpoint and satisfies the
    ordinary filesystem permission checks.

    Attempting to handle the automount case by using override_creds
    almost works. It preserves the idea that permission to mount
    the original filesystem is permission to mount the sub-filesystem.
    Unfortunately using override_creds messes up the filesystems
    ordinary permission checks.

    Solve this by being explicit that a mount is a submount by introducing
    vfs_submount, and using it where appropriate.

    vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
    sget and friends know that a mount is a submount so they can take appropriate
    action.

    sget and sget_userns are modified to not perform any permission checks
    on submounts.

    follow_automount is modified to stop using override_creds as that
    has proven problemantic.

    do_mount is modified to always remove the new MS_SUBMOUNT flag so
    that we know userspace will never by able to specify it.

    autofs4 is modified to stop using current_real_cred that was put in
    there to handle the previous version of submount permission checking.

    cifs is modified to pass the mountpoint all of the way down to vfs_submount.

    debugfs is modified to pass the mountpoint all of the way down to
    trace_automount by adding a new parameter. To make this change easier
    a new typedef debugfs_automount_t is introduced to capture the type of
    the debugfs automount function.

    Cc: stable@vger.kernel.org
    Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
    Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
    Reviewed-by: Trond Myklebust
    Reviewed-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jan, 2017

1 commit

  • Protecting the mountpoint hashtable with namespace_sem was sufficient
    until a call to umount_mnt was added to mntput_no_expire. At which
    point it became possible for multiple calls of put_mountpoint on
    the same hash chain to happen on the same time.

    Kristen Johansen reported:
    > This can cause a panic when simultaneous callers of put_mountpoint
    > attempt to free the same mountpoint. This occurs because some callers
    > hold the mount_hash_lock, while others hold the namespace lock. Some
    > even hold both.
    >
    > In this submitter's case, the panic manifested itself as a GP fault in
    > put_mountpoint() when it called hlist_del() and attempted to dereference
    > a m_hash.pprev that had been poisioned by another thread.

    Al Viro observed that the simple fix is to switch from using the namespace_sem
    to the mount_lock to protect the mountpoint hash table.

    I have taken Al's suggested patch moved put_mountpoint in pivot_root
    (instead of taking mount_lock an additional time), and have replaced
    new_mountpoint with get_mountpoint a function that does the hash table
    lookup and addition under the mount_lock. The introduction of get_mounptoint
    ensures that only the mount_lock is needed to manipulate the mountpoint
    hashtable.

    d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
    already set. This allows get_mountpoint to use the setting of
    DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
    happens exactly once.

    Cc: stable@vger.kernel.org
    Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
    Reported-by: Krister Johansen
    Suggested-by: Al Viro
    Acked-by: Al Viro
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Dec, 2016

1 commit


17 Dec, 2016

4 commits