14 May, 2020

1 commit

  • If mounts are deleted after a read(2) call on /proc/self/mounts (or its
    kin), the subsequent read(2) could miss a mount that comes after the
    deleted one in the list. This is because the file position is interpreted
    as the number mount entries from the start of the list.

    E.g. first read gets entries #0 to #9; the seq file index will be 10. Then
    entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
    etc... The next read will continue from entry #10, and #9 is missed.

    Solve this by adding a cursor entry for each open instance. Taking the
    global namespace_sem for write seems excessive, since we are only dealing
    with a per-namespace list. Instead add a per-namespace spinlock and use
    that together with namespace_sem taken for read to protect against
    concurrent modification of the mount list. This may reduce parallelism of
    is_local_mountpoint(), but it's hardly a big contention point. We could
    also use RCU freeing of cursors to make traversal not need additional
    locks, if that turns out to be neceesary.

    Only move the cursor once for each read (cursor is not added on open) to
    minimize cacheline invalidation. When EOF is reached, the cursor is taken
    off the list, in order to prevent an excessive number of cursors due to
    inactive open file descriptors.

    Reported-by: Karel Zak
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

17 Jul, 2019

2 commits

  • We used to need rather convoluted ordering trickery to guarantee
    that dput() of ex-mountpoints happens before the final mntput()
    of the same. Since we don't need that anymore, there's no point
    playing with fs_pin for that.

    Signed-off-by: Al Viro

    Al Viro
     
  • Using dput_to_list() to shift the contributing reference from ->mnt_mountpoint
    to ->mnt_mp->m_dentry. Dentries are dropped (with dput_to_list()) as soon
    as struct mountpoint is destroyed; in cases where we are under namespace_sem
    we use the global list, shrinking it in namespace_unlock(). In case of
    detaching stuck MNT_LOCKed children at final mntput_no_expire() we use a local
    list and shrink it ourselves. ->mnt_ex_mountpoint crap is gone.

    Signed-off-by: Al Viro

    Al Viro
     

31 Jan, 2019

1 commit

  • mount_subtree() creates (and soon destroys) a temporary namespace,
    so that automounts could function normally. These beasts should
    never become anyone's current namespaces; they don't, but it would
    be better to make prevention of that more straightforward. And
    since they don't become anyone's current namespace, we don't need
    to bother with reserving procfs inums for those.

    Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
    put_mnt_ns() accordingly, make mount_subtree() use temporary
    (anon) namespace. is_anon_ns() checks if a namespace is such.

    Signed-off-by: Al Viro

    Al Viro
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

01 Jul, 2017

1 commit

  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

23 May, 2017

2 commits

  • While investigating some poor umount performance I realized that in
    the case of overlapping mount trees where some of the mounts are locked
    the code has been failing to unmount all of the mounts it should
    have been unmounting.

    This failure to unmount all of the necessary
    mounts can be reproduced with:

    $ cat locked_mounts_test.sh

    mount -t tmpfs test-base /mnt
    mount --make-shared /mnt
    mkdir -p /mnt/b

    mount -t tmpfs test1 /mnt/b
    mount --make-shared /mnt/b
    mkdir -p /mnt/b/10

    mount -t tmpfs test2 /mnt/b/10
    mount --make-shared /mnt/b/10
    mkdir -p /mnt/b/10/20

    mount --rbind /mnt/b /mnt/b/10/20

    unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
    sleep 1
    umount -l /mnt/b
    wait %%

    $ unshare -Urm ./locked_mounts_test.sh

    This failure is corrected by removing the prepass that marks mounts
    that may be umounted.

    A first pass is added that umounts mounts if possible and if not sets
    mount mark if they could be unmounted if they weren't locked and adds
    them to a list to umount possibilities. This first pass reconsiders
    the mounts parent if it is on the list of umount possibilities, ensuring
    that information of umoutability will pass from child to mount parent.

    A second pass then walks through all mounts that are umounted and processes
    their children unmounting them or marking them for reparenting.

    A last pass cleans up the state on the mounts that could not be umounted
    and if applicable reparents them to their first parent that remained
    mounted.

    While a bit longer than the old code this code is much more robust
    as it allows information to flow up from the leaves and down
    from the trunk making the order in which mounts are encountered
    in the umount propgation tree irrelevant.

    Cc: stable@vger.kernel.org
    Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • It was observed that in some pathlogical cases that the current code
    does not unmount everything it should. After investigation it
    was determined that the issue is that mnt_change_mntpoint can
    can change which mounts are available to be unmounted during mount
    propagation which is wrong.

    The trivial reproducer is:
    $ cat ./pathological.sh

    mount -t tmpfs test-base /mnt
    cd /mnt
    mkdir 1 2 1/1
    mount --bind 1 1
    mount --make-shared 1
    mount --bind 1 2
    mount --bind 1/1 1/1
    mount --bind 1/1 1/1
    echo
    grep test-base /proc/self/mountinfo
    umount 1/1
    echo
    grep test-base /proc/self/mountinfo

    $ unshare -Urm ./pathological.sh

    The expected output looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    The output without the fix looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    That last mount in the output was in the propgation tree to be unmounted but
    was missed because the mnt_change_mountpoint changed it's parent before the walk
    through the mount propagation tree observed it.

    Cc: stable@vger.kernel.org
    Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
    Acked-by: Andrei Vagin
    Reviewed-by: Ram Pai
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Apr, 2017

2 commits

  • Currently we free fsnotify_mark_connector structure only when inode /
    vfsmount is getting freed. This can however impose noticeable memory
    overhead when marks get attached to inodes only temporarily. So free the
    connector structure once the last mark is detached from the object.
    Since notification infrastructure can be working with the connector
    under the protection of fsnotify_mark_srcu, we have to be careful and
    free the fsnotify_mark_connector only after SRCU period passes.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently notification marks are attached to object (inode or vfsmnt) by
    a hlist_head in the object. The list is also protected by a spinlock in
    the object. So while there is any mark attached to the list of marks,
    the object must be pinned in memory (and thus e.g. last iput() deleting
    inode cannot happen). Also for list iteration in fsnotify() to work, we
    must hold fsnotify_mark_srcu lock so that mark itself and
    mark->obj_list.next cannot get freed. Thus we are required to wait for
    response to fanotify events from userspace process with
    fsnotify_mark_srcu lock held. That causes issues when userspace process
    is buggy and does not reply to some event - basically the whole
    notification subsystem gets eventually stuck.

    So to be able to drop fsnotify_mark_srcu lock while waiting for
    response, we have to pin the mark in memory and make sure it stays in
    the object list (as removing the mark waiting for response could lead to
    lost notification events for groups later in the list). However we don't
    want inode reclaim to block on such mark as that would lead to system
    just locking up elsewhere.

    This commit is the first in the series that paves way towards solving
    these conflicting lifetime needs. Instead of anchoring the list of marks
    directly in the object, we anchor it in a dedicated structure
    (fsnotify_mark_connector) and just point to that structure from the
    object. The following commits will also add spinlock protecting the list
    and object pointer to the structure.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

03 Feb, 2017

1 commit

  • Ever since mount propagation was introduced in cases where a mount in
    propagated to parent mount mountpoint pair that is already in use the
    code has placed the new mount behind the old mount in the mount hash
    table.

    This implementation detail is problematic as it allows creating
    arbitrary length mount hash chains.

    Furthermore it invalidates the constraint maintained elsewhere in the
    mount code that a parent mount and a mountpoint pair will have exactly
    one mount upon them. Making it hard to deal with and to talk about
    this special case in the mount code.

    Modify mount propagation to notice when there is already a mount at
    the parent mount and mountpoint where a new mount is propagating to
    and place that preexisting mount on top of the new mount.

    Modify unmount propagation to notice when a mount that is being
    unmounted has another mount on top of it (and no other children), and
    to replace the unmounted mount with the mount on top of it.

    Move the MNT_UMUONT test from __lookup_mnt_last into
    __propagate_umount as that is the only call of __lookup_mnt_last where
    MNT_UMOUNT may be set on any mount visible in the mount hash table.

    These modifications allow:
    - __lookup_mnt_last to be removed.
    - attach_shadows to be renamed __attach_mnt and its shadow
    handling to be removed.
    - commit_tree to be simplified
    - copy_tree to be simplified

    The result is an easier to understand tree of mounts that does not
    allow creation of arbitrary length hash chains in the mount hash table.

    The result is also a very slight userspace visible difference in semantics.
    The following two cases now behave identically, where before order
    mattered:

    case 1: (explicit user action)
    B is a slave of A
    mount something on A/a , it will propagate to B/a
    and than mount something on B/a

    case 2: (tucked mount)
    B is a slave of A
    mount something on B/a
    and than mount something on A/a

    Histroically umount A/a would fail in case 1 and succeed in case 2.
    Now umount A/a succeeds in both configurations.

    This very small change in semantics appears if anything to be a bug
    fix to me and my survey of userspace leads me to believe that no programs
    will notice or care of this subtle semantic change.

    v2: Updated to mnt_change_mountpoint to not call dput or mntput
    and instead to decrement the counts directly. It is guaranteed
    that there will be other references when mnt_change_mountpoint is
    called so this is safe.

    v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
    As the locking in fs/namespace.c changed between v2 and v3.

    v4: Reworked the logic in propagate_mount_busy and __propagate_umount
    that detects when a mount completely covers another mount.

    v5: Removed unnecessary tests whose result is alwasy true in
    find_topper and attach_recursive_mnt.

    v6: Document the user space visible semantic difference.

    Cc: stable@vger.kernel.org
    Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
    Tested-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Dec, 2016

1 commit

  • d_mountpoint() can only be used reliably to establish if a dentry is
    not mounted in any namespace. It isn't aware of the possibility there
    may be multiple mounts using a given dentry that may be in a different
    namespace.

    Add helper functions, path_is_mountpoint(), that checks if a struct path
    is a mountpoint for this case.

    Link: http://lkml.kernel.org/r/20161011053358.27645.9729.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Ian Kent
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    Will create create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts. Which should be perfect to catch misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

31 Aug, 2016

1 commit


01 Jul, 2015

1 commit

  • A patchset to remove support for passing pre-allocated struct seq_file to
    seq_open(). Such feature is undocumented and prone to error.

    In particular, if seq_release() is used in release handler, it will
    kfree() a pointer which was not allocated by seq_open().

    So this patchset drops support for pre-allocated struct seq_file: it's
    only of use in proc_namespace.c and can be easily replaced by using
    seq_open_private()/seq_release_private().

    Additionally, it documents the use of file->private_data to hold pointer
    to struct seq_file by seq_open().

    This patch (of 3):

    Since patch described below, from v2.6.15-rc1, seq_open() could use a
    struct seq_file already allocated by the caller if the pointer to the
    structure is stored in file->private_data before calling the function.

    Commit 1abe77b0fc4b485927f1f798ae81a752677e1d05
    Author: Al Viro
    Date: Mon Nov 7 17:15:34 2005 -0500

    [PATCH] allow callers of seq_open do allocation themselves

    Allow caller of seq_open() to kmalloc() seq_file + whatever else they
    want and set ->private_data to it. seq_open() will then abstain from
    doing allocation itself.

    Such behavior is only used by mounts_open_common().

    In order to drop support for such uncommon feature, proc_mounts is
    converted to use seq_open_private(), which take care of allocating the
    proc_mounts structure, making it available through ->private in struct
    seq_file.

    Conversely, proc_mounts is converted to use seq_release_private(), in
    order to release the private structure allocated by seq_open_private().

    Then, ->private is used directly instead of proc_mounts() macro to access
    to the proc_mounts structure.

    Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

11 May, 2015

1 commit

  • same as legitimize_mnt(), except that it does *not* drop and regain
    rcu_read_lock; return values are
    0 => grabbed a reference, we are fine
    1 => failed, just go away
    -1 => failed, go away and mntput(bastard) when outside of rcu_read_lock

    Signed-off-by: Al Viro

    Al Viro
     

26 Jan, 2015

1 commit


05 Dec, 2014

1 commit


09 Oct, 2014

4 commits

  • The new function detach_mounts comes in two pieces. The first piece
    is a static inline test of d_mounpoint that returns immediately
    without taking any locks if d_mounpoint is not set. In the common
    case when mountpoints are absent this allows the vfs to continue
    running with it's same cacheline foot print.

    The second piece of detach_mounts __detach_mounts actually does the
    work and it assumes that a mountpoint is present so it is slow and
    takes namespace_sem for write, and then locks the mount hash (aka
    mount_lock) after a struct mountpoint has been found.

    With those two locks held each entry on the list of mounts on a
    mountpoint is selected and lazily unmounted until all of the mount
    have been lazily unmounted.

    v7: Wrote a proper change description and removed the changelog
    documenting deleted wrong turns.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • To spot any possible problems call BUG if a mountpoint
    is put when it's list of mounts is not empty.

    AV: use hlist instead of list_head

    Reviewed-by: Miklos Szeredi
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • In preparation for allowing mountpoints to be renamed and unlinked
    in remote filesystems and in other mount namespaces test if on a dentry
    there is a mount in the local mount namespace before allowing it to
    be renamed or unlinked.

    The primary motivation here are old versions of fusermount unmount
    which is not safe if the a path can be renamed or unlinked while it is
    verifying the mount is safe to unmount. More recent versions are simpler
    and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
    in a directory owned by an arbitrary user.

    Miklos Szeredi reports this is approach is good
    enough to remove concerns about new kernels mixed with old versions
    of fusermount.

    A secondary motivation for restrictions here is that it removing empty
    directories that have non-empty mount points on them appears to
    violate the rule that rmdir can not remove empty directories. As
    Linus Torvalds pointed out this is useful for programs (like git) that
    test if a directory is empty with rmdir.

    Therefore this patch arranges to enforce the existing mount point
    semantics for local mount namespace.

    v2: Rewrote the test to be a drop in replacement for d_mountpoint
    v3: Use bool instead of int as the return type of is_local_mountpoint

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • On final mntput() we want fs shutdown to happen before return to
    userland; however, the only case where we want it happen right
    there (i.e. where task_work_add won't do) is MNT_INTERNAL victim.
    Those have to be fully synchronous - failure halfway through module
    init might count on having vfsmount killed right there. Fortunately,
    final mntput on MNT_INTERNAL vfsmounts happens on shallow stack.
    So we handle those synchronously and do an analog of delayed fput
    logics for everything else.

    As the result, we are guaranteed that fs shutdown will always happen
    on shallow stack.

    Signed-off-by: Al Viro

    Al Viro
     

08 Aug, 2014

2 commits

  • Rather than playing silly buggers with vfsmount refcounts, just have
    acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
    and replace it with said clone. Then attach the pin to original
    vfsmount. Voila - the clone will be alive until the file gets closed,
    making sure that underlying superblock remains active, etc., and
    we can drop the original vfsmount, so that it's not kept busy.
    If the file lives until the final mntput of the original vfsmount,
    we'll notice that there's an fs_pin (one in bsd_acct_struct that
    holds that file) and mnt_pin_kill() will take it out. Since
    ->kill() is synchronous, we won't proceed past that point until
    these files are closed (and private clones of our vfsmount are
    gone), so we get the same ordering warranties we used to get.

    mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
    it never became usable outside of kernel/acct.c (and racy wrt
    umount even there).

    Signed-off-by: Al Viro

    Al Viro
     
  • Put these suckers on per-vfsmount and per-superblock lists instead.
    Note: right now it's still acct_lock for everything, but that's
    going to change.

    Signed-off-by: Al Viro

    Al Viro
     

02 Apr, 2014

1 commit


31 Mar, 2014

2 commits

  • fixes RCU bug - walking through hlist is safe in face of element moves,
    since it's self-terminating. Cyclic lists are not - if we end up jumping
    to another hash chain, we'll loop infinitely without ever hitting the
    original list head.

    [fix for dumb braino folded]

    Spotted by: Max Kellermann
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • * switch allocation to alloc_large_system_hash()
    * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
    * switch mountpoint_hashtable from list_head to hlist_head

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

26 Jan, 2014

1 commit

  • A bug was introduced with the is_mounted helper function in
    commit f7a99c5b7c8bd3d3f533c8b38274e33f3da9096e
    Author: Al Viro
    Date: Sat Jun 9 00:59:08 2012 -0400

    get rid of ->mnt_longterm

    it's enough to set ->mnt_ns of internal vfsmounts to something
    distinct from all struct mnt_namespace out there; then we can
    just use the check for ->mnt_ns != NULL in the fast path of
    mntput_no_expire()

    Signed-off-by: Al Viro

    The intent was to test if the real_mount(vfsmount)->mnt_ns was
    NULL_OR_ERR but the code is actually testing real_mount(vfsmount)
    and always returning true.

    The result is d_absolute_path returning paths it should be hiding.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     

09 Nov, 2013

1 commit

  • * RCU-delayed freeing of vfsmounts
    * vfsmount_lock replaced with a seqlock (mount_lock)
    * sequence number from mount_lock is stored in nameidata->m_seq and
    used when we exit RCU mode
    * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
    caller knows that vfsmount will have no surviving references.
    * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
    and doing pending mntput().
    * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
    number against seq, then grabs reference to mnt. Then it rechecks mount_lock
    again to close the race and either returns success or drops the reference it
    has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
    simply decrement the refcount and sod off - aforementioned synchronize_rcu()
    makes sure that final mntput() won't come until we leave RCU mode. We need
    that, since we don't want to end up with some lazy pathwalk racing with
    umount() and stealing the final mntput() from it - caller of umount() may
    expect it to return only once the fs is shut down and we don't want to break
    that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
    full-blown mntput() in case of mount_lock sequence number mismatch happening
    just as we'd grabbed the reference, but in those cases we won't be stealing
    the final mntput() from anything that would care.
    * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
    SMP and UP cases are handled the same way - no ifdefs there.
    * normal pathname resolution does *not* do any writes to mount_lock. It does,
    of course, bump the refcounts of vfsmount and dentry in the very end, but that's
    it.

    Signed-off-by: Al Viro

    Al Viro
     

25 Oct, 2013

3 commits


10 Apr, 2013

1 commit


20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowd by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

19 Nov, 2012

2 commits

  • This will allow for support for unprivileged mounts in a new user namespace.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • setns support for the mount namespace is a little tricky as an
    arbitrary decision must be made about what to set fs->root and
    fs->pwd to, as there is no expectation of a relationship between
    the two mount namespaces. Therefore I arbitrarily find the root
    mount point, and follow every mount on top of it to find the top
    of the mount stack. Then I set fs->root and fs->pwd to that
    location. The topmost root of the mount stack seems like a
    reasonable place to be.

    Bind mount support for the mount namespace inodes has the
    possibility of creating circular dependencies between mount
    namespaces. Circular dependencies can result in loops that
    prevent mount namespaces from every being freed. I avoid
    creating those circular dependencies by adding a sequence number
    to the mount namespace and require all bind mounts be of a
    younger mount namespace into an older mount namespace.

    Add a helper function proc_ns_inode so it is possible to
    detect when we are attempting to bind mound a namespace inode.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

14 Jul, 2012

2 commits

  • don't rely on proc_mounts->m being the first field; container_of()
    is there for purpose. No need to bother with ->private, while
    we are at it - the same container_of will do nicely.

    Signed-off-by: Al Viro

    Al Viro
     
  • it's enough to set ->mnt_ns of internal vfsmounts to something
    distinct from all struct mnt_namespace out there; then we can
    just use the check for ->mnt_ns != NULL in the fast path of
    mntput_no_expire()

    Signed-off-by: Al Viro

    Al Viro
     

07 Jan, 2012

1 commit