05 Apr, 2019

1 commit

  • What happens there is that we are replacing file->path.mnt of
    a file we'd just opened with a clone and we need the write
    count contribution to be transferred from original mount to
    new one. That's it. We do *NOT* want any kind of freeze
    protection for the duration of switchover.

    IOW, we should just use __mnt_{want,drop}_write() for that
    switchover; no need to bother with mnt_{want,drop}_write()
    there.

    Tested-by: Amir Goldstein
    Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
    Signed-off-by: Al Viro

    Al Viro
     

31 Jan, 2019

1 commit

  • Create a new helper, vfs_create_mount(), that creates a detached vfsmount
    object from an fs_context that has a superblock attached to it.

    Almost all uses will be paired with immediately preceding vfs_get_tree();
    add a helper for such combination.

    Switch vfs_kern_mount() to use this.

    NOTE: mild behaviour change; passing NULL as 'device name' to
    something like procfs will change /proc/*/mountstats - "device none"
    instead on "no device". That is consistent with /proc/mounts et.al.

    [do'h - EXPORT_SYMBOL_GPL slipped in by mistake; removed]
    [AV -- remove confused comment from vfs_create_mount()]
    [AV -- removed the second argument]

    Reviewed-by: David Howells
    Signed-off-by: Al Viro

    Al Viro
     

21 Dec, 2018

1 commit

  • Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
    remount because the mount data will get parsed with the new fs_context
    stuff prior to doing a remount - and this causes the syscall to fail under
    some circumstances.

    To quote Eric's explanation:

    [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
    string, which breaks systemd unit files with ProtectControlGroups=yes
    (e.g. systemd-networkd.service) when systemd does the following to
    change a cgroup (v1) mount to read-only:

    mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
    MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)

    ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
    enabled, since in that case the error "cgroup1: Need name or subsystem
    set" is hit when the mount options string is empty.

    Probably it doesn't make sense to validate the mount options string at
    all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
    in mind.

    This is also worthwhile doing because we will need to add a mount_setattr()
    syscall to take over the remount-bind function.

    Reported-by: Eric Biggers
    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Reviewed-by: David Howells

    David Howells
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Jul, 2017

1 commit

  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

01 Feb, 2017

1 commit

  • To support unprivileged users mounting filesystems two permission
    checks have to be performed: a test to see if the user allowed to
    create a mount in the mount namespace, and a test to see if
    the user is allowed to access the specified filesystem.

    The automount case is special in that mounting the original filesystem
    grants permission to mount the sub-filesystems, to any user who
    happens to stumble across the their mountpoint and satisfies the
    ordinary filesystem permission checks.

    Attempting to handle the automount case by using override_creds
    almost works. It preserves the idea that permission to mount
    the original filesystem is permission to mount the sub-filesystem.
    Unfortunately using override_creds messes up the filesystems
    ordinary permission checks.

    Solve this by being explicit that a mount is a submount by introducing
    vfs_submount, and using it where appropriate.

    vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
    sget and friends know that a mount is a submount so they can take appropriate
    action.

    sget and sget_userns are modified to not perform any permission checks
    on submounts.

    follow_automount is modified to stop using override_creds as that
    has proven problemantic.

    do_mount is modified to always remove the new MS_SUBMOUNT flag so
    that we know userspace will never by able to specify it.

    autofs4 is modified to stop using current_real_cred that was put in
    there to handle the previous version of submount permission checking.

    cifs is modified to pass the mountpoint all of the way down to vfs_submount.

    debugfs is modified to pass the mountpoint all of the way down to
    trace_automount by adding a new parameter. To make this change easier
    a new typedef debugfs_automount_t is introduced to capture the type of
    the debugfs automount function.

    Cc: stable@vger.kernel.org
    Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
    Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
    Reviewed-by: Trond Myklebust
    Reviewed-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

17 Dec, 2016

1 commit


06 Dec, 2016

1 commit


04 Dec, 2016

1 commit

  • d_mountpoint() can only be used reliably to establish if a dentry is
    not mounted in any namespace. It isn't aware of the possibility there
    may be multiple mounts using a given dentry that may be in a different
    namespace.

    Add helper functions, path_is_mountpoint(), that checks if a struct path
    is a mountpoint for this case.

    Link: http://lkml.kernel.org/r/20161011053358.27645.9729.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Ian Kent
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    Will create create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts. Which should be perfect to catch misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

24 Jun, 2016

1 commit

  • If a process gets access to a mount from a different user
    namespace, that process should not be able to take advantage of
    setuid files or selinux entrypoints from that filesystem. Prevent
    this by treating mounts from other mount namespaces and those not
    owned by current_user_ns() or an ancestor as nosuid.

    This will make it safer to allow more complex filesystems to be
    mounted in non-root user namespaces.

    This does not remove the need for MNT_LOCK_NOSUID. The setuid,
    setgid, and file capability bits can no longer be abused if code in
    a user namespace were to clear nosuid on an untrusted filesystem,
    but this patch, by itself, is insufficient to protect the system
    from abuse of files that, when execed, would increase MAC privilege.

    As a more concrete explanation, any task that can manipulate a
    vfsmount associated with a given user namespace already has
    capabilities in that namespace and all of its descendents. If they
    can cause a malicious setuid, setgid, or file-caps executable to
    appear in that mount, then that executable will only allow them to
    elevate privileges in exactly the set of namespaces in which they
    are already privileges.

    On the other hand, if they can cause a malicious executable to
    appear with a dangerous MAC label, running it could change the
    caller's security context in a way that should not have been
    possible, even inside the namespace in which the task is confined.

    As a hardening measure, this would have made CVE-2014-5207 much
    more difficult to exploit.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Seth Forshee
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Andy Lutomirski
     

18 Apr, 2015

1 commit

  • Pull usernamespace mount fixes from Eric Biederman:
    "Way back in October Andrey Vagin reported that umount(MNT_DETACH)
    could be used to defeat MNT_LOCKED. As I worked to fix this I
    discovered that combined with mount propagation and an appropriate
    selection of shared subtrees a reference to a directory on an
    unmounted filesystem is not necessary.

    That MNT_DETACH is allowed in user namespace in a form that can break
    MNT_LOCKED comes from my early misunderstanding what MNT_DETACH does.

    To avoid breaking existing userspace the conflict between MNT_DETACH
    and MNT_LOCKED is fixed by leaving mounts that are locked to their
    parents in the mount hash table until the last reference goes away.

    While investigating this issue I also found an issue with
    __detach_mounts. The code was unnecessarily and incorrectly
    triggering mount propagation. Resulting in too many mounts going away
    when a directory is deleted, and too many cpu cycles are burned while
    doing that.

    Looking some more I realized that __detach_mounts by only keeping
    mounts connected that were MNT_LOCKED it had the potential to still
    leak information so I tweaked the code to keep everything locked
    together that possibly could be.

    This code was almost ready last cycle but Al invented fs_pin which
    slightly simplifies this code but required rewrites and retesting, and
    I have not been in top form for a while so it took me a while to get
    all of that done. Similiarly this pull request is late because I have
    been feeling absolutely miserable all week.

    The issue of being able to escape a bind mount has not yet been
    addressed, as the fixes are not yet mature"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    mnt: Update detach_mounts to leave mounts connected
    mnt: Fix the error check in __detach_mounts
    mnt: Honor MNT_LOCKED when detaching mounts
    fs_pin: Allow for the possibility that m_list or s_list go unused.
    mnt: Factor umount_mnt from umount_tree
    mnt: Factor out unhash_mnt from detach_mnt and umount_tree
    mnt: Fail collect_mounts when applied to unmounted mounts
    mnt: Don't propagate unmounts to locked mounts
    mnt: On an unmount propagate clearing of MNT_LOCKED
    mnt: Delay removal from the mount hash.
    mnt: Add MNT_UMOUNT flag
    mnt: In umount_tree reuse mnt_list instead of mnt_hash
    mnt: Don't propagate umounts in __detach_mounts
    mnt: Improve the umount_tree flags
    mnt: Use hlist_move_list in namespace_unlock

    Linus Torvalds
     

16 Apr, 2015

1 commit


03 Apr, 2015

1 commit

  • In some instances it is necessary to know if the the unmounting
    process has begun on a mount. Add MNT_UMOUNT to make that reliably
    testable.

    This fix gets used in fixing locked mounts in MNT_DETACH

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

24 Oct, 2014

1 commit


12 Aug, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and general rework of mnt_pin mechanism. That allows
    to go for delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     

08 Aug, 2014

1 commit

  • Rather than playing silly buggers with vfsmount refcounts, just have
    acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
    and replace it with said clone. Then attach the pin to original
    vfsmount. Voila - the clone will be alive until the file gets closed,
    making sure that underlying superblock remains active, etc., and
    we can drop the original vfsmount, so that it's not kept busy.
    If the file lives until the final mntput of the original vfsmount,
    we'll notice that there's an fs_pin (one in bsd_acct_struct that
    holds that file) and mnt_pin_kill() will take it out. Since
    ->kill() is synchronous, we won't proceed past that point until
    these files are closed (and private clones of our vfsmount are
    gone), so we get the same ordering warranties we used to get.

    mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
    it never became usable outside of kernel/acct.c (and racy wrt
    umount even there).

    Signed-off-by: Al Viro

    Al Viro
     

01 Aug, 2014

2 commits

  • While invesgiating the issue where in "mount --bind -oremount,ro ..."
    would result in later "mount --bind -oremount,rw" succeeding even if
    the mount started off locked I realized that there are several
    additional mount flags that should be locked and are not.

    In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
    flags in addition to MNT_READONLY should all be locked. These
    flags are all per superblock, can all be changed with MS_BIND,
    and should not be changable if set by a more privileged user.

    The following additions to the current logic are added in this patch.
    - nosuid may not be clearable by a less privileged user.
    - nodev may not be clearable by a less privielged user.
    - noexec may not be clearable by a less privileged user.
    - atime flags may not be changeable by a less privileged user.

    The logic with atime is that always setting atime on access is a
    global policy and backup software and auditing software could break if
    atime bits are not updated (when they are configured to be updated),
    and serious performance degradation could result (DOS attack) if atime
    updates happen when they have been explicitly disabled. Therefore an
    unprivileged user should not be able to mess with the atime bits set
    by a more privileged user.

    The additional restrictions are implemented with the addition of
    MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
    mnt flags.

    Taken together these changes and the fixes for MNT_LOCK_READONLY
    should make it safe for an unprivileged user to create a user
    namespace and to call "mount --bind -o remount,... ..." without
    the danger of mount flags being changed maliciously.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Kenton Varda discovered that by remounting a
    read-only bind mount read-only in a user namespace the
    MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
    to the remount a read-only mount read-write.

    Correct this by replacing the mask of mount flags to preserve
    with a mask of mount flags that may be changed, and preserve
    all others. This ensures that any future bugs with this mask and
    remount will fail in an easy to detect way where new mount flags
    simply won't change.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

02 Apr, 2014

1 commit

  • The current mainline has copies propagated to *all* nodes, then
    tears down the copies we made for nodes that do not contain
    counterparts of the desired mountpoint. That sets the right
    propagation graph for the copies (at teardown time we move
    the slaves of removed node to a surviving peer or directly
    to master), but we end up paying a fairly steep price in
    useless allocations. It's fairly easy to create a situation
    where N calls of mount(2) create exactly N bindings, with
    O(N^2) vfsmounts allocated and freed in process.

    Fortunately, it is possible to avoid those allocations/freeings.
    The trick is to create copies in the right order and find which
    one would've eventually become a master with the current algorithm.
    It turns out to be possible in O(nodes getting propagation) time
    and with no extra allocations at all.

    One part is that we need to make sure that eventual master will be
    created before its slaves, so we need to walk the propagation
    tree in a different order - by peer groups. And iterate through
    the peers before dealing with the next group.

    Another thing is finding the (earlier) copy that will be a master
    of one we are about to create; to do that we are (temporary) marking
    the masters of mountpoints we are attaching the copies to.

    Either we are in a peer of the last mountpoint we'd dealt with,
    or we have the following situation: we are attaching to mountpoint M,
    the last copy S_0 had been attached to M_0 and there are sequences
    S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
    S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
    such that M is getting propagation from M_{k}. It means that the master
    of M_{k} will be among the sequence of masters of M. On the
    other hand, the nearest marked node in that sequence will either
    be the master of M_{k} or the master of M_{k-1} (the latter -
    in the case if M_{k-1} is a slave of something M gets propagation
    from, but in a wrong peer group).

    So we go through the sequence of masters of M until we find
    a marked one (P). Let N be the one before it. Then we go through
    the sequence of masters of S_0 until we find one (say, S) mounted
    on a node D that has P as master and check if D is a peer of N.
    If it is, S will be the master of new copy, if not - the master of S
    will be.

    That's it for the hard part; the rest is fairly simple. Iterator
    is in next_group(), handling of one prospective mountpoint is
    propagate_one().

    It seems to survive all tests and gives a noticably better performance
    than the current mainline for setups that are seriously using shared
    subtrees.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

09 Nov, 2013

1 commit

  • * RCU-delayed freeing of vfsmounts
    * vfsmount_lock replaced with a seqlock (mount_lock)
    * sequence number from mount_lock is stored in nameidata->m_seq and
    used when we exit RCU mode
    * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
    caller knows that vfsmount will have no surviving references.
    * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
    and doing pending mntput().
    * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
    number against seq, then grabs reference to mnt. Then it rechecks mount_lock
    again to close the race and either returns success or drops the reference it
    has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
    simply decrement the refcount and sod off - aforementioned synchronize_rcu()
    makes sure that final mntput() won't come until we leave RCU mode. We need
    that, since we don't want to end up with some lazy pathwalk racing with
    umount() and stealing the final mntput() from it - caller of umount() may
    expect it to return only once the fs is shut down and we don't want to break
    that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
    full-blown mntput() in case of mount_lock sequence number mismatch happening
    just as we'd grabbed the reference, but in those cases we won't be stealing
    the final mntput() from anything that would care.
    * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
    SMP and UP cases are handled the same way - no ifdefs there.
    * normal pathname resolution does *not* do any writes to mount_lock. It does,
    of course, bump the refcounts of vfsmount and dentry in the very end, but that's
    it.

    Signed-off-by: Al Viro

    Al Viro
     

25 Jul, 2013

1 commit

  • When creating a less privileged mount namespace or propogating mounts
    from a more privileged to a less privileged mount namespace lock the
    submounts so they may not be unmounted individually in the child mount
    namespace revealing what is under them.

    This enforces the reasonable expectation that it is not possible to
    see under a mount point. Most of the time mounts are on empty
    directories and revealing that does not matter, however I have seen an
    occassionaly sloppy configuration where there were interesting things
    concealed under a mount point that probably should not be revealed.

    Expirable submounts are not locked because they will eventually
    unmount automatically so whatever is under them already needs
    to be safe for unprivileged users to access.

    From a practical standpoint these restrictions do not appear to be
    significant for unprivileged users of the mount namespace. Recursive
    bind mounts and pivot_root continues to work, and mounts that are
    created in a mount namespace may be unmounted there. All of which
    means that the common idiom of keeping a directory of interesting
    files and using pivot_root to throw everything else away continues to
    work just fine.

    Acked-by: Serge Hallyn
    Acked-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Mar, 2013

1 commit

  • When a read-only bind mount is copied from mount namespace in a higher
    privileged user namespace to a mount namespace in a lesser privileged
    user namespace, it should not be possible to remove the the read-only
    restriction.

    Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
    remain read-only.

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jan, 2012

15 commits


27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

17 Jan, 2011

1 commit

  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro