13 May, 2020

1 commit

  • For quite a while we have been thinking about using pidfds to attach to
    namespaces. This patchset has existed for about a year already but we've
    wanted to wait to see how the general api would be received and adopted.
    Now that more and more programs in userspace have started using pidfds
    for process management it's time to send this one out.

    This patch makes it possible to use pidfds to attach to the namespaces
    of another process, i.e. they can be passed as the first argument to the
    setns() syscall. When only a single namespace type is specified the
    semantics are equivalent to passing an nsfd. That means
    setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
    when a pidfd is passed, multiple namespace flags can be specified in the
    second setns() argument and setns() will attach the caller to all the
    specified namespaces all at once or to none of them. Specifying 0 is not
    valid together with a pidfd.

    Here are just two obvious examples:
    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);
    Also allowing callers to attach to subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain privilege,
    perform an action and then re-attach another subset of namespaces.
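
    A minimal userspace sketch of the call pattern described above, assuming
    a kernel that accepts pidfds in setns() and provides pidfd_open().
    join_namespaces() is a hypothetical helper name (not part of the patch),
    and the fallback syscall number is an assumption for older headers.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEW* constants for callers */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number, for old headers */
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of `pid`
 * selected by `nstypes` (a mask of CLONE_NEW* flags) with one setns() call.
 * Returns 0 on success or a negative errno value.
 */
static int join_namespaces(pid_t pid, int nstypes)
{
	int ret = 0;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -errno;

	/* Raw syscall; glibc's setns() wrapper takes the same two arguments. */
	if (syscall(SYS_setns, pidfd, nstypes) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

    A caller would use it as join_namespaces(pid, CLONE_NEWNS | CLONE_NEWNET).
    Passing 0 as the flags value is expected to fail, as stated above.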

    If the need arises, as Eric suggested, we can extend this patchset to
    assume even more context than just attaching all namespaces. His suggestion
    specifically was about assuming the process' root directory when
    setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
    keep it flexible in terms of supporting subsets of namespaces but let's
    wait until we have users asking for even more context to be assumed. At
    that point we can add an extension.

    The obvious example where this is useful is a standard container
    manager interacting with a running container: pushing and pulling files
    or directories, injecting mounts, attaching/execing any kind of process,
    managing network devices: all these operations require attaching to all
    or at least multiple namespaces at the same time. Given that nowadays
    most containers are spawned with all namespaces enabled we're currently
    looking at at least 14 syscalls, 7 to open the /proc//ns/
    nsfds, another 7 to actually perform the namespace switch. With time
    namespaces we're looking at about 16 syscalls.
    (We could amortize the first 7 or 8 syscalls for opening the nsfds by
    stashing them in each container's monitor process but that would mean
    we need to send around those file descriptors through unix sockets
    every time we want to interact with the container or keep on-disk
    state. Even in scenarios where a caller wants to join a particular
    namespace in a particular order callers still profit from batching
    other namespaces. That mostly applies to the user namespace but
    all container runtimes I found join the user namespace first no matter
    if it privileges or deprivileges the container similar to how unshare
    behaves.)
    With pidfds this becomes a single syscall no matter how many namespaces
    are supposed to be attached to.

    A decently designed, large-scale container manager usually isn't the
    parent of any of the containers it spawns so the containers don't die
    when it crashes or needs to update or reinitialize. This means that
    for the manager to interact with containers through pids is inherently
    racy especially on systems where the maximum pid number is not
    significantly bumped. This is even more problematic since we often spawn
    and manage thousands or ten-thousands of containers. Interacting with a
    container through a pid thus can become risky quite quickly. Especially
    since we allow for an administrator to enable advanced features such as
    syscall interception where we're performing syscalls in lieu of the
    container. In all of those cases we use pidfds if they are available and
    we pass them around as stable references. Using them to setns() to the
    target process' namespaces is as reliable as using nsfds. Either the
    target process is already dead and we get ESRCH or we manage to attach
    to its namespaces but we can't accidentally attach to another process'
    namespaces. So pidfds lend themselves to be used with this api.
    The other main advantage is that with this change the pidfd becomes the
    only relevant token for most container interactions and it's the only
    token we need to create and send around.

    Apart from significantly reducing the number of syscalls from double
    digit to single digit, which is a decent reason post-spectre/meltdown,
    this also allows switching to a set of namespaces atomically, i.e.
    either attaching to all the specified namespaces succeeds or we fail. If
    we fail we haven't changed a single namespace. There are currently three
    namespaces that can fail (other than for ENOMEM which really is not
    very interesting since we then have other problems anyway) for
    non-trivial reasons, user, mount, and pid namespaces. We can fail to
    attach to a pid namespace if it is not our current active pid namespace
    or a descendant of it. We can fail to attach to a user namespace because
    we are multi-threaded or because our current mount namespace shares
    filesystem state with other tasks, or because we're trying to setns()
    to the same user namespace, i.e. the target task has the same user
    namespace as we do. We can fail to attach to a mount namespace because
    it shares filesystem state with other tasks or because we fail to lookup
    the new root for the new mount namespace. In most non-pathological
    scenarios these issues can be somewhat mitigated. But there are cases where
    we're half-attached to some namespace and failing to attach to another one.
    I've talked about some of these problems during the hallway track (something
    only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
    in 2018(?). Even if all these issues could be avoided with super careful
    userspace coding it would be nicer to have this done in-kernel. Pidfds seem
    to lend themselves nicely for this.

    The other neat thing about this is that setns() becomes an actual
    counterpart to the namespace bits of unshare().

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com

    Christian Brauner
     

13 Mar, 2020

1 commit

  • ns_match returns true if the namespace inode and dev_t match the ones
    provided by the caller.
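
    The same identity check can be sketched from userspace, where a namespace
    is identified by the device and inode numbers that stat(2) reports for a
    /proc/[pid]/ns/* file. This is an analogue only, not the kernel helper:
    ns_id_matches() and same_namespace() are hypothetical names.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Does `st` describe the namespace identified by (dev, ino)? */
static bool ns_id_matches(const struct stat *st, dev_t dev, ino_t ino)
{
	return st->st_dev == dev && st->st_ino == ino;
}

/*
 * Returns 1 if both paths refer to the same namespace, 0 if they do not,
 * and -1 if either path cannot be stat()ed.
 */
static int same_namespace(const char *path_a, const char *path_b)
{
	struct stat a, b;

	if (stat(path_a, &a) < 0 || stat(path_b, &b) < 0)
		return -1;
	return ns_id_matches(&a, b.st_dev, b.st_ino) ? 1 : 0;
}
```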

    Signed-off-by: Carlos Neira
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200304204157.58695-2-cneirabustos@gmail.com

    Carlos Neira
     

30 Jan, 2020

1 commit

  • Pull openat2 support from Al Viro:
    "This is the openat2() series from Aleksa Sarai.

    I'm afraid that the rest of namei stuff will have to wait - it got
    zero review the last time I'd posted #work.namei, and there had been a
    leak in the posted series I'd caught only last weekend. I was going to
    repost it on Monday, but the window opened and the odds of getting any
    review during that... Oh, well.

    Anyway, openat2 part should be ready; that _did_ get sane amount of
    review and public testing, so here it comes"

    From Aleksa's description of the series:
    "For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown
    flags are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road
    to being added to openat(2).

    Furthermore, the need for some sort of control over VFS's path
    resolution (to avoid malicious paths resulting in inadvertent
    breakouts) has been a very long-standing desire of many userspace
    applications.

    This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
    (which was a variant of David Drysdale's O_BENEATH patchset[4] which
    was a spin-off of the Capsicum project[5]) with a few additions and
    changes made based on the previous discussion within [6] as well as
    others I felt were useful.

    In line with the conclusions of the original discussion of
    AT_NO_JUMPS, the flag has been split up into separate flags. However,
    instead of being an openat(2) flag it is provided through a new
    syscall openat2(2) which provides several other improvements to the
    openat(2) interface (see the patch description for more details). The
    following new LOOKUP_* flags are added:

    LOOKUP_NO_XDEV:

    Blocks all mountpoint crossings (upwards, downwards, or through
    absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is
    also blocked (though magic-link jumps on the same vfsmount are
    permitted).

    LOOKUP_NO_MAGICLINKS:

    Blocks resolution through /proc/$pid/fd-style links. This is done
    by blocking the usage of nd_jump_link() during resolution in a
    filesystem. The term "magic-links" is used to match with the only
    reference to these links in Documentation/, but I'm happy to change
    the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

    LOOKUP_BENEATH:

    Disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed.

    Conceptually this flag is to ensure you "stay below" a certain
    point in the filesystem tree -- but this requires some additional checks
    to protect against various races that would allow escape using
    "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done
    as in LOOKUP_IN_ROOT, but that requires more discussion.

    In addition, two new flags are added that expand on the above ideas:

    LOOKUP_NO_SYMLINKS:

    Does what it says on the tin. No symlink resolution is allowed at
    all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
    can still be used with NOFOLLOW to open an fd for the symlink as
    long as no parent path had a symlink component.

    LOOKUP_IN_ROOT:

    This is an extension of LOOKUP_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that
    chroot(2) is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to
    cross magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container.

    There is a long list of CVEs that could have been mitigated by
    having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
    CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
    few).

    In order to make all of the above more usable, I'm working on
    libpathrs[8] which is a C-friendly library for safe path resolution.
    It features a userspace-emulated backend if the kernel doesn't support
    openat2(2). Hopefully we can get userspace to switch to using it, and
    thus get openat2(2) support for free once it's ready.

    Future work would include implementing things like
    RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
    programs to be sure they don't hit DoSes through stale NFS handles)"

    * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Documentation: path-lookup: include new LOOKUP flags
    selftests: add openat2(2) selftests
    open: introduce openat2(2) syscall
    namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
    namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
    namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
    namei: LOOKUP_NO_XDEV: block mountpoint crossing
    namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
    namei: LOOKUP_NO_SYMLINKS: block symlink resolution
    namei: allow set_root() to produce errors
    namei: allow nd_jump_link() to produce errors
    nsfs: clean-up ns_get_path() signature to return int
    namei: only return -ECHILD from follow_dotdot_rcu()

    Linus Torvalds
     

05 Jan, 2020

1 commit

  • Include linux/proc_fs.h and fs/internal.h to address the following
    'sparse' warnings:

    fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
    fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

09 Dec, 2019

1 commit

  • ns_get_path() and ns_get_path_cb() only ever return either NULL or an
    ERR_PTR. It is far more idiomatic to simply return an integer, and it
    makes all of the callers of ns_get_path() more straightforward to read.

    Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

26 May, 2019

2 commits

  • Convert the nsfs filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Eric W. Biederman
    cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     

10 Apr, 2019

2 commits

  • 1) IS_ERR(p) && PTR_ERR(p) == -E... is spelled p == ERR_PTR(-E...)
    2) yes, you can open-code do-while and sometimes there's even
    a good reason to do so. Not in this case, though.
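
    Point 1 can be checked outside the kernel with simplified userspace
    copies of the helpers from include/linux/err.h. The copies below are an
    approximation for illustration, -ENOENT is an arbitrary error picked for
    the example, and the is_enoent_*() names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* The long-hand spelling. */
static bool is_enoent_verbose(const void *p)
{
	return IS_ERR(p) && PTR_ERR(p) == -ENOENT;
}

/* The equivalent short spelling. */
static bool is_enoent_short(const void *p)
{
	return p == ERR_PTR(-ENOENT);
}
```

    Both functions agree for error pointers, valid pointers and NULL, because
    ERR_PTR(-ENOENT) is a single, fixed pointer value inside the error range.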

    Signed-off-by: Al Viro

    Al Viro
     
  • For lockless accesses to dentries we don't have pinned we rely
    (among other things) upon having an RCU delay between dropping
    the last reference and actually freeing the memory.

    On the other hand, for things like pipes and sockets we neither
    do that kind of lockless access, nor want to deal with the
    overhead of an RCU delay every time a socket gets closed.

    So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
    made sure it would happen. We tried to avoid setting it unless
    we knew we needed it. Unfortunately, that had led to a recurring
    class of bugs, in which we missed the need to set it.

    We only really need it for dentries that are created by
    d_alloc_pseudo(), so let's not bother with trying to be smart -
    just make having an RCU delay the default. The ones that do
    *not* get it set the replacement flag (DCACHE_NORCU) and we'd
    better use that sparingly. d_alloc_pseudo() is the only
    such user right now.

    FWIW, the race that finally prompted that switch had been
    between __lock_parent() of immediate subdirectory of what's
    currently the root of a disconnected tree (e.g. from
    open-by-handle in progress) racing with d_splice_alias()
    elsewhere picking another alias for the same inode, either
    on outright corrupted fs image, or (in case of open-by-handle
    on NFS) that subdirectory having been just moved on server.
    It's not easy to hit, so the sky is not falling, but that's
    not the first race on similar missed cases and the logic
    for setting DCACHE_RCUACCESS has gotten ridiculously
    convoluted.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

16 Feb, 2018

1 commit


31 Dec, 2017

1 commit

  • ns_get_path() takes struct task_struct and proc_ns_ops as its
    parameters. For path resolution directly from a namespace,
    e.g. based on a networking device's net name space, we need
    more flexibility. Add a ns_get_path_cb() helper which will
    allow callers to use any method of obtaining the name space
    reference. Convert ns_get_path() to use ns_get_path_cb().

    Following patches will bring a networking user.

    CC: Eric W. Biederman
    Suggested-by: Daniel Borkmann
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     

28 Nov, 2017

1 commit

  • This is a pure automated search-and-replace of the internal kernel
    superblock flags.

    The s_flags are now called SB_*, with the names and the values for the
    moment mirroring the MS_* flags that they're equivalent to.

    Note how the MS_xyz flags are the ones passed to the mount system call,
    while the SB_xyz flags are what we then use in sb->s_flags.

    The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
    include/linux/fs.h include/uapi/linux/bfs_fs.h \
    security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
    DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
    POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
    I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
    ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

06 Jul, 2017

1 commit


09 May, 2017

1 commit

  • Patch series "Expose task pid_ns_for_children to userspace".

    pid_ns_for_children set by a task is known only to the task itself, and
    it's impossible to identify it from outside.

    It's a big problem for checkpoint/restore software like CRIU, because it
    can't correctly handle tasks that do setns(CLONE_NEWPID) in the process of
    their work. If they have a custom pid_ns_for_children before dump, they
    must have the same ns after restore. Otherwise, the restored task is bumped
    into an environment it does not expect.

    This patchset solves the problem. It exposes pid_ns_for_children to ns
    directory in standard way with the name "pid_for_children":

    ~# ls /proc/5531/ns -l | grep pid
    lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
    lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

    This patch (of 2):

    Make it possible to have link content prefix yyy different from the link
    name xxx:

    $ readlink /proc/[pid]/ns/xxx
    yyy:[4026531838]

    This will be used in next patch.
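
    Since the prefix may differ from the link name, userspace should take the
    namespace type from the link content rather than from the file name. A
    small parsing sketch, where parse_ns_link() is a hypothetical helper name:

```c
#include <stdio.h>

/*
 * Split a namespace link target such as "pid:[4026532286]" into its type
 * prefix and inode number. Returns 0 on success, -1 on malformed input.
 */
static int parse_ns_link(const char *target, char type[32],
			 unsigned long long *ino)
{
	int end = 0;

	if (sscanf(target, "%31[^:]:[%llu]%n", type, ino, &end) != 2)
		return -1;
	/* %n is only stored once the closing ']' matched; reject trailing junk. */
	if (end == 0 || target[end] != '\0')
		return -1;
	return 0;
}
```

    The input would typically come from readlink(2) on a /proc/[pid]/ns/ entry,
    NUL-terminated by the caller.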

    Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Andrei Vagin
    Cc: Andreas Gruenbacher
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Paul Moore
    Cc: Eric Biederman
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

20 Apr, 2017

1 commit

  • Andrey reported a use-after-free in __ns_get_path():

    spin_lock include/linux/spinlock.h:299 [inline]
    lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
    __ns_get_path+0x197/0x860 fs/nsfs.c:66
    open_related_ns+0xda/0x200 fs/nsfs.c:143
    sock_ioctl+0x39d/0x440 net/socket.c:1001
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

    We are under rcu read lock protection at that point:

    rcu_read_lock();
    d = atomic_long_read(&ns->stashed);
    if (!d)
            goto slow;
    dentry = (struct dentry *)d;
    if (!lockref_get_not_dead(&dentry->d_lockref))
            goto slow;
    rcu_read_unlock();

    but don't use a proper RCU API on the free path, therefore a parallel
    __d_free() could free it at the same time. We need to mark the stashed
    dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
    readers leave RCU.

    Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
    Cc: Alexander Viro
    Cc: Andrew Morton
    Reported-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: Linus Torvalds

    Cong Wang
     

03 Feb, 2017

1 commit

  • I'd like to write code that discovers the user namespace hierarchy on a
    running system, and also shows who owns the various user namespaces.
    Currently, there is no way of getting the owner UID of a user namespace.
    Therefore, this patch adds a new NS_GET_CREATOR_UID ioctl() that fetches
    the UID (as seen in the user namespace of the caller) of the creator of
    the user namespace referred to by the specified file descriptor.

    If the supplied file descriptor does not refer to a user namespace,
    the operation fails with the error EINVAL. If the owner UID does
    not have a mapping in the caller's user namespace return the
    overflow UID as that appears easier to deal with in practice
    in user-space applications.

    -- EWB Changed the handling of unmapped UIDs from -EOVERFLOW
    back to the overflow uid. Per conversation with
    Michael Kerrisk after examining his test code.

    Acked-by: Andrey Vagin
    Signed-off-by: Michael Kerrisk
    Signed-off-by: Eric W. Biederman

    Michael Kerrisk (man-pages)
     

25 Jan, 2017

1 commit

  • Linux 4.9 added two ioctl() operations that can be used to discover:

    * the parental relationships for hierarchical namespaces (user and PID)
    [NS_GET_PARENT]
    * the user namespace that owns a specified non-user-namespace
    [NS_GET_USERNS]

    For no good reason that I can glean, NS_GET_USERNS was made synonymous
    with NS_GET_PARENT for user namespaces. It might have been better if
    NS_GET_USERNS had returned an error if the supplied file descriptor
    referred to a user namespace, since it suggests that the caller may be
    confused. More particularly, if it had generated an error, then I wouldn't
    need the new ioctl() operation proposed here. (On the other hand, what
    I propose here may be more generally useful.)

    I would like to write code that discovers namespace relationships for
    the purpose of understanding the namespace setup on a running system.
    In particular, given a file descriptor (or pathname) for a namespace,
    N, I'd like to obtain the corresponding user namespace. Namespace N
    might be a user namespace (in which case my code would just use N) or
    a non-user namespace (in which case my code will use NS_GET_USERNS to
    get the user namespace associated with N). The problem is that there
    is no way to tell the difference by looking at the file descriptor
    (and if I try to use NS_GET_USERNS on an N that is a user namespace, I
    get the parent user namespace of N, which is not what I want).

    This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given
    a file descriptor that refers to a namespace, returns the
    namespace type (one of the CLONE_NEW* constants).

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Eric W. Biederman

    Michael Kerrisk (man-pages)
     

31 Oct, 2016

1 commit

  • Each socket operates in a network namespace where it has been created,
    so if we want to dump and restore a socket, we have to know its network
    namespace.

    We have socket_diag to get information about sockets, but it doesn't
    report sockets which are not bound or connected.

    This patch introduces a new socket ioctl, which is called SIOCGSKNS
    and used to get a file descriptor for a socket network namespace.

    A task must have CAP_NET_ADMIN in a target network namespace to
    use this ioctl.
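
    From userspace the new ioctl can be sketched like this, assuming SIOCGSKNS
    is provided by <linux/sockios.h>. socket_netns_fd() is a hypothetical
    helper name.

```c
#include <errno.h>
#include <linux/sockios.h> /* SIOCGSKNS */
#include <sys/ioctl.h>

/*
 * Hypothetical helper: return a file descriptor referring to the network
 * namespace of the socket `sockfd`, or a negative errno value.
 */
static int socket_netns_fd(int sockfd)
{
	int nsfd = ioctl(sockfd, SIOCGSKNS);

	return nsfd < 0 ? -errno : nsfd;
}
```

    The returned descriptor can then be compared against other namespace
    files with fstat(2) or handed to setns(2).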

    Cc: "David S. Miller"
    Cc: Eric W. Biederman
    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

23 Sep, 2016

3 commits

  • Move mntget from the very beginning of __ns_get_path to
    the success path of __ns_get_path, and remove the mntget
    calls.

    This removes the possibility that there will be a mntget/mntput
    pair if __ns_get_path has to retry, and generally simplifies the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships.

    In the future we will use this interface to dump and restore nested
    namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • Each namespace has an owning user namespace, and there is currently no
    way to discover these relationships.

    Understanding namespace relationships allows answering the question:
    what capability does process X have to perform operations on a resource
    governed by namespace Y?

    After a long discussion, Eric W. Biederman proposed to use ioctl-s for
    this purpose.

    The NS_GET_USERNS ioctl returns a file descriptor to an owning user
    namespace.
    It returns EPERM if a target namespace is outside of a current user
    namespace.

    v2: rename parent to relative

    v3: Add a missing mntput when returning -EAGAIN --EWB

    Acked-by: Serge Hallyn
    Link: https://lkml.org/lkml/2016/7/6/158
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     

12 Sep, 2015

1 commit

  • The seq_ function return values were frequently misused.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    All uses of these return values have been removed, so convert the
    return types to void.

    Miscellanea:

    o Move seq_put_decimal_ and seq_escape prototypes closer to the
    other seq_vprintf prototypes
    o Reorder seq_putc and seq_puts to return early on overflow
    o Add argument names to seq_vprintf and seq_printf
    o Update the seq_escape kernel-doc
    o Convert a couple of leading spaces to tabs in seq_escape

    Signed-off-by: Joe Perches
    Cc: Al Viro
    Cc: Steven Rostedt
    Cc: Mark Brown
    Cc: Stephen Rothwell
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

12 Jul, 2015

1 commit


16 Apr, 2015

1 commit


11 Dec, 2014

1 commit

  • New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
    It's not mountable (not even registered, so it's not in /proc/filesystems,
    etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().

    This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
    get_proc_ns() is a macro now (it's simply returning ->i_private; would
    have been an inline, if not for header ordering headache).
    proc_ns_inode() is an ex-parrot. The interface used in procfs is
    ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).

    Dentries and inodes are never hashed; a non-counting reference to dentry
    is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
    if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
    of that mechanism.

    As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
    it does nd_jump_link() on a consistent pair it gets
    from ns_get_path().

    Signed-off-by: Al Viro

    Al Viro