04 Apr, 2018

1 commit

  • Pull namespace updates from Eric Biederman:
    "There was a lot of work this cycle fixing bugs that were discovered
    after the merge window and getting everything ready where we can
    reasonably support fully unprivileged fuse. The bug fixes you already
    have and much of the unprivileged fuse work is coming in via other
    trees.

    Still left for fully unprivileged fuse is figuring out how to cleanly
    handle .set_acl and .get_acl in the legacy case, and proper handling
    of evm xattrs on unprivileged mounts.

    Included in the tree is a cleanup from Alexey that replaced a linked
    list with a statically allocated fixed-size array for the pid caches,
    which simplifies and speeds things up.

    Then there are some cleanups and fixes for the ipc namespace. The
    motivation was that in reviewing other code it was discovered that
    accessing ipc objects from different pid namespaces recorded pids in
    such a way that when asked the wrong pids were returned. In the worst case
    there has been a measured 30% performance impact for sysvipc
    semaphores. Other test cases showed no measurable performance impact.
    Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc
    performance both gave the nod that this is good enough.

    Casey Schaufler and James Morris have given their approval to the LSM
    side of the changes.

    I simplified the types and the code dealing with sysvipc to pass just
    kern_ipc_perm for all three types of ipc. Which reduced the header
    dependencies throughout the kernel and simplified the lsm code.

    Which let me work on the pid fixes without having to worry about
    trivial changes causing complete kernel recompiles"

    * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    ipc/shm: Fix pid freeing.
    ipc/shm: fix up for struct file no longer being available in shm.h
    ipc/smack: Tidy up from the change in type of the ipc security hooks
    ipc: Directly call the security hook in ipc_ops.associate
    ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces
    ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces
    ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.
    ipc/util: Helpers for making the sysvipc operations pid namespace aware
    ipc: Move IPCMNI from include/ipc.h into ipc/util.h
    msg: Move struct msg_queue into ipc/msg.c
    shm: Move struct shmid_kernel into ipc/shm.c
    sem: Move struct sem and struct sem_array into ipc/sem.c
    msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks
    shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks
    sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks
    pidns: simpler allocation of pid_* caches

    Linus Torvalds
     

03 Apr, 2018

1 commit

  • All call sites of sys_wait4() set *rusage to NULL. Therefore, there is
    no need for the copy_to_user() handling of *rusage, and we can use
    kernel_wait4() directly.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Acked-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

21 Mar, 2018

1 commit

  • Those pid_* caches are created on demand when a process advances to the new
    level of pid namespace. Which means pointers are stable, write only and
    thus can be packed into an array instead of spreading them over and using
    lists(!) to find them.

    Both first and subsequent clone/unshare(CLONE_NEWPID) become faster.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Eric W. Biederman

    Alexey Dobriyan
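
The idea in the commit above (one cache per nesting level, created on first use and never moved afterwards) can be sketched in userspace C. This is only an illustration: `toy_cache`, `get_cache()` and `TOY_MAX_LEVEL` are hypothetical names, malloc() stands in for the kernel's slab cache creation, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_MAX_LEVEL 32        /* assumed bound on namespace nesting depth */

/* Hypothetical stand-in for a per-level object cache. */
struct toy_cache {
    char name[16];
    unsigned int level;
};

/* One slot per level; a slot is written once and then stays stable. */
static struct toy_cache *cache_by_level[TOY_MAX_LEVEL];

/* Return the cache for a nesting level, creating it on first use. */
static struct toy_cache *get_cache(unsigned int level)
{
    assert(level >= 1 && level <= TOY_MAX_LEVEL);

    /* Direct index instead of walking a linked list. */
    struct toy_cache **slot = &cache_by_level[level - 1];

    if (*slot == NULL) {
        struct toy_cache *c = malloc(sizeof(*c));
        if (c == NULL)
            return NULL;
        snprintf(c->name, sizeof(c->name), "pid_%u", level + 1);
        c->level = level;
        *slot = c;
    }
    return *slot;
}
```

Calling `get_cache(3)` twice returns the same pointer. The lookup is a single array access regardless of how many levels already exist.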
     

18 Nov, 2017

2 commits

  • pidhash is no longer required as all the information can be looked up
    from the idr tree. nr_hashed represented the number of pids that had
    been hashed. Since nr_hashed and PIDNS_HASH_ADDING are no longer
    relevant, they have been renamed to pid_allocated and PIDNS_ADDING
    respectively.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-3-git-send-email-gs051095@gmail.com
    Link: http://lkml.kernel.org/r/1507583624-22146-3-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Tested-by: Tony Luck [ia64]
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
     
  • Patch series "Replacing PID bitmap implementation with IDR API", v4.

    This series replaces kernel bitmap implementation of PID allocation with
    IDR API. These patches are written to simplify the kernel by replacing
    custom code with calls to generic code.

    The following are the stats for pid and pid_namespace object files
    before and after the replacement. There is a noteworthy change between
    the IDR and bitmap implementation.

    Before
       text    data     bss     dec     hex filename
       8447    3894      64   12405    3075 kernel/pid.o
    After
       text    data     bss     dec     hex filename
       3397     304       0    3701     e75 kernel/pid.o

    Before
       text    data     bss     dec     hex filename
       5692    1842     192    7726    1e2e kernel/pid_namespace.o
    After
       text    data     bss     dec     hex filename
       2854     216      16    3086     c0e kernel/pid_namespace.o

    The following are the stats for ps, pstree and calling readdir on /proc
    for 10,000 processes.

    ps:
            With IDR API    With bitmap
    real    0m1.479s        0m2.319s
    user    0m0.070s        0m0.060s
    sys     0m0.289s        0m0.516s

    pstree:
            With IDR API    With bitmap
    real    0m1.024s        0m1.794s
    user    0m0.348s        0m0.612s
    sys     0m0.184s        0m0.264s

    proc:
            With IDR API    With bitmap
    real    0m0.059s        0m0.074s
    user    0m0.000s        0m0.004s
    sys     0m0.016s        0m0.016s

    This patch (of 2):

    Replace the current bitmap implementation for Process ID allocation.
    Functions that are no longer required, for example, free_pidmap(),
    alloc_pidmap(), etc. are removed. The rest of the functions are
    modified to use the IDR API. The change was made to make the PID
    allocation less complex by replacing custom code with calls to generic
    API.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
    [avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
    Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
    Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Acked-by: Oleg Nesterov
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
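
To illustrate what an id-to-pointer map with cyclic allocation does, here is a toy userspace sketch. It is not the kernel IDR (which is a tree and handles concurrency); `toy_idr`, `toy_alloc_cyclic()`, `toy_find()` and `toy_remove()` are hypothetical names, and the tiny fixed table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ID_MIN 1
#define TOY_ID_MAX 8            /* ids live in [TOY_ID_MIN, TOY_ID_MAX) */

/* Toy id -> pointer map; a NULL slot means the id is free. */
struct toy_idr {
    void *slot[TOY_ID_MAX];
    int next;                   /* where the next search starts */
};

/* Allocate the first free id at or after `next`, wrapping around once. */
static int toy_alloc_cyclic(struct toy_idr *idr, void *ptr)
{
    assert(ptr != NULL);

    int span = TOY_ID_MAX - TOY_ID_MIN;
    int start = (idr->next < TOY_ID_MIN || idr->next >= TOY_ID_MAX)
                    ? TOY_ID_MIN : idr->next;

    for (int i = 0; i < span; i++) {
        int id = TOY_ID_MIN + (start - TOY_ID_MIN + i) % span;
        if (idr->slot[id] == NULL) {
            idr->slot[id] = ptr;
            idr->next = id + 1;
            return id;
        }
    }
    return -1;                  /* every id is in use */
}

/* Look up the pointer stored under an id, or NULL. */
static void *toy_find(const struct toy_idr *idr, int id)
{
    if (id < TOY_ID_MIN || id >= TOY_ID_MAX)
        return NULL;
    return idr->slot[id];
}

/* Release an id so a later allocation can reuse it. */
static void toy_remove(struct toy_idr *idr, int id)
{
    if (id >= TOY_ID_MIN && id < TOY_ID_MAX)
        idr->slot[id] = NULL;
}
```

Because the search resumes after the last allocated id, a freed id is not handed out again until the search wraps around, which matches how process IDs are normally observed to increase.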
     

20 Jul, 2017

1 commit

  • It is pointless and confusing to allow a pid namespace hierarchy and
    the user namespace hierarchy to get out of sync. The owner of a child
    pid namespace should be the owner of the parent pid namespace or
    a descendant of the owner of the parent pid namespace.

    Otherwise it is possible to construct scenarios where a process has a
    capability over a parent pid namespace but does not have the
    capability over a child pid namespace. Which confusingly makes
    permission checks non-transitive.

    It requires use of setns into a pid namespace (but not into a user
    namespace) to create such a scenario.

    Add the function in_userns to help in making this determination.

    v2: Optimized in_userns by using level, as suggested by Kirill Tkhai

    Ref: 49f4d8b93ccf ("pidns: Capture the user namespace and filter ns_last_pid")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
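
The ancestry test described above can be sketched in userspace C. The struct and function names here (`toy_userns`, `toy_in_userns()`) are hypothetical stand-ins, and the sketch assumes each namespace records its parent and its depth (`level`, with the root at 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: parent link plus depth. */
struct toy_userns {
    const struct toy_userns *parent;
    int level;                  /* 0 for the root namespace */
};

/*
 * True if `child` is `ancestor` itself or one of its descendants.
 * Using `level` lets the walk stop as soon as it reaches the ancestor's
 * depth instead of always climbing to the root.
 */
static bool toy_in_userns(const struct toy_userns *ancestor,
                          const struct toy_userns *child)
{
    assert(ancestor != NULL && child != NULL);

    const struct toy_userns *ns = child;

    while (ns->level > ancestor->level)
        ns = ns->parent;
    return ns == ancestor;
}
```

Under the rule stated in the commit message, creating a child pid namespace would be refused whenever `toy_in_userns(owner_of_parent_pidns, new_owner)` is false.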
     

14 May, 2017

1 commit

  • The code can potentially sleep for an indefinite amount of time in
    zap_pid_ns_processes, triggering the hung task timeout and increasing
    the system load average. This is undesirable. Sleep with a task state of
    TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
    undesirable side effects.

    Apparently under heavy load this has been allowing Chrome to trigger
    the hung task timeout error and cause ChromeOS to reboot.

    Reported-by: Vovo Yang
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 May, 2017

1 commit

  • pid_ns_for_children set by a task is known only to the task itself, and
    it's impossible to identify it from outside.

    It's a big problem for checkpoint/restore software like CRIU, because it
    can't correctly handle tasks that do setns(CLONE_NEWPID) in the process
    of their work.

    This patch solves the problem, and it exposes pid_ns_for_children to ns
    directory in standard way with the name "pid_for_children":

    ~# ls /proc/5531/ns -l | grep pid
    lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
    lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

    Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Cc: Andrei Vagin
    Cc: Andreas Gruenbacher
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Paul Moore
    Cc: Eric Biederman
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
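
A tool could use the new link roughly as follows. This is a sketch under assumptions: the helper names (`ns_link_path()`, `read_ns_link()`, `forks_into_other_pidns()`) are hypothetical, /proc must be mounted, and the kernel must provide the "pid_for_children" entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<name>"; returns 0 on success, -1 if it does not fit. */
static int ns_link_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the link target, e.g. "pid:[4026531836]"; returns 0 or -1. */
static int read_ns_link(pid_t pid, const char *name, char *out, size_t len)
{
    char path[64];

    assert(len > 0);
    if (ns_link_path(path, sizeof(path), pid, name) != 0)
        return -1;

    ssize_t n = readlink(path, out, len - 1);
    if (n < 0)
        return -1;
    out[n] = '\0';              /* readlink() does not terminate the string */
    return 0;
}

/*
 * Returns 1 if children of `pid` would be created in a different pid
 * namespace than the one `pid` itself is in, 0 if not, -1 on error.
 */
static int forks_into_other_pidns(pid_t pid)
{
    char active[64], children[64];

    if (read_ns_link(pid, "pid", active, sizeof(active)) != 0 ||
        read_ns_link(pid, "pid_for_children", children, sizeof(children)) != 0)
        return -1;
    return strcmp(active, children) != 0;
}
```

For the process shown in the listing above, the two link targets differ, so the last helper would report 1.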
     

02 Mar, 2017

3 commits


10 Jan, 2017

1 commit

  • =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G W
    ---------------------------------------------------------
    swapper/1/0 just changed the state of lock:
    (&(&sighand->siglock)->rlock){-.....}, at: [] __lock_task_sighand+0xb6/0x2c0
    but this lock took another, HARDIRQ-unsafe lock in the past:
    (ucounts_lock){+.+...}
    and interrupts could create inverse lock ordering between them.
    other info that might help us debug this:
    Chain exists of: &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(ucounts_lock);
                                   local_irq_disable();
                                   lock(&(&sighand->siglock)->rlock);
                                   lock(&(&tty->ctrl_lock)->rlock);
      <Interrupt>
        lock(&(&sighand->siglock)->rlock);

    *** DEADLOCK ***

    This patch removes a dependency between rlock and ucounts_lock.

    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrei Vagin
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Andrei Vagin
     

23 Sep, 2016

4 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace, and currently there is no
    way to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use case (usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There was a discussion [1] about which interface to choose to determine
    relationships between namespaces.

    Eric suggested adding two ioctls [2]:
    > Grumble, Grumble. I think this may actually be a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctls.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
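
A userspace caller of the two ioctls might look like the sketch below. The wrapper names (`ns_owner_fd()`, `ns_parent_fd()`, `owner_of_ns_path()`) are hypothetical. The request numbers are the ones I believe <linux/nsfs.h> defines; they are repeated here only so the sketch builds where that header is unavailable, and should be checked against the installed header.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumed to match <linux/nsfs.h>; prefer including that header. */
#ifndef NSIO
#define NSIO 0xb7
#endif
#ifndef NS_GET_USERNS
#define NS_GET_USERNS _IO(NSIO, 0x1)
#endif
#ifndef NS_GET_PARENT
#define NS_GET_PARENT _IO(NSIO, 0x2)
#endif

/* New fd referring to the user namespace that owns ns_fd, or -1 (errno set). */
static int ns_owner_fd(int ns_fd)
{
    return ioctl(ns_fd, NS_GET_USERNS);
}

/* New fd referring to the parent of a pid or user namespace, or -1. */
static int ns_parent_fd(int ns_fd)
{
    return ioctl(ns_fd, NS_GET_PARENT);
}

/* Open a namespace file such as "/proc/self/ns/pid" and query its owner. */
static int owner_of_ns_path(const char *path)
{
    assert(path != NULL);

    int ns_fd = open(path, O_RDONLY);
    if (ns_fd < 0)
        return -1;

    int owner = ns_owner_fd(ns_fd);
    int saved_errno = errno;    /* close() may overwrite errno */

    close(ns_fd);
    errno = saved_errno;
    return owner;
}
```

As the man-page excerpt above notes, a caller should expect EPERM when the requested namespace lies outside its own scope and EINVAL when NS_GET_PARENT is used on a nonhierarchical namespace. The returned descriptor must be closed by the caller.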
     
  • Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships.

    In the future we will use this interface to dump and restore nested
    namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per-user, per-user-namespace
    limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for
    advice on linux-api and it was made clear that those were the wrong
    error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2016

1 commit


17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • The comments in zap_pid_ns_processes() are not clear, we need to explain
    how this code actually works.

    1. "Ignore SIGCHLD" looks like optimization but it is not, we also
    need this for correctness.

    2. The comment above sys_wait4() could tell more.

    EXIT_ZOMBIE child is only possible if it has exited before we
    ignored SIGCHLD. Or if it is traced from the parent namespace,
    but in this case it will be reaped by debugger after detach,
    sys_wait4() acts as a synchronization point.

    3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
    outdated. Contrary to what it says we do not need to make sure
    they all go away after 0a01f2cc390e "pidns: Make the pidns proc
    mount/umount logic obvious".

    At the same time, we do need to wait for nr_hashed==init_pids,
    but the reasons are quite different and not obvious: setns().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Dec, 2014

5 commits


03 Apr, 2014

1 commit

  • pidns_get()->get_pid_ns() can hit ns == NULL. This task_struct can't
    go away, but task_active_pid_ns(task) is NULL if release_task(task)
    was already called. Alternatively we could change get_pid_ns(ns) to
    check ns != NULL, but it seems that other callers are fine.

    Signed-off-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 Oct, 2013

1 commit


08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit


28 Aug, 2013

1 commit


02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

01 May, 2013

1 commit

  • Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
    simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

    [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
    Signed-off-by: Raphael S.Carvalho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S.Carvalho
     

26 Mar, 2013

1 commit

  • When a multi-threaded init exits and the initial thread is not the
    last thread to exit the initial thread hangs around as a zombie
    until the last thread exits. In that case zap_pid_ns_processes
    needs to wait until there are only 2 hashed pids in the pid
    namespace not one.

    v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me)
    as suggested by Oleg.

    Cc: stable@vger.kernel.org
    Cc: Oleg Nesterov
    Reported-by: Caj Larsson
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Dec, 2012

1 commit

  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user namespace of the target namespace.

    Which made the following nasty sequence possible.

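    pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
    if (pid == 0) { /* child */
            /* root in the new user namespace may bind mount */
            system("mount --bind /home/me/passwd /etc/passwd");
    }
    else if (pid != 0) { /* parent */
            char path[PATH_MAX];
            /* name the child's mount namespace */
            snprintf(path, sizeof(path), "/proc/%u/ns/mnt", pid);
            fd = open(path, O_RDONLY);
            /* join it while keeping the original user namespace */
            setns(fd, 0);
            system("su -");
    }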

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joining all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
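
The userspace test that a single inode per namespace makes possible can be sketched as below. The helper names (`same_file()`, `same_namespace()`) are hypothetical, and the sketch assumes /proc is mounted and that the caller may stat the other process's ns entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* 1 if both paths resolve to the same inode on the same device, 0 if not, -1 on error. */
static int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Compare /proc/<p1>/ns/<name> with /proc/<p2>/ns/<name>,
 * where name is e.g. "pid", "net" or "mnt".
 */
static int same_namespace(pid_t p1, pid_t p2, const char *name)
{
    char a[64], b[64];
    int na = snprintf(a, sizeof(a), "/proc/%ld/ns/%s", (long)p1, name);
    int nb = snprintf(b, sizeof(b), "/proc/%ld/ns/%s", (long)p2, name);

    if (na < 0 || (size_t)na >= sizeof(a) ||
        nb < 0 || (size_t)nb >= sizeof(b))
        return -1;
    return same_file(a, b);
}
```

Comparing both `st_dev` and `st_ino` rather than the inode number alone is a deliberate choice here, since inode numbers are only unique within one filesystem instance.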
     

19 Nov, 2012

4 commits

  • Unsharing of the pid namespace, unlike unsharing of other namespaces,
    does not take effect immediately. Instead it affects the children
    created with fork and clone. The first of these children becomes the init
    process of the new pid namespace, the rest become oddball children
    of pid 0. From the point of view of the new pid namespace the process
    that created it is pid 0, as its pid does not map.

    A couple of different semantics were considered but this one was
    settled on because it is easy to implement and it is usable from
    pam modules. The core reasons for the existence of unshare.

    I took a survey of the callers of pam modules and the following
    appears to be a representative sample of their logic.
    {
            setup stuff include pam
            child = fork();
            if (!child) {
                    setuid()
                    exec /bin/bash
            }
            waitpid(child);

            pam and other cleanup
    }

    As you can see there is a fork to create the unprivileged user
    space process. Which means that the unprivileged user space
    process will appear as pid 1 in the new pid namespace. Further
    most login processes do not cope with extraneous children which
    means shifting the duty of reaping extraneous child process to
    the creator of those extraneous children makes the system more
    comprehensible.

    The practical reason for this set of pid namespace semantics is
    that it is simple to implement and verify they work correctly.
    Whereas an implementation that requires changing the struct
    pid on a process comes with a lot more races and pain. Not
    the least of which is that glibc caches getpid().

    These semantics are implemented by having two notions
    of the pid namespace of a process. There is task_active_pid_ns,
    which is the pid namespace the process was created with
    and the pid namespace that all pids are presented to
    that process in. The task_active_pid_ns is stored
    in the struct pid of the task.

    Then there is the pid namespace that will be used for children;
    that pid namespace is stored in task->nsproxy->pid_ns.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
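
The semantics described above (the caller stays where it is, its first child becomes pid 1) can be observed with a short program. This is a sketch under assumptions: `first_child_pid_after_unshare()` is a hypothetical helper, CLONE_NEWUSER is added only so the call has a chance of working without privilege, and the raw syscall is used instead of the libc wrapper to avoid depending on feature-test macros. Many systems and sandboxes forbid the call, in which case the helper reports -1.

```c
#include <assert.h>
#include <linux/sched.h>        /* CLONE_NEWPID, CLONE_NEWUSER */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the pid that the first child created after unshare() sees for
 * itself (1 if the semantics hold), or -1 if the experiment could not run.
 * The unshare happens in a throwaway helper so the caller is unaffected.
 */
static int first_child_pid_after_unshare(void)
{
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        /* Only affects children created from here on, not the helper. */
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(255);

        pid_t child = fork();
        if (child < 0)
            _exit(255);
        if (child == 0) {
            /* Ask the kernel directly for our pid in our own namespace. */
            long self = syscall(SYS_getpid);
            _exit(self >= 0 && self < 255 ? (int)self : 254);
        }

        int st;
        if (waitpid(child, &st, 0) != child || !WIFEXITED(st))
            _exit(255);
        _exit(WEXITSTATUS(st));
    }

    int status;
    if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);
    assert(code >= 0 && code <= 255);
    return code == 255 ? -1 : code;
}
```

The pid is passed back through the exit status, which is why values that do not fit in it are collapsed to 254.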
     
  • Pid namespaces are designed to be inescapable so verify that the
    passed in pid namespace is a child of the currently active
    pid namespace or the currently active pid namespace itself.

    Allowing the currently active pid namespace is important so
    the effects of an earlier setns can be cancelled.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • task_active_pid_ns(current) != current->nsproxy->pid_ns will
    soon be allowed to support unshare and setns.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
    we create a child pid namespace of current->nsproxy->pid_ns. However
    that leads to strange cases like trying to have a single process be
    init in multiple pid namespaces, which is racy and hard to think
    about.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
    we create a child pid namespace of task_active_pid_ns(current). While
    that seems less racy it does not provide any utility.

    Therefore define the semantics of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns to be that the
    pid namespace creation fails. That is easy to implement and easy
    to think about.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Looking at pid_ns->nr_hashed is a bit simpler and it works for
    disjoint process trees that an unshare or a join of a pid_namespace
    may create.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman