08 Jul, 2020

1 commit

  • So far setns() was missing time namespace support. This was partially due
    to it simply not being implemented but also because vdso_join_timens()
    could still fail which made switching to multiple namespaces atomically
    problematic. This is now fixed so support CLONE_NEWTIME with setns()

    Signed-off-by: Christian Brauner
    Reviewed-by: Andrei Vagin
    Cc: Thomas Gleixner
    Cc: Michael Kerrisk
    Cc: Serge Hallyn
    Cc: Dmitry Safonov
    Link: https://lore.kernel.org/r/20200706154912.3248030-4-christian.brauner@ubuntu.com

    Christian Brauner
     

17 Jun, 2020

1 commit

  • The LTP testsuite reported a regression where users would now see EBADF
    returned instead of EINVAL when an fd was passed that referred to an open
    file but the file was not a nsfd. Fix this by continuing to report EINVAL.

    Reported-by: kernel test robot
    Cc: Jan Stancek
    Cc: Cyril Hrubis
    Link: https://lore.kernel.org/lkml/20200615085836.GR12456@shao2-debian
    Fixes: 303cc571d107 ("nsproxy: attach to namespaces via pidfds")
    Signed-off-by: Christian Brauner

    Christian Brauner
     

13 May, 2020

1 commit

  • For quite a while we have been thinking about using pidfds to attach to
    namespaces. This patchset has existed for about a year already but we've
    wanted to wait to see how the general api would be received and adopted.
    Now that more and more programs in userspace have started using pidfds
    for process management it's time to send this one out.

    This patch makes it possible to use pidfds to attach to the namespaces
    of another process, i.e. they can be passed as the first argument to the
    setns() syscall. When only a single namespace type is specified the
    semantics are equivalent to passing an nsfd. That means
    setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
    when a pidfd is passed, multiple namespace flags can be specified in the
    second setns() argument and setns() will attach the caller to all the
    specified namespaces all at once or to none of them. Specifying 0 is not
    valid together with a pidfd.

    Here are just two obvious examples:
    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);
    Allowing to also attach subsets of namespaces supports various use-cases
    where callers setns to a subset of namespaces to retain privilege, perform
    an action and then re-attach another subset of namespaces.

    If the need arises, as Eric suggested, we can extend this patchset to
    assume even more context than just attaching all namespaces. His suggestion
    specifically was about assuming the process' root directory when
    setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
    keep it flexible in terms of supporting subsets of namespaces but let's
    wait until we have users asking for even more context to be assumed. At
    that point we can add an extension.

    The obvious example where this is useful is a standard container
    manager interacting with a running container: pushing and pulling files
    or directories, injecting mounts, attaching/execing any kind of process,
    managing network devices all these operations require attaching to all
    or at least multiple namespaces at the same time. Given that nowadays
    most containers are spawned with all namespaces enabled we're currently
    looking at at least 14 syscalls, 7 to open the /proc//ns/
    nsfds, another 7 to actually perform the namespace switch. With time
    namespaces we're looking at about 16 syscalls.
    (We could amortize the first 7 or 8 syscalls for opening the nsfds by
    stashing them in each container's monitor process but that would mean
    we need to send around those file descriptors through unix sockets
    everytime we want to interact with the container or keep on-disk
    state. Even in scenarios where a caller wants to join a particular
    namespace in a particular order callers still profit from batching
    other namespaces. That mostly applies to the user namespace but
    all container runtimes I found join the user namespace first no matter
    if it privileges or deprivileges the container similar to how unshare
    behaves.)
    With pidfds this becomes a single syscall no matter how many namespaces
    are supposed to be attached to.

    A decently designed, large-scale container manager usually isn't the
    parent of any of the containers it spawns so the containers don't die
    when it crashes or needs to update or reinitialize. This means that
    for the manager to interact with containers through pids is inherently
    racy especially on systems where the maximum pid number is not
    significicantly bumped. This is even more problematic since we often spawn
    and manage thousands or ten-thousands of containers. Interacting with a
    container through a pid thus can become risky quite quickly. Especially
    since we allow for an administrator to enable advanced features such as
    syscall interception where we're performing syscalls in lieu of the
    container. In all of those cases we use pidfds if they are available and
    we pass them around as stable references. Using them to setns() to the
    target process' namespaces is as reliable as using nsfds. Either the
    target process is already dead and we get ESRCH or we manage to attach
    to its namespaces but we can't accidently attach to another process'
    namespaces. So pidfds lend themselves to be used with this api.
    The other main advantage is that with this change the pidfd becomes the
    only relevant token for most container interactions and it's the only
    token we need to create and send around.

    Apart from significiantly reducing the number of syscalls from double
    digit to single digit which is a decent reason post-spectre/meltdown
    this also allows to switch to a set of namespaces atomically, i.e.
    either attaching to all the specified namespaces succeeds or we fail. If
    we fail we haven't changed a single namespace. There are currently three
    namespaces that can fail (other than for ENOMEM which really is not
    very interesting since we then have other problems anyway) for
    non-trivial reasons, user, mount, and pid namespaces. We can fail to
    attach to a pid namespace if it is not our current active pid namespace
    or a descendant of it. We can fail to attach to a user namespace because
    we are multi-threaded or because our current mount namespace shares
    filesystem state with other tasks, or because we're trying to setns()
    to the same user namespace, i.e. the target task has the same user
    namespace as we do. We can fail to attach to a mount namespace because
    it shares filesystem state with other tasks or because we fail to lookup
    the new root for the new mount namespace. In most non-pathological
    scenarios these issues can be somewhat mitigated. But there are cases where
    we're half-attached to some namespace and failing to attach to another one.
    I've talked about some of these problem during the hallway track (something
    only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
    in 2018(?). Even if all these issues could be avoided with super careful
    userspace coding it would be nicer to have this done in-kernel. Pidfds seem
    to lend themselves nicely for this.

    The other neat thing about this is that setns() becomes an actual
    counterpart to the namespace bits of unshare().

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com

    Christian Brauner
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all necessary pieces to switch to a new
    set of namespaces without leaving a task in a half-switched state which we
    will make use of in the next patch. This patch switches the existing setns
    logic over without causing a change in setns() behavior. This brings
    setns() closer to how unshare() works(). The prepare_ns() function is
    responsible to prepare all necessary information. This has two reasons.
    First it minimizes dependencies between individual namespaces, i.e. all
    install handler can expect that all fields are properly initialized
    independent in what order they are called in. Second, this makes the code
    easier to maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to use a flags argument
    in the next patch. Here it will still use nstype as a simple integer
    argument which was argued would be clearer. I'm not particularly
    opinionated about this if it really helps or not. The struct nsset itself
    already contains the flags field since its name already indicates that it
    can contain information required by different namespaces. None of this
    should have functional consequences.

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
     

14 Jan, 2020

1 commit

  • Time Namespace isolates clock values.

    The kernel provides access to several clocks CLOCK_REALTIME,
    CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

    CLOCK_REALTIME
    System-wide clock that measures real (i.e., wall-clock) time.

    CLOCK_MONOTONIC
    Clock that cannot be set and represents monotonic time since
    some unspecified starting point.

    CLOCK_BOOTTIME
    Identical to CLOCK_MONOTONIC, except it also includes any time
    that the system is suspended.

    For many users, the time namespace means the ability to changes date and
    time in a container (CLOCK_REALTIME). Providing per namespace notions of
    CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
    value.

    But in the context of checkpoint/restore functionality, monotonic and
    boottime clocks become interesting. Both clocks are monotonic with
    unspecified starting points. These clocks are widely used to measure time
    slices and set timers. After restoring or migrating processes, it has to be
    guaranteed that they never go backward. In an ideal case, the behavior of
    these clocks should be the same as for a case when a whole system is
    suspended. All this means that it is required to set CLOCK_MONOTONIC and
    CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
    offsets for clocks.

    A time namespace is similar to a pid namespace in the way how it is
    created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
    but doesn't set it to the current process. Then all children of the process
    will be born in the new time namespace, or a process can use the setns()
    system call to join a namespace.

    This scheme allows setting clock offsets for a namespace, before any
    processes appear in it.

    All available clone flags have been used, so CLONE_NEWTIME uses the highest
    bit of CSIGNAL. It means that it can be used only with the unshare() and
    the clone3() system calls.

    [ tglx: Adjusted paragraph about clone3() to reality and massaged the
    changelog a bit. ]

    Co-developed-by: Dmitry Safonov
    Signed-off-by: Andrei Vagin
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Link: https://criu.org/Time_namespace
    Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
    Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com

    Andrei Vagin
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

14 Mar, 2017

1 commit

  • With the advert of container technologies like docker, that depend on
    namespaces for isolation, there is a need for tracing support for
    namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
    recording namespaces related info. By recording info for every
    namespace, it is left to userspace to take a call on the definition of a
    container and trace containers by updating perf tool accordingly.

    Each namespace has a combination of device and inode numbers. Though
    every namespace has the same device number currently, that may change in
    future to avoid the need for a namespace of namespaces. Considering such
    possibility, record both device and inode numbers separately for each
    namespace.

    Signed-off-by: Hari Bathini
    Acked-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Ananth N Mavinakayanahalli
    Cc: Aravinda Prasad
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: Eric Biederman
    Cc: Sargun Dhillon
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Hari Bathini
     

17 Feb, 2016

1 commit

  • Introduce the ability to create new cgroup namespace. The newly created
    cgroup namespace remembers the cgroup of the process at the point
    of creation of the cgroup namespace (referred as cgroupns-root).
    The main purpose of cgroup namespace is to virtualize the contents
    of /proc/self/cgroup file. Processes inside a cgroup namespace
    are only able to see paths relative to their namespace root
    (unless they are moved outside of their cgroupns-root, at which point
    they will see a relative path from their cgroupns-root).
    For a correctly setup container this enables container-tools
    (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
    containers without leaking system level cgroup hierarchy to the task.
    This patch only implements the 'unshare' part of the cgroupns.

    Signed-off-by: Aditya Kali
    Signed-off-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aditya Kali
     

05 Dec, 2014

2 commits


30 Jul, 2014

1 commit

  • The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.

    Upon inspect nsproxy no longer needs rcu protection for remote reads.
    remote reads are rare. So optimize for same process reads and write
    by switching using rask_lock instead.

    This yields a simpler to understand lock, and a faster setns system call.

    In particular this fixes a performance regression observed
    by Rafael David Tinoco .

    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007. The race this originialy fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit


28 Aug, 2013

1 commit


27 Aug, 2013

1 commit


02 May, 2013

1 commit


27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit


20 Nov, 2012

4 commits


19 Nov, 2012

5 commits

  • This will allow for support for unprivileged mounts in a new user namespace.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Unsharing of the pid namespace unlike unsharing of other namespaces
    does not take affect immediately. Instead it affects the children
    created with fork and clone. The first of these children becomes the init
    process of the new pid namespace, the rest become oddball children
    of pid 0. From the point of view of the new pid namespace the process
    that created it is pid 0, as it's pid does not map.

    A couple of different semantics were considered but this one was
    settled on because it is easy to implement and it is usable from
    pam modules. The core reasons for the existence of unshare.

    I took a survey of the callers of pam modules and the following
    appears to be a representative sample of their logic.
    {
    setup stuff include pam
    child = fork();
    if (!child) {
    setuid()
    exec /bin/bash
    }
    waitpid(child);

    pam and other cleanup
    }

    As you can see there is a fork to create the unprivileged user
    space process. Which means that the unprivileged user space
    process will appear as pid 1 in the new pid namespace. Further
    most login processes do not cope with extraneous children which
    means shifting the duty of reaping extraneous child process to
    the creator of those extraneous children makes the system more
    comprehensible.

    The practical reason for this set of pid namespace semantics is
    that it is simple to implement and verify they work correctly.
    Whereas an implementation that requres changing the struct
    pid on a process comes with a lot more races and pain. Not
    the least of which is that glibc caches getpid().

    These semantics are implemented by having two notions
    of the pid namespace of a proces. There is task_active_pid_ns
    which is the pid namspace the process was created with
    and the pid namespace that all pids are presented to
    that process in. The task_active_pid_ns is stored
    in the struct pid of the task.

    Then there is the pid namespace that will be used for children
    that pid namespace is stored in task->nsproxy->pid_ns.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
    aka ns_of_pid(task_pid(tsk)) should have the same number of
    cache line misses with the practical difference that
    ns_of_pid(task_pid(tsk)) is released later in a processes life.

    Furthermore by using task_active_pid_ns it becomes trivial
    to write an unshare implementation for the the pid namespace.

    So I have used task_active_pid_ns everywhere I can.

    In fork since the pid has not yet been attached to the
    process I use ns_of_pid, to achieve the same effect.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Capture the the user namespace that creates the pid namespace
    - Use that user namespace to test if it is ok to write to
    /proc/sys/kernel/ns_last_pid.

    Zhao Hongjiang noticed I was missing a put_user_ns
    in when destroying a pid_ns. I have foloded his patch into this one
    so that bisects will work properly.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The user namespace which creates a new network namespace owns that
    namespace and all resources created in it. This way we can target
    capability checks for privileged operations against network resources to
    the user_ns which created the network namespace in which the resource
    lives. Privilege to the user namespace which owns the network
    namespace, or any parent user namespace thereof, provides the same
    privilege to the network resource.

    This patch is reworked from a version originally by
    Serge E. Hallyn

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include
    +#include

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

20 Jul, 2011

1 commit


27 May, 2011

1 commit

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling in a exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

11 May, 2011

1 commit

  • With the networking stack today there is demand to handle
    multiple network stacks at a time. Not in the context
    of containers but in the context of people doing interesting
    things with routing.

    There is also demand in the context of containers to have
    an efficient way to execute some code in the container itself.
    If nothing else it is very useful ad a debugging technique.

    Both problems can be solved by starting some form of login
    daemon in the namespaces people want access to, or you
    can play games by ptracing a process and getting the
    traced process to do things you want it to do. However
    it turns out that a login daemon or a ptrace puppet
    controller are more code, they are more prone to
    failure, and generally they are less efficient than
    simply changing the namespace of a process to a
    specified one.

    Pieces of this puzzle can also be solved by instead of
    coming up with a general purpose system call coming up
    with targed system calls perhaps socketat that solve
    a subset of the larger problem. Overall that appears
    to be more work for less reward.

    int setns(int fd, int nstype);

    The fd argument is a file descriptor referring to a proc
    file of the namespace you want to switch the process to.

    In the setns system call the nstype is 0 or specifies
    an clone flag of the namespace you intend to change
    to prevent changing a namespace unintentionally.

    v2: Most of the architecture support added by Daniel Lezcano
    v3: ported to v2.6.36-rc4 by: Eric W. Biederman
    v4: Moved wiring up of the system call to another patch
    v5: Cleaned up the system call arguments
    - Changed the order.
    - Modified nstype to take the standard clone flags.
    v6: Added missing error handling as pointed out by Matt Helsley

    Acked-by: Daniel Lezcano
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

24 Mar, 2011

4 commits

  • CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
    because the resource comes from current's own ipc namespace.

    setuid/setgid are to uids in own namespace, so again checks can be against
    current_user_ns().

    Changelog:
    Jan 11: Use task_ns_capable() in place of sched_capable().
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Clarify (hopefully) some logic in futex and sched.c
    Feb 15: use ns_capable for ipc, not nsown_capable
    Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
    Feb 23: pass ns down rather than taking it from current

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 15: Don't set new ipc->user_ns if we didn't create a new
    ipc_ns.
    Feb 23: Move extern declaration to ipc_namespace.h, and group
    fwd declarations at top.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 23: let clone_uts_ns() handle setting uts->user_ns
    To do so we need to pass in the task_struct who'll
    get the utsname, so we can get its user_ns.
    Feb 23: As per Oleg's coment, just pass in tsk, instead of two
    of its members.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may be unuseful.

    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do. They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initila user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

1 commit

  • Remove INIT_NSPROXY(), use C99 initializer.
    Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it.

    Note: headers trim will be done later, now it's quite pointless because
    results will be invalidated by merge window.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 Jun, 2009

1 commit

  • clone_nsproxy() does useless copying of old nsproxy -- every pointer will
    be rewritten to new ns or to old ns. Remove copying, rename
    clone_nsproxy(), create_nsproxy() will be used by C/R code to create fresh
    nsproxy on restart.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

25 Nov, 2008

1 commit

  • The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be). Corresponding ecryptfs fixes (by David Howells) are
    here as well.

    Fix refcounting. The following rules now apply:
    1. The task pins the user struct.
    2. The user struct pins its user namespace.
    3. The user namespace pins the struct user which created it.

    User namespaces are cloned during copy_creds(). Unsharing a new user_ns
    is no longer possible. (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).

    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.

    This incorporates a previous patch by David Howells. Here
    is his original patch description:

    >I suggest adding the attached incremental patch. It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    > namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    > with the other associated functions and because the 'new' in the name is
    > superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    > beginning of do_fork() so that they're done prior to making any attempts
    > at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    > to fill in rather than have it return the new root user. I don't imagine
    > the new root user being used for anything other than filling in a cred
    > struct.
    >
    > This also permits me to get rid of a get_uid() and a free_uid(), as the
    > reference the creds were holding on the old user_struct can just be
    > transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    > preparation rather than doing it in copy_creds().
    >
    >David

    >Signed-off-by: David Howells

    Changelog:
    Oct 20: integrate dhowells comments
    1. leave thread_keyring alone
    2. use current_user_ns() in set_user()

    Signed-off-by: Serge Hallyn

    Serge Hallyn