20 Oct, 2007

4 commits

  • When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
    create a new struct pid for the new process. Allocate pid_t's for the new
    process in the new pid namespace and all ancestor pid namespaces. Make the
    newly cloned process the session and process group leader.

    Since the active pid namespace is special and expected to be the first entry
    in pid->upid_list, preserve the order of pid namespaces.

    The size of 'struct pid' is dependent on the the number of pid namespaces the
    process exists in, so we use multiple pid-caches'. Only one pid cache is
    created during system startup and this used by processes that exist only in
    init_pid_ns.

    When a process clones its pid namespace, we create additional pid caches as
    necessary and use the pid cache to allocate 'struct pids' for that depth.

    Note, that with this patch the newly created namespace won't work, since the
    rest of the kernel still uses global pids, but this is to be fixed soon. Init
    pid namespace still works.

    [oleg@tv-sign.ru: merge fix]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When someone wants to deal with some other taks's namespaces it has to lock
    the task and then to get the desired namespace if the one exists. This is
    slow on read-only paths and may be impossible in some cases.

    E.g. Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.

    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is rather rare operation and we
    can sacrifice its speed to solve the issues above.

    The access to other task namespaces is proposed to be performed
    like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
    / *
    * work with the namespaces here
    * e.g. get the reference on one of them
    * /
    } / *
    * NULL task_nsproxy() means that this task is
    * almost dead (zombie)
    * /
    rcu_read_unlock();

    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.

    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Serge Hallyn
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • With multiple pid namespaces, a process is known by some pid_t in every
    ancestor pid namespace. Every time the process forks, the child process also
    gets a pid_t in every ancestor pid namespace.

    While a process is visible in >=1 pid namespaces, it can see pid_t's in only
    one pid namespace. We call this pid namespace it's "active pid namespace",
    and it is always the youngest pid namespace in which the process is known.

    This patch defines and uses a wrapper to find the active pid namespace of a
    process. The implementation of the wrapper will be changed in when support
    for multiple pid namespaces are added.

    Changelog:
    2.6.22-rc4-mm2-pidns1:
    - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
    task_active_pid_ns() in child_reaper() since task->nsproxy
    can be NULL during task exit (so child_reaper() continues to
    use init_pid_ns).

    to implement child_reaper() since init_pid_ns.child_reaper to
    implement child_reaper() since tsk->nsproxy can be NULL during exit.

    2.6.21-rc6-mm1:
    - Rename task_pid_ns() to task_active_pid_ns() to reflect that a
    process can have multiple pid namespaces.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • When a task enters a new namespace via a clone() or unshare(), a new cgroup
    is created and the task moves into it.

    This version names cgroups which are automatically created using
    cgroup_clone() as "node_" where pid is the pid of the unsharing or
    cloned process. (Thanks Pavel for the idea) This is safe because if the
    process unshares again, it will create

    /cgroups/(...)/node_/node_

    The only possibilities (AFAICT) for a -EEXIST on unshare are

    1. pid wraparound
    2. a process fails an unshare, then tries again.

    Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
    node_ will be empty and can be rmdir'ed to make the subsequent unshare()
    succeed.

    Changelog:
    Name cloned cgroups as "node_".

    [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
    Signed-off-by: Serge E. Hallyn
    Cc: Paul Menage
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

17 Oct, 2007

1 commit


11 Oct, 2007

1 commit

  • This patch allows you to create a new network namespace
    using sys_clone, or sys_unshare.

    As the network namespace is still experimental and under development
    clone and unshare support is only made available when CONFIG_NET_NS is
    selected at compile time.

    As this patch introduces network namespace support into code paths
    that exist when the CONFIG_NET is not selected there are a few
    additions made to net_namespace.h to allow a few more functions
    to be used when the networking stack is not compiled in.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

17 Jul, 2007

6 commits

  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int. Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • It should improve performance in some scenarii where a lot of
    these nsproxy objects are created by unsharing namespaces. This is
    a typical use of virtual servers that are being created or entered.

    This is also a good tool to find leaks and gather statistics on
    namespace usage.

    Signed-off-by: Cedric Le Goater
    Cc: Herbert Poetzl
    Cc: Pavel Emelianov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • dup_mnt_ns() and clone_uts_ns() return NULL on failure. This is wrong,
    create_new_namespaces() uses ERR_PTR() to catch an error. This means that the
    subsequent create_new_namespaces() will hit BUG_ON() in copy_mnt_ns() or
    copy_utsname().

    Modify create_new_namespaces() to also use the errors returned by the
    copy_*_ns routines and not to systematically return ENOMEM.

    [oleg@tv-sign.ru: better changelog]
    Signed-off-by: Cedric Le Goater
    Cc: Serge E. Hallyn
    Cc: Badari Pulavarty
    Cc: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • This patch enables the unshare of user namespaces.

    It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which
    resets the current user_struct and adds a new root user (uid == 0)

    For now, unsharing the user namespace allows a process to reset its
    user_struct accounting and uid 0 in the new user namespace should be contained
    using appropriate means, for instance selinux

    The plan, when the full support is complete (all uid checks covered), is to
    keep the original user's rights in the original namespace, and let a process
    become uid 0 in the new namespace, with full capabilities to the new
    namespace.

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Cedric Le Goater
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.

    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that theses privileges
    should be controlled through some means (process capabilities ?)

    The unshare is not included in this patch.

    Changes since [try #4]:
    - Updated get_user_ns and put_user_ns to accept NULL, and
    get_user_ns to return the namespace.

    Changes since [try #3]:
    - moved struct user_namespace to files user_namespace.{c,h}

    Changes since [try #2]:
    - removed struct user_namespace* argument from find_user()

    Changes since [try #1]:
    - removed struct user_namespace* argument from find_user()
    - added a root_user per user namespace

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only
    deactivate the unshare of the uts and ipc namespaces and do not improve
    performance.

    Signed-off-by: Cedric Le Goater
    Acked-by: "Serge E. Hallyn"
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Pavel Emelianov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

24 Jun, 2007

1 commit

  • When a namespace is unshared, a refcount on the previous nsproxy is
    abusively taken, leading to a memory leak of nsproxy objects.

    Signed-off-by: Cedric Le Goater
    Cc: Badari Pulavarty
    Cc: Herbert Poetzl
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

09 May, 2007

1 commit

  • sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
    just incase.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

31 Jan, 2007

2 commits

  • This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in
    preparation for a better and simpler fix proposed by Eric Biederman
    (and fixed up by Serge Hallyn)

    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Fix exit race by splitting the nsproxy putting into two pieces. First
    piece reduces the nsproxy refcount. If we dropped the last reference, then
    it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else
    it returns NULL. The second piece of exiting task namespaces sets
    tsk->nsproxy to NULL, and drops the references to other namespaces and
    frees the nsproxy only if an nsproxy was passed in.

    A little awkward and should probably be reworked, but hopefully it fixes
    the NFS oops.

    Signed-off-by: Serge E. Hallyn
    Cc: Herbert Poetzl
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Cc: Daniel Hokka Zakrisson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

14 Dec, 2006

1 commit

  • This reverts commit 373beb35cd6b625e0ba4ad98baace12310a26aa8.

    No one is using this identifier yet. The purpose of this identifier is to
    export nsproxy to user space which is wrong. nsproxy is an internal
    implementation optimization, which should keep our fork times from getting
    slower as we increase the number of global namespaces you don't have to
    share.

    Adding a global identifier like this is inappropriate because it makes
    namespaces inherently non-recursive, greatly limiting what we can do with
    them in the future.

    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 Dec, 2006

3 commits

  • Add the pid namespace framework to the nsproxy object. The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.

    There is no configuration option to activate or deactivate this feature
    because this not relevant for the moment.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and
    allocated nsproxies are given -1.

    This identifier will be used by a new syscall sys_bind_ns.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developped for the containers : pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

21 Oct, 2006

1 commit


02 Oct, 2006

7 commits

  • This patch fixes copy_namespaces()'s error path.

    when new nsproxy (new_ns) is created pointers to namespaces (ipc, uts) are
    copied from the old nsproxy. Later in copy_utsname, copy_ipcs, etc.
    according namespaces are get-ed. On error path needed namespaces are
    put-ed, so there's no need to put new nsproxy itelf as it woud cause
    putting namespaces for the second time.

    Found when incorporating namespaces into OpenVZ kernel.

    Signed-off-by: Pavel Emelianov
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel
     
  • This patch set allows to unshare IPCs and have a private set of IPC objects
    (sem, shm, msg) inside namespace. Basically, it is another building block of
    containers functionality.

    This patch implements core IPC namespace changes:
    - ipc_namespace structure
    - new config option CONFIG_IPC_NS
    - adds CLONE_NEWIPC flag
    - unshare support

    [clg@fr.ibm.com: small fix for unshare of ipc namespace]
    [akpm@osdl.org: build fix]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare.

    [clg@fr.ibm.com: IPC unshare fix]
    [bunk@stusta.de: cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Adrian Bunk
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch defines the uts namespace and some manipulators.
    Adds the uts namespace to task_struct, and initializes a
    system-wide init namespace.

    It leaves a #define for system_utsname so sysctl will compile.
    This define will be removed in a separate patch.

    [akpm@osdl.org: build fix, cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This moves the mount namespace into the nsproxy. The mount namespace count
    now refers to the number of nsproxies point to it, rather than the number of
    tasks. As a result, the unshare_namespace() function in kernel/fork.c no
    longer checks whether it is being shared.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Move the init_nsproxy definition out of arch/ into kernel/nsproxy.c. This
    avoids all arches having to be updated. Compiles and boots on s390.

    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch adds a nsproxy structure to the task struct. Later patches will
    move the fs namespace pointer into this structure, and introduce a new utsname
    namespace into the nsproxy.

    The vserver and openvz functionality, then, would be implemented in large part
    by virtualizing/isolating more and more resources into namespaces, each
    contained in the nsproxy.

    [akpm@osdl.org: build fix]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn