20 Jul, 2011

1 commit


27 May, 2011

1 commit

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling into an exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.
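
    As a rough userspace sketch of that manual step (not part of this
    patch): the helper below creates a cgroup directory and writes a pid
    into its 'tasks' file. The name cgroup_add_task() and any path passed
    to it are hypothetical. On a real cgroup mount the kernel provides the
    'tasks' file; setting the 'clone_children' flag on the parent is left
    out here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: create the cgroup directory `cgroup_dir` if it does
 * not exist yet, then add `pid` to it by writing the pid to its 'tasks'
 * file. Returns 0 on success, -1 on failure with errno set.
 */
static int cgroup_add_task(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    FILE *tasks;
    int n;

    assert(cgroup_dir != NULL);

    /* An already existing cgroup directory is not an error. */
    if (mkdir(cgroup_dir, 0755) != 0 && errno != EEXIST)
        return -1;

    n = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    tasks = fopen(path, "w");
    if (tasks == NULL)
        return -1;
    if (fprintf(tasks, "%d\n", (int)pid) < 0) {
        fclose(tasks);
        return -1;
    }
    /* fclose() flushes the write, so its result matters too. */
    return fclose(tasks) == 0 ? 0 : -1;
}
```

    Userspace would call this with the directory of the new cgroup and the
    pid of the task that entered the new namespace.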

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

01 Apr, 2009

1 commit


25 Nov, 2008

1 commit

  • The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be). Corresponding ecryptfs fixes (by David Howells) are
    here as well.

    Fix refcounting. The following rules now apply:
    1. The task pins the user struct.
    2. The user struct pins its user namespace.
    3. The user namespace pins the struct user which created it.
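
    The three pinning rules can be modelled in userspace as a chain of
    reference counts. This is only a toy sketch: the toy_* types and
    functions and the live_objects counter are made up for illustration
    and are not the kernel's user_struct or user_namespace code.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_user_ns;

struct toy_user {
    int refs;
    struct toy_user_ns *ns;       /* rule 2: a user pins its namespace */
};

struct toy_user_ns {
    int refs;
    struct toy_user *creator;     /* rule 3: a namespace pins its creator */
};

static int live_objects;          /* bookkeeping for the demonstration only */

/* The returned reference stands for the task's pin on the user (rule 1). */
static struct toy_user *toy_new_user(struct toy_user_ns *ns)
{
    struct toy_user *u = calloc(1, sizeof(*u));

    if (u == NULL)
        return NULL;
    u->refs = 1;
    u->ns = ns;
    if (ns != NULL)
        ns->refs++;
    live_objects++;
    return u;
}

/* The returned reference belongs to the caller; creator may be NULL. */
static struct toy_user_ns *toy_new_ns(struct toy_user *creator)
{
    struct toy_user_ns *ns = calloc(1, sizeof(*ns));

    if (ns == NULL)
        return NULL;
    ns->refs = 1;
    ns->creator = creator;
    if (creator != NULL)
        creator->refs++;
    live_objects++;
    return ns;
}

static void toy_put_user(struct toy_user *u);

static void toy_put_ns(struct toy_user_ns *ns)
{
    if (ns == NULL)
        return;
    assert(ns->refs > 0);
    if (--ns->refs > 0)
        return;
    toy_put_user(ns->creator);    /* last reference: unpin the creator */
    free(ns);
    live_objects--;
}

static void toy_put_user(struct toy_user *u)
{
    if (u == NULL)
        return;
    assert(u->refs > 0);
    if (--u->refs > 0)
        return;
    toy_put_ns(u->ns);            /* last reference: unpin the namespace */
    free(u);
    live_objects--;
}
```

    In this model, dropping the last task reference on a user inside a
    child namespace releases the namespace, which in turn releases the
    user that created it.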

    User namespaces are cloned during copy_creds(). Unsharing a new user_ns
    is no longer possible. (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).

    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.

    This incorporates a previous patch by David Howells. Here
    is his original patch description:

    >I suggest adding the attached incremental patch. It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    > namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    > with the other associated functions and because the 'new' in the name is
    > superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    > beginning of do_fork() so that they're done prior to making any attempts
    > at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    > to fill in rather than have it return the new root user. I don't imagine
    > the new root user being used for anything other than filling in a cred
    > struct.
    >
    > This also permits me to get rid of a get_uid() and a free_uid(), as the
    > reference the creds were holding on the old user_struct can just be
    > transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    > preparation rather than doing it in copy_creds().
    >
    >David

    >Signed-off-by: David Howells

    Changelog:
    Oct 20: integrate dhowells comments
    1. leave thread_keyring alone
    2. use current_user_ns() in set_user()

    Signed-off-by: Serge Hallyn

    Serge Hallyn
     

26 Jul, 2008

1 commit

  • cgroup_clone creates a new cgroup with the pid of the task. This works
    correctly for unshare, but for clone cgroup_clone is called from
    copy_namespaces inside copy_process, which happens before the new pid is
    created. As a result, the new cgroup was created with current's pid.
    This patch:

    1. Moves the call inside copy_process to after the new pid
    is created
    2. Passes the struct pid into ns_cgroup_clone (as it is not
    yet attached to the task)
    3. Passes a name from ns_cgroup_clone() into cgroup_clone()
    so as to keep cgroup_clone() itself simpler
    4. Uses pid_vnr() to get the process id value, so that the
    pid used to name the new cgroup is always the pid as it
    would be known to the task which did the cloning or
    unsharing. I think that is the most intuitive thing to
    do. This way, if task t1 does clone(CLONE_NEWPID) to get
    t2, which does clone(CLONE_NEWPID) to get t3, then the
    cgroup for t3 will be named for the pid by which t2 knows
    t3.
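
    To make point 4 concrete, here is a toy userspace model of a pid that
    has one number per nested pid namespace. The struct, the function name
    and the pid values are invented for illustration; the kernel's
    struct pid and pid_vnr() are more involved.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4

struct toy_pid {
    int level;                  /* depth of the namespace the task lives in */
    int nr[TOY_MAX_LEVEL];      /* nr[i]: the task's pid as seen from depth i */
};

/*
 * Return the pid number of `pid` as seen from a namespace at depth
 * `viewer_level`, or 0 if the task is not visible from there.
 */
static int toy_pid_nr_at(const struct toy_pid *pid, int viewer_level)
{
    assert(pid != NULL);
    assert(pid->level >= 0 && pid->level < TOY_MAX_LEVEL);

    if (viewer_level < 0 || viewer_level > pid->level)
        return 0;
    return pid->nr[viewer_level];
}
```

    With t1 at depth 0, t2 at depth 1 and t3 at depth 2, the cgroup for t3
    would be named after toy_pid_nr_at(&t3, 1), the number by which t2
    knows t3, rather than after t3's number at depth 0 or depth 2.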

    (Thanks to Dan Smith for finding the main bug)

    Changelog:
    June 11: Incorporate Paul Menage's feedback: don't pass
    NULL to ns_cgroup_clone from unshare, and reduce
    patch size by using 'nodename' in cgroup_clone.
    June 10: Original version

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge Hallyn
    Acked-by: Paul Menage
    Tested-by: Dan Smith
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

20 Oct, 2007

2 commits

  • When someone wants to deal with some other task's namespaces, it has to lock
    the task and then get the desired namespace, if one exists. This is
    slow on read-only paths and may be impossible in some cases.

    E.g. Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces: when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there because the code runs under the write-locked tasklist lock.

    On the other hand, switching the namespace on a task (daemonize) and releasing
    the namespace (after the last task exits) are rather rare operations and we
    can sacrifice their speed to solve the issues above.

    The access to other task namespaces is proposed to be performed
    like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
            /*
             * work with the namespaces here
             * e.g. get the reference on one of them
             */
    } /*
       * NULL task_nsproxy() means that this task is
       * almost dead (zombie)
       */
    rcu_read_unlock();

    This patch has passed the review by Eric and Oleg :) and,
    of course, has been tested.

    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Serge Hallyn
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When a task enters a new namespace via a clone() or unshare(), a new cgroup
    is created and the task moves into it.

    This version names cgroups which are automatically created using
    cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
    cloned process. (Thanks Pavel for the idea) This is safe because if the
    process unshares again, it will create

    /cgroups/(...)/node_<pid>/node_<pid>

    The only possibilities (AFAICT) for a -EEXIST on unshare are

    1. pid wraparound
    2. a process fails an unshare, then tries again.

    Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
    node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
    succeed.
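
    The case 2 recovery can be imitated from userspace on an ordinary
    directory tree. The sketch below is an assumption-laden model, not the
    kernel's cgroup_clone(): make_node_dir() is a hypothetical helper that
    builds a name from "node_" followed by the pid, and on -EEXIST removes
    the stale directory (which only works if it is empty) and retries once.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create "<parent>/node_<pid>" and store the path in
 * `out`. If the directory already exists, try to rmdir() it and create it
 * again. Returns 0 on success, -1 on failure with errno set.
 */
static int make_node_dir(const char *parent, pid_t pid, char *out, size_t outlen)
{
    int n;

    assert(parent != NULL && out != NULL);

    n = snprintf(out, outlen, "%s/node_%d", parent, (int)pid);
    if (n < 0 || (size_t)n >= outlen) {
        errno = ENAMETOOLONG;
        return -1;
    }

    if (mkdir(out, 0755) == 0)
        return 0;
    if (errno != EEXIST)
        return -1;

    /* Stale node left by a failed attempt: removable only when empty. */
    if (rmdir(out) != 0)
        return -1;
    return mkdir(out, 0755);
}
```

    Calling the helper again with the returned path as the parent yields
    the nested layout described above for a process that unshares twice.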

    Changelog:
    Name cloned cgroups as "node_<pid>".

    [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
    Signed-off-by: Serge E. Hallyn
    Cc: Paul Menage
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

17 Oct, 2007

1 commit

  • The nslock spinlock is not used in the kernel at all. Remove it.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

11 Oct, 2007

1 commit


17 Jul, 2007

2 commits

  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int, which is weird because the syscall
    uses unsigned long and we need at least an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.
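
    A small userspace illustration of why int is the wrong type, assuming
    a 32-bit int: a flag in bit 31 does not fit, so a round trip through
    int changes the value once it is widened again. The function name is
    made up, and the result of the out-of-range conversion is
    implementation-defined, so this shows what common compilers do rather
    than guaranteed behaviour.

```c
#include <assert.h>
#include <limits.h>

/*
 * Models a code path that stores clone flags in an int and later hands
 * them back as unsigned long. With a 32-bit int, bit 31 typically turns
 * the int negative, and widening it again sign-extends the value.
 */
static unsigned long roundtrip_through_int(unsigned long flags)
{
    int narrowed;

    /* The flags modelled here occupy only the low 32 bits. */
    assert(flags <= 0xffffffffUL);

    narrowed = (int)flags;              /* implementation-defined if too large */
    return (unsigned long)narrowed;
}
```

    A flag in bit 30 survives the round trip unchanged; a flag in bit 31
    generally does not when unsigned long is wider than int.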

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.

    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that these privileges
    should be controlled through some means (process capabilities?)

    The unshare is not included in this patch.

    Changes since [try #4]:
    - Updated get_user_ns and put_user_ns to accept NULL, and
    get_user_ns to return the namespace.

    Changes since [try #3]:
    - moved struct user_namespace to files user_namespace.{c,h}

    Changes since [try #2]:
    - removed struct user_namespace* argument from find_user()

    Changes since [try #1]:
    - removed struct user_namespace* argument from find_user()
    - added a root_user per user namespace

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

09 May, 2007

1 commit

  • sys_clone() and sys_unshare() both make copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    the caller to attach it to the right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
    just in case.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

31 Jan, 2007

2 commits

  • This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in
    preparation for a better and simpler fix proposed by Eric Biederman
    (and fixed up by Serge Hallyn)

    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Fix exit race by splitting the nsproxy putting into two pieces. First
    piece reduces the nsproxy refcount. If we dropped the last reference, then
    it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else
    it returns NULL. The second piece of exiting task namespaces sets
    tsk->nsproxy to NULL, and drops the references to other namespaces and
    frees the nsproxy only if an nsproxy was passed in.
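
    The split can be sketched as two small functions. This is a toy model
    under stated assumptions: the toy_* names are hypothetical, and flags
    stand in for dropping the mount namespace and for freeing the nsproxy,
    so that the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_nsproxy {
    int count;        /* reference count */
    int mnt_ns_put;   /* set when the mount namespace reference is dropped */
    int released;     /* set when the other namespaces are dropped and the
                         nsproxy would be freed */
};

struct toy_task {
    struct toy_nsproxy *nsproxy;
};

/*
 * Piece 1: drop the task's reference. If it was the last one, put the
 * mount namespace and return the nsproxy as a hint; otherwise return NULL.
 */
static struct toy_nsproxy *toy_drop_nsproxy_ref(struct toy_task *tsk)
{
    struct toy_nsproxy *ns = tsk->nsproxy;

    if (ns == NULL)
        return NULL;
    assert(ns->count > 0);
    if (--ns->count > 0)
        return NULL;
    ns->mnt_ns_put = 1;
    return ns;
}

/*
 * Piece 2: detach the nsproxy from the task. Only when piece 1 returned a
 * hint are the remaining namespaces dropped and the nsproxy released.
 */
static void toy_exit_task_namespaces(struct toy_task *tsk,
                                     struct toy_nsproxy *last)
{
    tsk->nsproxy = NULL;
    if (last != NULL)
        last->released = 1;   /* a real implementation would free here */
}
```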

    A little awkward and should probably be reworked, but hopefully it fixes
    the NFS oops.

    Signed-off-by: Serge E. Hallyn
    Cc: Herbert Poetzl
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Cc: Daniel Hokka Zakrisson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

14 Dec, 2006

1 commit

  • This reverts commit 373beb35cd6b625e0ba4ad98baace12310a26aa8.

    No one is using this identifier yet. The purpose of this identifier is to
    export nsproxy to user space which is wrong. nsproxy is an internal
    implementation optimization, which should keep our fork times from getting
    slower as we increase the number of global namespaces you don't have to
    share.

    Adding a global identifier like this is inappropriate because it makes
    namespaces inherently non-recursive, greatly limiting what we can do with
    them in the future.

    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 Dec, 2006

3 commits

  • Add the pid namespace framework to the nsproxy object. The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.

    There is no configuration option to activate or deactivate this feature
    because this is not relevant for the moment.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and
    allocated nsproxies are given -1.

    This identifier will be used by a new syscall sys_bind_ns.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developed for the containers: pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'.

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

26 Nov, 2006

1 commit

  • The OpenVZ developers team has encountered the following problem in the
    2.6.19-rc6 kernel. After running this script for a few seconds

    while [[ 1 ]]
    do
            find /proc -name mountstats | xargs cat
    done

    this Oops appears:

    BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
    printing eip:
    c01a6b70
    *pde = 00000000
    Oops: 0000 [#1]
    SMP
    Modules linked in: xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle
    iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables x_tables
    parport_pc lp parport sunrpc af_packet thermal processor fan button battery
    asus_acpi ac ohci_hcd ehci_hcd usbcore i2c_nforce2 i2c_core tg3 floppy
    pata_amd
    ide_cd cdrom sata_nv libata
    CPU: 1
    EIP: 0060:[] Not tainted VLI
    EFLAGS: 00010246 (2.6.19-rc6 #2)
    EIP is at mountstats_open+0x70/0xf0
    eax: 00000000 ebx: e6247030 ecx: e62470f8 edx: 00000000
    esi: 00000000 edi: c01a6b00 ebp: c33b83c0 esp: f4105eb4
    ds: 007b es: 007b ss: 0068
    Process cat (pid: 6044, ti=f4105000 task=f4104a70 task.ti=f4105000)
    Stack: c33b83c0 c04ee940 f46a4a80 c33b83c0 e4df31b4 c01a6b00 f4105000 c0169231
    e4df31b4 c33b83c0 c33b83c0 f4105f20 00000003 f4105000 c0169445 f2503cf0
    f7f8c4c0 00008000 c33b83c0 00000000 00008000 c0169350 f4105f20 00008000
    Call Trace:
    [] mountstats_open+0x0/0xf0
    [] __dentry_open+0x181/0x250
    [] nameidata_to_filp+0x35/0x50
    [] do_filp_open+0x50/0x60
    [] seq_read+0xc6/0x300
    [] get_unused_fd+0x31/0xc0
    [] do_sys_open+0x63/0x110
    [] sys_open+0x27/0x30
    [] sysenter_past_esp+0x56/0x79
    =======================
    Code: 45 74 8b 54 24 20 89 44 24 08 8b 42 f0 31 d2 e8 47 cb f8 ff 85 c0 89 c3
    74 51 8d 80 a0 04 00 00 e8 46 06 2c 00 8b 83 48 04 00 00 78 10 85 ff 74
    03
    f0 ff 07 b0 01 86 83 a0 04 00 00 f0 ff 4b
    EIP: [] mountstats_open+0x70/0xf0 SS:ESP 0068:f4105eb4

    The problem is that task->nsproxy can be NULL for some time during
    task exit. This patch fixes the BUG.
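
    The shape of such a fix is a NULL check before the dereference. The
    sketch below is a userspace toy, not the actual patch: the toy_*
    types, the function name and the choice of -EINVAL as the error code
    are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_nsproxy {
    int mnt_ns_id;                 /* stand-in for the mount namespace */
};

struct toy_task {
    struct toy_nsproxy *nsproxy;   /* may be NULL while the task exits */
};

/*
 * Look up the mount namespace of `task`. Returns 0 and stores the id in
 * `*mnt_ns_id`, or a negative errno value if the task is already exiting
 * and has no nsproxy any more.
 */
static int toy_mountstats_open(const struct toy_task *task, int *mnt_ns_id)
{
    assert(task != NULL && mnt_ns_id != NULL);

    if (task->nsproxy == NULL)
        return -EINVAL;            /* do not dereference a NULL nsproxy */

    *mnt_ns_id = task->nsproxy->mnt_ns_id;
    return 0;
}
```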

    Signed-off-by: Vasily Tarasov
    Cc: Herbert Poetzl
    Cc: "Serge E. Hallyn"
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Tarasov
     

02 Oct, 2006

4 commits

  • This patch set allows unsharing IPCs and having a private set of IPC objects
    (sem, shm, msg) inside a namespace. Basically, it is another building block of
    containers functionality.

    This patch implements core IPC namespace changes:
    - ipc_namespace structure
    - new config option CONFIG_IPC_NS
    - adds CLONE_NEWIPC flag
    - unshare support

    [clg@fr.ibm.com: small fix for unshare of ipc namespace]
    [akpm@osdl.org: build fix]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • This patch defines the uts namespace and some manipulators.
    Adds the uts namespace to task_struct, and initializes a
    system-wide init namespace.

    It leaves a #define for system_utsname so sysctl will compile.
    This define will be removed in a separate patch.

    [akpm@osdl.org: build fix, cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This moves the mount namespace into the nsproxy. The mount namespace count
    now refers to the number of nsproxies pointing to it, rather than the number of
    tasks. As a result, the unshare_namespace() function in kernel/fork.c no
    longer checks whether it is being shared.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch adds a nsproxy structure to the task struct. Later patches will
    move the fs namespace pointer into this structure, and introduce a new utsname
    namespace into the nsproxy.

    The vserver and openvz functionality, then, would be implemented in large part
    by virtualizing/isolating more and more resources into namespaces, each
    contained in the nsproxy.

    [akpm@osdl.org: build fix]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn