23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and now there is not way
    to discover these relationships.

    Pid and user namepaces are hierarchical. There is no way to discover
    parent-child relationships too.

    Why we may want to know relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use-case (which usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determing
    relationships between namespaces.

    Eric suggested to add two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementaions of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user names‐
    pace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process
    current user namespace.

    v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when a the per user per user
    namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it we made clear that those were
    the wrong error code, but a correct effor code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2016

1 commit


05 Dec, 2014

5 commits


30 Jul, 2014

1 commit

  • The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.

    Upon inspect nsproxy no longer needs rcu protection for remote reads.
    remote reads are rare. So optimize for same process reads and write
    by switching using rask_lock instead.

    This yields a simpler to understand lock, and a faster setns system call.

    In particular this fixes a performance regression observed
    by Rafael David Tinoco .

    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007. The race this originialy fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

31 Aug, 2013

1 commit


02 May, 2013

1 commit


28 Feb, 2013

1 commit


15 Dec, 2012

1 commit

  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user nameapce of the targed namespace.

    Which made the following nasty sequence possible.

    pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
    if (pid == 0) { /* child */
    system("mount --bind /home/me/passwd /etc/passwd");
    }
    else if (pid != 0) { /* parent */
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
    fd = open(path, O_RDONLY);
    setns(fd, 0);
    system("su -");
    }

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joing all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

3 commits

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowd by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Modify create_new_namespaces to explicitly take a user namespace
    parameter, instead of implicitly through the task_struct.

    This allows an implementation of unshare(CLONE_NEWUSER) where
    the new user namespace is not stored onto the current task_struct
    until after all of the namespaces are created.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Push the permission check from the core setns syscall into
    the setns install methods where the user namespace of the
    target namespace can be determined, and used in a ns_capable
    call.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Apr, 2012

1 commit


31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include
    +#include

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

11 May, 2011

1 commit


24 Mar, 2011

2 commits

  • Changelog:
    Feb 23: let clone_uts_ns() handle setting uts->user_ns
    To do so we need to pass in the task_struct who'll
    get the utsname, so we can get its user_ns.
    Feb 23: As per Oleg's coment, just pass in tsk, instead of two
    of its members.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may be unuseful.

    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do. They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initila user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

19 Jun, 2009

1 commit


24 Aug, 2008

1 commit


29 Apr, 2008

1 commit


20 Sep, 2007

1 commit

  • struct utsname is copied from master one without any exclusion.

    Here is sample output from one proggie doing

    sethostname("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
    sethostname("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb");

    and another

    clone(,, CLONE_NEWUTS, ...)
    uname()

    hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaabbbbb'
    hostname = 'bbbaaaaaaaaaaaaaaaaaaaaaaaaaaa'
    hostname = 'aaaaaaaabbbbbbbbbbbbbbbbbbbbbb'
    hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaabbbb'
    hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaabb'
    hostname = 'aaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
    hostname = 'bbbbbbbbbbbbbbbbaaaaaaaaaaaaaa'

    Hostname is sometimes corrupted.

    Yes, even _the_ simplest namespace activity had bug in it. :-(

    Signed-off-by: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

17 Jul, 2007

2 commits

  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int. Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • dup_mnt_ns() and clone_uts_ns() return NULL on failure. This is wrong,
    create_new_namespaces() uses ERR_PTR() to catch an error. This means that the
    subsequent create_new_namespaces() will hit BUG_ON() in copy_mnt_ns() or
    copy_utsname().

    Modify create_new_namespaces() to also use the errors returned by the
    copy_*_ns routines and not to systematically return ENOMEM.

    [oleg@tv-sign.ru: better changelog]
    Signed-off-by: Cedric Le Goater
    Cc: Serge E. Hallyn
    Cc: Badari Pulavarty
    Cc: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

09 May, 2007

1 commit

  • sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
    just incase.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

02 Oct, 2006

2 commits

  • Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare.

    [clg@fr.ibm.com: IPC unshare fix]
    [bunk@stusta.de: cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Adrian Bunk
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch defines the uts namespace and some manipulators.
    Adds the uts namespace to task_struct, and initializes a
    system-wide init namespace.

    It leaves a #define for system_utsname so sysctl will compile.
    This define will be removed in a separate patch.

    [akpm@osdl.org: build fix, cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn