23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and now there is not way
    to discover these relationships.

    Pid and user namepaces are hierarchical. There is no way to discover
    parent-child relationships too.

    Why we may want to know relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use-case (which usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determing
    relationships between namespaces.

    Eric suggested to add two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementaions of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user names‐
    pace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process
    current user namespace.

    v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when a the per user per user
    namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it we made clear that those were
    the wrong error code, but a correct effor code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2016

1 commit


03 Aug, 2016

1 commit


24 Jun, 2016

1 commit

  • Allow the ipc namespace initialization code to depend on ns->user_ns
    being set during initialization.

    In particular this allows mq_init_ns to use ns->user_ns for permission
    checks and initializating s_user_ns while the the mq filesystem is
    being mounted.

    Acked-by: Seth Forshee
    Suggested-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: increase MSGMNI to the maximum supported.

    And: If we ignore the risk of locking too much memory, then an automatic
    scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

    The code preserves auto_msgmni to avoid breaking any user space applications
    that expect that the value exists.

    Notes:
    1) If an administrator must limit the memory allocations, then he can set
    MSGMNI as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
    to control latency vs. throughput:
    If MSGMNB is large, then msgsnd() just returns and more messages can be queued
    before a task switch to a task that calls msgrcv() is forced.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

05 Dec, 2014

5 commits


30 Jul, 2014

1 commit

  • The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.

    Upon inspect nsproxy no longer needs rcu protection for remote reads.
    remote reads are rare. So optimize for same process reads and write
    by switching using rask_lock instead.

    This yields a simpler to understand lock, and a faster setns system call.

    In particular this fixes a performance regression observed
    by Rafael David Tinoco .

    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007. The race this originialy fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Sep, 2013

2 commits

  • After previous cleanups and optimizations, this function is no longer
    heavily used and we don't have a good reason to keep it. Update the few
    remaining callers and get rid of it.

    Signed-off-by: Davidlohr Bueso
    Cc: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Since in some situations the lock can be shared for readers, we shouldn't
    be calling it a mutex, rename it to rwsem.

    Signed-off-by: Davidlohr Bueso
    Tested-by: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

31 Aug, 2013

1 commit


02 May, 2013

1 commit


15 Dec, 2012

1 commit

  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user nameapce of the targed namespace.

    Which made the following nasty sequence possible.

    pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
    if (pid == 0) { /* child */
    system("mount --bind /home/me/passwd /etc/passwd");
    }
    else if (pid != 0) { /* parent */
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
    fd = open(path, O_RDONLY);
    setns(fd, 0);
    system("su -");
    }

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joing all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

3 commits

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowd by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Modify create_new_namespaces to explicitly take a user namespace
    parameter, instead of implicitly through the task_struct.

    This allows an implementation of unshare(CLONE_NEWUSER) where
    the new user namespace is not stored onto the current task_struct
    until after all of the namespaces are created.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Push the permission check from the core setns syscall into
    the setns install methods where the user namespace of the
    target namespace can be determined, and used in a ns_capable
    call.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Apr, 2012

1 commit


11 May, 2011

1 commit


26 Mar, 2011

1 commit

  • commit b515498 ("userns: add a user namespace owner of ipc ns") added a
    user namespace owner of ipc ns, but it also introduced a use after free in
    free_ipc_ns().

    Signed-off-by: Xiaotian Feng
    Acked-by: "Serge E. Hallyn"
    Acked-by: David Howells
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

24 Mar, 2011

2 commits

  • CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
    because the resource comes from current's own ipc namespace.

    setuid/setgid are to uids in own namespace, so again checks can be against
    current_user_ns().

    Changelog:
    Jan 11: Use task_ns_capable() in place of sched_capable().
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Clarify (hopefully) some logic in futex and sched.c
    Feb 15: use ns_capable for ipc, not nsown_capable
    Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
    Feb 23: pass ns down rather than taking it from current

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 15: Don't set new ipc->user_ns if we didn't create a new
    ipc_ns.
    Feb 23: Move extern declaration to ipc_namespace.h, and group
    fwd declarations at top.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

19 Jun, 2009

3 commits


07 Apr, 2009

2 commits

  • Implement multiple mounts of the mqueue file system, and link it to usage
    of CLONE_NEWIPC.

    Each ipc ns has a corresponding mqueuefs superblock. When a user does
    clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
    internal mount of a new mqueuefs sb linked to the new ipc ns.

    When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
    mqueuefs superblock.

    Posix message queues can be worked with both through the mq_* system calls
    (see mq_overview(7)), and through the VFS through the mqueue mount. Any
    usage of mq_open() and friends will work with the acting task's ipc
    namespace. Any actions through the VFS will work with the mqueuefs in
    which the file was created. So if a user doesn't remount mqueuefs after
    unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
    /dev/mqueue".

    If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
    ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
    ipc_ns:1 will be freed, (2) it's superblock will live on until task b
    umounts the corresponding mqueuefs, and vfs actions will continue to
    succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
    the deceased ipc_ns:1.

    To make this happen, we must protect the ipc reference count when

    a) a task exits and drops its ipcns->count, since it might be dropping
    it to 0 and freeing the ipcns

    b) a task accesses the ipcns through its mqueuefs interface, since it
    bumps the ipcns refcount and might race with the last task in the ipcns
    exiting.

    So the kref is changed to an atomic_t so we can use
    atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
    through ns = mqueuefs_sb->s_fs_info is protected by the same lock.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Move mqueue vfsmount plus a few tunables into the ipc_namespace struct.
    The CONFIG_IPC_NS boolean and the ipc_namespace struct will serve both the
    posix message queue namespaces and the SYSV ipc namespaces.

    The sysctl code will be fixed separately in patch 3. After just this
    patch, making a change to posix mqueue tunables always changes the values
    in the initial ipc namespace.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

29 Apr, 2008

3 commits

  • Introduce a notification mechanism that aims at recomputing msgmni each time
    an ipc namespace is created or removed.

    The ipc namespace notifier chain already defined for memory hotplug management
    is used for that purpose too.

    Each time a new ipc namespace is allocated or an existing ipc namespace is
    removed, the ipcns notifier chain is notified. The callback routine for each
    registered ipc namespace is then activated in order to recompute msgmni for
    that namespace.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • Introduce the registration of a callback routine that recomputes msg_ctlmni
    upon memory add / remove.

    A single notifier block is registered in the hotplug memory chain for all the
    ipc namespaces.

    Since the ipc namespaces are not linked together, they have their own
    notification chain: one notifier_block is defined per ipc namespace.

    Each time an ipc namespace is created (removed) it registers (unregisters) its
    notifier block in (from) the ipcns chain. The callback routine registered in
    the memory chain invokes the ipcns notifier chain with the IPCNS_LOWMEM event.
    Each callback routine registered in the ipcns namespace, in turn, recomputes
    msgmni for the owning namespace.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • Since all the namespaces see the same amount of memory (the total one) this
    patch introduces a new variable that counts the ipc namespaces and divides
    msg_ctlmni by this counter.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     

09 Feb, 2008

3 commits

  • sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
    ipc_namespace is released to free all ipcs of each type. But in fact, they
    do the same thing: they loop around all ipcs to free them individually by
    calling a specific routine.

    This patch proposes to consolidate this by introducing a common function,
    free_ipcs(), that do the job. The specific routine to call on each
    individual ipcs is passed as parameter. For this, these ipc-specific
    'free' routines are reworked to take a generic 'struct ipc_perm' as
    parameter.

    Signed-off-by: Pierre Peiffer
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (3 for
    msg, sem and shm, structure used to store all ipcs) These 'struct ipc_ids'
    are dynamically allocated for each icp_namespace as the ipc_namespace
    itself (for the init namespace, they are initialized with pointers to
    static variables instead)

    It is so for historical reason: in fact, before the use of idr to store the
    ipcs, the ipcs were stored in tables of variable length, depending of the
    maximum number of ipc allowed. Now, these 'struct ipc_ids' have a fixed
    size. As they are allocated in any cases for each new ipc_namespace, there
    is no gain of memory in having them allocated separately of the struct
    ipc_namespace.

    This patch proposes to make this table static in the struct ipc_namespace.
    Thus, we can allocate all in once and get rid of all the code needed to
    allocate and free these ipc_ids separately.

    Signed-off-by: Pierre Peiffer
    Acked-by: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Currently the IPC namespace management code is spread over the ipc/*.c files.
    I moved this code into ipc/namespace.c file which is compiled out when needed.

    The linux/ipc_namespace.h file is used to store the prototypes of the
    functions in namespace.c and the stubs for NAMESPACES=n case. This is done
    so, because the stub for copy_ipc_namespace requires the knowledge of the
    CLONE_NEWIPC flag, which is in sched.h. But the linux/ipc.h file itself in
    included into many many .c files via the sys.h->sem.h sequence so adding the
    sched.h into it will make all these .c depend on sched.h which is not that
    good. On the other hand the knowledge about the namespaces stuff is required
    in 4 .c files only.

    Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
    msg.c and shm.c files. It turned out that moving these functions into
    namespaces.c is not that easy because they use many other calls and macros
    from the original file. Moving them would make this patch complicated. On
    the other hand all these functions can be consolidated, so I will send a
    separate patch doing this a bit later.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov