17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • The comments in zap_pid_ns_processes() are not clear, we need to explain
    how this code actually works.

    1. "Ignore SIGCHLD" looks like optimization but it is not, we also
    need this for correctness.

    2. The comment above sys_wait4() could tell more.

    EXIT_ZOMBIE child is only possible if it has exited before we
    ignored SIGCHLD. Or if it is traced from the parent namespace,
    but in this case it will be reaped by debugger after detach,
    sys_wait4() acts as a synchronization point.

    3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
    outdated. Contrary to what it says we do not need to make sure
    they all go away after 0a01f2cc390e "pidns: Make the pidns proc
    mount/umount logic obvious".

    At the same time, we do need to wait for nr_hashed==init_pids,
    but the reasons are quite different and not obvious: setns().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Dec, 2014

5 commits


03 Apr, 2014

1 commit

  • pidns_get()->get_pid_ns() can hit ns == NULL. This task_struct can't
    go away, but task_active_pid_ns(task) is NULL if release_task(task)
    was already called. Alternatively we could change get_pid_ns(ns) to
    check ns != NULL, but it seems that other callers are fine.

    Signed-off-by: Oleg Nesterov
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 Oct, 2013

1 commit


08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit


28 Aug, 2013

1 commit


02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

01 May, 2013

1 commit

  • Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
    simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

    [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
    Signed-off-by: Raphael S.Carvalho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S.Carvalho
     

26 Mar, 2013

1 commit

  • When a multi-threaded init exits and the initial thread is not the
    last thread to exit, the initial thread hangs around as a zombie
    until the last thread exits. In that case zap_pid_ns_processes
    needs to wait until there are only 2 hashed pids in the pid
    namespace, not one.

    v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me)
    as suggested by Oleg.

    Cc: stable@vger.kernel.org
    Cc: Oleg Nesterov
    Reported-by: Caj Larsson
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
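
    The wait condition described in this entry can be modelled in a few
    lines. This is a toy userspace sketch, not the kernel code: the names
    toy_init_pids and toy_zap_done are hypothetical, and the flag argument
    stands in for the thread_group_leader(me) test mentioned above.

```c
#include <assert.h>

/*
 * Toy model (hypothetical names): how many pids are still hashed once
 * only the exiting init is left. If the caller is not the group leader,
 * the zombie leader still occupies one extra pid.
 */
static int toy_init_pids(int caller_is_group_leader)
{
    return caller_is_group_leader ? 1 : 2;
}

/* Returns 1 when the exiting init may stop waiting, 0 otherwise. */
static int toy_zap_done(int nr_hashed, int caller_is_group_leader)
{
    assert(nr_hashed >= 1);
    return nr_hashed == toy_init_pids(caller_is_group_leader);
}
```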
     

26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Dec, 2012

1 commit

  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user namespace of the target namespace.

    Which made the following nasty sequence possible.

    pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
    if (pid == 0) { /* child */
        system("mount --bind /home/me/passwd /etc/passwd");
    }
    else if (pid != 0) { /* parent */
        char path[PATH_MAX];
        snprintf(path, sizeof(path), "/proc/%u/ns/mnt", pid);
        fd = open(path, O_RDONLY);
        setns(fd, 0);
        system("su -");
    }

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joining all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
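
    The userspace test this entry enables can be sketched as a comparison
    of inode identity. This is a minimal sketch, assuming the namespace
    files are reachable as /proc/<pid>/ns/<name>; the helper name
    same_ns_inode is hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either path cannot be stat()ed.
 */
static int same_ns_inode(const char *path_a, const char *path_b)
{
    struct stat a, b;

    assert(path_a && path_b);
    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

    A caller would pass two namespace paths, for example
    same_ns_inode("/proc/self/ns/pid", "/proc/1234/ns/pid"), where 1234 is
    a placeholder pid. Comparing st_dev as well as st_ino guards against
    equal inode numbers on different filesystems.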
     

19 Nov, 2012

6 commits

  • Unsharing of the pid namespace unlike unsharing of other namespaces
    does not take effect immediately. Instead it affects the children
    created with fork and clone. The first of these children becomes the init
    process of the new pid namespace, the rest become oddball children
    of pid 0. From the point of view of the new pid namespace the process
    that created it is pid 0, as its pid does not map.

    A couple of different semantics were considered but this one was
    settled on because it is easy to implement and it is usable from
    pam modules, the core reason for the existence of unshare.

    I took a survey of the callers of pam modules and the following
    appears to be a representative sample of their logic.
    {
        setup stuff include pam
        child = fork();
        if (!child) {
            setuid()
            exec /bin/bash
        }
        waitpid(child);

        pam and other cleanup
    }

    As you can see there is a fork to create the unprivileged user
    space process. Which means that the unprivileged user space
    process will appear as pid 1 in the new pid namespace. Further
    most login processes do not cope with extraneous children which
    means shifting the duty of reaping extraneous child process to
    the creator of those extraneous children makes the system more
    comprehensible.

    The practical reason for this set of pid namespace semantics is
    that it is simple to implement and verify they work correctly.
    Whereas an implementation that requires changing the struct
    pid on a process comes with a lot more races and pain. Not
    the least of which is that glibc caches getpid().

    These semantics are implemented by having two notions
    of the pid namespace of a process. There is task_active_pid_ns
    which is the pid namespace the process was created with
    and the pid namespace that all pids are presented to
    that process in. The task_active_pid_ns is stored
    in the struct pid of the task.

    Then there is the pid namespace that will be used for children;
    that pid namespace is stored in task->nsproxy->pid_ns.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
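
    The semantics described in this entry can be observed from userspace
    roughly as follows. This is a sketch under assumptions: the function
    name is hypothetical, the raw unshare system call is invoked through
    syscall(2) so the example does not depend on feature-test macros, and
    the caller needs sufficient privilege (typically CAP_SYS_ADMIN) for
    the unshare to succeed.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_NEWPID */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the first child forked after unshare(CLONE_NEWPID) sees
 * itself as pid 1, 0 if it sees another pid, and -1 if the experiment
 * could not be run (for example, missing privilege).
 * A helper process does the unshare so the caller is left untouched.
 */
static int first_child_pid_after_unshare(void)
{
    int status;
    pid_t helper = fork();

    if (helper < 0)
        return -1;
    if (helper == 0) {
        pid_t child;

        if (syscall(SYS_unshare, CLONE_NEWPID) != 0)
            _exit(255);                 /* could not unshare */
        child = fork();                 /* first child: init of the new pidns */
        if (child < 0)
            _exit(255);
        if (child == 0)
            _exit(getpid() == 1 ? 1 : 254);
        if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
            _exit(255);
        _exit(WEXITSTATUS(status));     /* forward the child's verdict */
    }
    assert(helper > 0);
    if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 1)
        return 1;
    return WEXITSTATUS(status) == 254 ? 0 : -1;
}
```

    The helper itself keeps its original pid after the unshare; only the
    child it forks afterwards lands in the new pid namespace.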
     
  • - Pid namespaces are designed to be inescapable so verify that the
    passed in pid namespace is a child of the currently active
    pid namespace or the currently active pid namespace itself.

    Allowing the currently active pid namespace is important so
    the effects of an earlier setns can be cancelled.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
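
    The descendant check described in this entry can be sketched with a
    toy data structure. This is a userspace model, not the kernel
    implementation: struct toy_pid_ns and toy_pidns_may_join are
    hypothetical names, and each namespace is assumed to record its parent
    and its nesting level.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_pid_ns {
    struct toy_pid_ns *parent;  /* NULL for the initial namespace */
    unsigned int level;         /* 0 for the initial namespace */
};

/*
 * Returns 0 if target is the active namespace or one of its descendants,
 * -EINVAL otherwise.
 */
static int toy_pidns_may_join(const struct toy_pid_ns *active,
                              const struct toy_pid_ns *target)
{
    const struct toy_pid_ns *ancestor = target;

    assert(active && target);
    if (target->level < active->level)
        return -EINVAL;             /* target is shallower: not a descendant */
    while (ancestor->level > active->level)
        ancestor = ancestor->parent; /* climb to the active namespace's level */
    return ancestor == active ? 0 : -EINVAL;
}
```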
     
  • task_active_pid_ns(current) != current->ns_proxy->pid_ns will
    soon be allowed to support unshare and setns.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
    we create a child pid namespace of current->ns_proxy->pid_ns. However
    that leads to strange cases like trying to have a single process be
    init in multiple pid namespaces, which is racy and hard to think
    about.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
    we create a child pid namespace of task_active_pid_ns(current). While
    that seems less racy it does not provide any utility.

    Therefore define the semantics of creating a child pid namespace when
    task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
    pid namespace creation fails. That is easy to implement and easy
    to think about.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Looking at pid_ns->nr_hashed is a bit simpler and it works for
    disjoint process trees that an unshare or a join of a pid_namespace
    may create.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Track the number of pids in the proc hash table. When the number of
    pids goes to 0 schedule work to unmount the kernel mount of proc.

    Move the mount of proc into alloc_pid when we allocate the pid for
    init.

    Remove the surprising calls of pid_ns_release proc in fork and
    proc_flush_task. Those code paths really shouldn't know about proc
    namespace implementation details and people have demonstrated several
    times that finding and understanding those code paths is difficult and
    non-obvious.

    Because of the call path detach pid is always called with the
    rtnl_lock held, free_pid is not allowed to sleep, so the work of
    unmounting proc is moved to a work queue. This has the side benefit
    of not blocking the entire world waiting for the unnecessary
    rcu_barrier in deactivate_locked_super.

    In the process of making the code clear and obvious this fixes a bug
    reported by Gao feng where we would leak a
    mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
    succeeded and copy_net_ns failed.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Capture the user namespace that creates the pid namespace
    - Use that user namespace to test if it is ok to write to
    /proc/sys/kernel/ns_last_pid.

    Zhao Hongjiang noticed I was missing a put_user_ns
    when destroying a pid_ns. I have folded his patch into this one
    so that bisects will work properly.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Oct, 2012

1 commit

  • 'struct pid' is a "variable sized struct" - a header with an array of
    upids at the end.

    The size of the array depends on a level (depth) of pid namespaces. Now a
    level of pidns is not limited, so 'struct pid' can be more than one page.

    It looks reasonable that it should be less than a page. MAX_PID_NS_LEVEL is
    not calculated from PAGE_SIZE, because in this case it depends on
    architectures, config options and it will be reduced, if someone adds a
    new field in struct pid or struct upid.

    I suggest to set MAX_PID_NS_LEVEL = 32, because it saves ability to expand
    "struct pid" and it's more than enough for all use cases known to me.
    When someone finds a reasonable use case, we can add a config option or a
    sysctl parameter.

    In addition it will reduce the effect of another problem, when we have
    many nested namespaces and the oldest one starts dying.
    zap_pid_ns_processes will be called for each namespace and find_vpid will
    be called for each process in a namespace. find_vpid will be called
    minimum max_level^2 / 2 times. The reason of that is that when we found a
    bit in pidmap, we can't determine this pidns is top for this process or it
    isn't.

    find_vpid is a heavy operation, so a fork bomb, which creates many nested
    namespaces, can make a system inaccessible for a long time. For example my
    system becomes inaccessible for a few minutes with 4000 processes.

    [akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
    Signed-off-by: Andrew Vagin
    Acked-by: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
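
    The relationship between nesting depth and the size of the structure,
    and the limit proposed in this entry, can be illustrated with a toy
    model. The toy_ names and field layout below are assumptions made for
    the example; only the limit of 32 and the -EINVAL result come from the
    message above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PID_NS_LEVEL 32

struct toy_upid {
    int nr;      /* pid value as seen in one namespace */
    void *ns;    /* the namespace this value belongs to */
};

struct toy_pid {
    unsigned int level;
    struct toy_upid numbers[];   /* one entry per nesting level */
};

/* Size of a toy_pid living at the given nesting level (0 = initial ns). */
static size_t toy_pid_size(unsigned int level)
{
    assert(level <= TOY_MAX_PID_NS_LEVEL);
    return sizeof(struct toy_pid) + (level + 1) * sizeof(struct toy_upid);
}

/* Level of a new child namespace, or -EINVAL if nesting is too deep. */
static int toy_child_level(unsigned int parent_level)
{
    unsigned int level = parent_level + 1;

    if (level > TOY_MAX_PID_NS_LEVEL)
        return -EINVAL;
    return (int)level;
}
```

    Because the size grows by one toy_upid per level, a fixed cap on the
    level also caps the allocation size, independent of PAGE_SIZE.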
     

20 Oct, 2012

1 commit

  • free_pid_ns() operates in a recursive fashion:

    free_pid_ns(parent)
      put_pid_ns(parent)
        kref_put(&ns->kref, free_pid_ns);
          free_pid_ns

    thus if there was a huge nesting of namespaces userspace may trigger
    an avalanche of free_pid_ns calls, leading to kernel stack exhaustion and
    eventually a panic.

    This patch turns the recursion into an iterative loop.

    Based on a patch by Andrew Vagin.

    [akpm@linux-foundation.org: export put_pid_ns() to modules]
    Signed-off-by: Cyrill Gorcunov
    Cc: Andrew Vagin
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
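
    Turning the recursion into a loop can be sketched with a toy
    reference-counted chain. This is a userspace illustration, not the
    kernel patch: struct toy_ns, toy_ns_create, toy_ns_put and the
    toy_freed counter are hypothetical names, and a plain integer stands
    in for the kref.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_ns {
    struct toy_ns *parent;
    int refcount;
};

static int toy_freed;   /* number of namespaces released so far */

/* Create a namespace holding one reference on its parent (if any). */
static struct toy_ns *toy_ns_create(struct toy_ns *parent)
{
    struct toy_ns *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    ns->parent = parent;
    ns->refcount = 1;
    if (parent)
        parent->refcount++;
    return ns;
}

/*
 * Drop one reference. When a namespace dies, its reference on the parent
 * is dropped in the same loop instead of through a nested call, so stack
 * usage stays constant no matter how deep the nesting is.
 */
static void toy_ns_put(struct toy_ns *ns)
{
    while (ns) {
        struct toy_ns *parent;

        assert(ns->refcount > 0);
        if (--ns->refcount > 0)
            break;              /* still in use: stop climbing */
        parent = ns->parent;
        free(ns);
        toy_freed++;
        ns = parent;            /* continue with the parent's reference */
    }
}
```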
     

03 Oct, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting type-incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

18 Sep, 2012

1 commit

  • The kernel doesn't check the pid for negative values, so if you try to
    write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.

    The crash happens because the next pid is -1, and alloc_pidmap() will
    try to access to a nonexistent pidmap.

    map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

    Signed-off-by: Andrew Vagin
    Acked-by: Cyrill Gorcunov
    Acked-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
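
    The missing validation described in this entry amounts to a range
    check before the value is stored. The sketch below is a userspace
    model with hypothetical names (toy_set_last_pid and its parameters);
    the accepted range of 0 to pid_max is an assumption made for the
    example, not a statement of how the kernel handler is written.

```c
#include <assert.h>
#include <errno.h>

/*
 * Store a requested "last pid" only if it lies in [0, pid_max].
 * Returns 0 on success and -EINVAL for out-of-range values, leaving
 * *last_pid untouched in that case.
 */
static int toy_set_last_pid(int *last_pid, long requested, int pid_max)
{
    assert(last_pid && pid_max > 0);
    if (requested < 0 || requested > pid_max)
        return -EINVAL;
    *last_pid = (int)requested;
    return 0;
}
```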
     

15 Aug, 2012

1 commit

  • There is at least one modular user so export free_pid_ns so modules can
    capture and use the pid namespace on the very rare occasion when it
    makes sense.

    Acked-by: David S. Miller
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Jun, 2012

1 commit

  • Today we have a twofold bug. Sometimes release_task on pid == 1 in a pid
    namespace can run before other processes in a pid namespace have had
    release_task called. With the result that pid_ns_release_proc can be
    called before the last proc_flush_task() is done using upid->ns->proc_mnt,
    resulting in the use of a stale pointer. This same set of circumstances
    can lead to waitpid(...) returning for a process started with
    clone(CLONE_NEWPID) before every process in the pid namespace has
    actually exited.

    To fix this modify zap_pid_ns_processes to wait until all other processes in
    the pid namespace have exited, even EXIT_DEAD zombies.

    The delay_group_leader and related tests ensure that the thread group
    leader will be the last thread of a process group to be reaped, or to
    become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes
    we get the guarantee that pid == 1 in a pid namespace will be the last
    task that release_task is called on.

    With pid == 1 being the last task to pass through release_task
    pid_ns_release_proc can no longer be called too early nor can wait return
    before all of the EXIT_DEAD tasks in a pid namespace have exited.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Acked-by: Pavel Emelyanov
    Tested-by: Andrew Wagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

01 Jun, 2012

2 commits

  • For those who don't need C/R functionality there is no need to control
    last pid, ie the pid for the next fork() call.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
    that the children autoreap. This increases the parallelism and in general
    the speed of network namespace shutdown.

    Note self reaping children can exist past zap_pid_ns_processes but they
    will all be reaped before we allow the pid namespace init task with pid ==
    1 to be reaped.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
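
    The autoreap effect of ignoring SIGCHLD is also visible from ordinary
    userspace, which gives a rough feel for what this entry relies on. The
    sketch below is an illustration only; child_is_autoreaped is a
    hypothetical helper, and it demonstrates the documented Linux
    behaviour of waitpid() when SIGCHLD is set to SIG_IGN.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child that exits while SIGCHLD is ignored leaves no
 * zombie behind (waitpid() fails with ECHILD), 0 if it could still be
 * waited for, and -1 on setup errors.
 */
static int child_is_autoreaped(void)
{
    void (*saved)(int) = signal(SIGCHLD, SIG_IGN);
    pid_t pid, ret;
    int err;

    if (saved == SIG_ERR)
        return -1;
    pid = fork();
    if (pid < 0) {
        signal(SIGCHLD, saved);
        return -1;
    }
    if (pid == 0)
        _exit(0);                    /* child exits right away */
    assert(pid > 0);
    ret = waitpid(pid, NULL, 0);     /* blocks until the child is gone */
    err = errno;
    signal(SIGCHLD, saved);          /* restore the previous disposition */
    return ret == -1 && err == ECHILD;
}
```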
     

29 Mar, 2012

1 commit

  • In the case of a child pid namespace, rebooting the system does not really
    make sense. When the pid namespace is used in conjunction with the other
    namespaces in order to create a linux container, the reboot syscall leads
    to some problems.

    A container can reboot the host. That can be fixed by dropping the
    sys_reboot capability but we are unable to correctly poweroff/
    halt/reboot a container and the container stays stuck at the shutdown time
    with the container's init process waiting indefinitely.

    After several attempts, no solution from userspace was found to reliably
    handle the shutdown from a container.

    This patch proposes to make the init process of the child pid namespace
    exit with a signal status set to: SIGINT if the child pid namespace
    called "halt/poweroff" and SIGHUP if the child pid namespace called
    "reboot". When the reboot syscall is called and we are not in the initial
    pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
    "RESTART", and "RESTART2". Otherwise we return EINVAL.

    Returning EINVAL is also an easy way to check if this feature is supported
    by the kernel when invoking another 'reboot' option like CAD.

    This way the parent process of the child pid namespace knows if it
    rebooted or not and can take the right decision.

    Test case:
    ==========

    #define _GNU_SOURCE     /* for clone() */
    #include <alloca.h>
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/reboot.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include <linux/reboot.h>

    /* Runs as init of the new pid namespace and asks for a reboot. */
    static int do_reboot(void *arg)
    {
        int *cmd = arg;

        if (reboot(*cmd))
            printf("failed to reboot(%d): %m\n", *cmd);
        return 0;
    }

    /* Returns 0 if the child was killed by signal sig, -1 otherwise. */
    int test_reboot(int cmd, int sig)
    {
        long stack_size = 4096;
        void *stack = alloca(stack_size) + stack_size;
        int status;
        pid_t ret;

        ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
        if (ret < 0) {
            printf("failed to clone: %m\n");
            return -1;
        }

        if (wait(&status) < 0) {
            printf("unexpected wait error: %m\n");
            return -1;
        }

        if (!WIFSIGNALED(status)) {
            printf("child process exited but was not signaled\n");
            return -1;
        }

        if (WTERMSIG(status) != sig) {
            printf("signal termination is not the one expected\n");
            return -1;
        }

        return 0;
    }

    int main(int argc, char *argv[])
    {
        int status;

        status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeed\n");

        /* CAD_ON is not handled in a child pid namespace: expect failure. */
        status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
        if (status >= 0) {
            printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
            return 1;
        }
        printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

        return 0;
    }

    [akpm@linux-foundation.org: tweak and add comments]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Daniel Lezcano
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reviewed-by: Oleg Nesterov
    Cc: Michael Kerrisk
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
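
    On the parent side, the signal convention described in this entry can
    be decoded from the wait status. The enum and function names below are
    hypothetical; the mapping of SIGINT to halt/poweroff and SIGHUP to
    reboot is taken from the message above. A sketch, assuming the status
    comes from wait() or waitpid() on the container's init:

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

enum toy_shutdown {
    TOY_SHUTDOWN_OTHER,    /* normal exit or an unrelated signal */
    TOY_SHUTDOWN_HALT,     /* SIGINT: halt or poweroff was requested */
    TOY_SHUTDOWN_REBOOT    /* SIGHUP: reboot was requested */
};

/* Classify how the init of a child pid namespace terminated. */
static enum toy_shutdown toy_classify_init_exit(int status)
{
    if (!WIFSIGNALED(status))
        return TOY_SHUTDOWN_OTHER;
    switch (WTERMSIG(status)) {
    case SIGINT:
        return TOY_SHUTDOWN_HALT;
    case SIGHUP:
        return TOY_SHUTDOWN_REBOOT;
    default:
        return TOY_SHUTDOWN_OTHER;
    }
}
```

    The status alone cannot tell a requested halt from an init that was
    killed with SIGINT for another reason, so a container manager may want
    additional context before acting on the result.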
     

24 Mar, 2012

1 commit

  • Change zap_pid_ns_processes() to use SEND_SIG_FORCED, it looks more
    clear compared to SEND_SIG_NOINFO which relies on from_ancestor_ns logic
    in send_signal().

    It is also more efficient if we need to kill a lot of tasks because it
    doesn't alloc sigqueue.

    While at it, add the __fatal_signal_pending(task) check as a minor
    optimization.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Jan, 2012

1 commit

  • The sysctl works on the current task's pid namespace, getting and setting
    its last_pid field.

    Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible
    to create a task with desired pid value. This ability is required badly
    for the checkpoint/restore in userspace.

    This approach suits all the parties for now.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
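
    A restore tool could use the sysctl roughly as follows. This is a
    sketch under assumptions: toy_request_next_pid is a hypothetical
    helper, the path is passed in by the caller (normally
    /proc/sys/kernel/ns_last_pid), and writing wanted_pid - 1 only makes
    the next fork() likely to receive wanted_pid; the pid must be free and
    nothing else may fork in between.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write wanted_pid - 1 to the given sysctl file so that the next pid
 * handed out in the current pid namespace is expected to be wanted_pid.
 * Returns 0 on success, -1 on any error.
 */
static int toy_request_next_pid(const char *sysctl_path, int wanted_pid)
{
    char buf[32];
    int fd, len;
    ssize_t written;

    assert(sysctl_path);
    if (wanted_pid < 1)
        return -1;
    len = snprintf(buf, sizeof(buf), "%d", wanted_pid - 1);
    fd = open(sysctl_path, O_WRONLY);
    if (fd < 0)
        return -1;
    written = write(fd, buf, (size_t)len);
    close(fd);
    return written == len ? 0 : -1;
}
```

    The caller would follow a successful call with fork() and then check
    that the child really received the requested pid.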
     

24 Mar, 2011

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

1 commit

  • zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be
    delivered to sub-namespace inits as well. This is correct, but we are
    going to change force_sig_info() semantics. See
    http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31

    We can use send_sig_info(SEND_SIG_NOINFO) instead, since
    614c517d7c00af1b26ded20646b329397d6f51a1 ("signals: SEND_SIG_NOINFO should
    be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and
    therefore send_signal() will get the correct from_ancestor_ns = T flag.

    Signed-off-by: Oleg Nesterov
    Acked-by: Serge Hallyn
    Acked-by: Linus Torvalds
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

24 Sep, 2009

1 commit

  • CLONE_PARENT was used to implement an older threading model. For
    consistency with the CLONE_THREAD check in copy_pid_ns(), disable
    CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
    pid namespaces are clear.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Roland McGrath
    Acked-by: Serge Hallyn
    Cc: Oren Laadan
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu