19 Apr, 2011

1 commit

  • next_pidmap() just quietly accepted whatever 'last' pid that was passed
    in, which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
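
    A minimal user-space sketch of the range check described above. The
    constants and the helper name sketch_first_map_index() are illustrative
    assumptions, not the kernel's definitions; the point is only that
    rejecting 'last' before the index arithmetic keeps "last + 1" at most
    one entry past the end of the array.

```c
/* Assumed, illustrative values; the kernel derives these from its config. */
#define SKETCH_PID_MAX_LIMIT   (4U * 1024U * 1024U)
#define SKETCH_BITS_PER_PAGE   (4096U * 8U)
#define SKETCH_PIDMAP_ENTRIES \
	((SKETCH_PID_MAX_LIMIT + SKETCH_BITS_PER_PAGE - 1U) / SKETCH_BITS_PER_PAGE)

/*
 * Hypothetical helper: return the index of the pidmap entry that holds
 * pid 'last + 1', or -1 when 'last' is outside the valid pid range.
 */
static int sketch_first_map_index(unsigned int last)
{
	/* Reject out-of-range input before doing any index arithmetic. */
	if (last >= SKETCH_PID_MAX_LIMIT)
		return -1;

	/*
	 * last + 1 is at most SKETCH_PID_MAX_LIMIT here, so the index is at
	 * most SKETCH_PIDMAP_ENTRIES: one past the last entry, which a
	 * "map < end" loop bound rejects without dereferencing anything.
	 */
	return (int)((last + 1U) / SKETCH_BITS_PER_PAGE);
}
```

    With these assumed values the largest accepted 'last' maps to index
    SKETCH_PIDMAP_ENTRIES, and anything larger is refused outright.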

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Mar, 2011

1 commit


20 Aug, 2010

2 commits

  • find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to
    commit 3120438 "rcu: Disable lockdep checking in RCU list-traversal primitives",
    we are currently unable to catch "find_task_by_vpid() with tasklist_lock held
    but RCU lock not held" errors due to the RCU-lockdep checks being
    suppressed in the RCU variants of the struct list_head traversals.
    This commit therefore places an explicit check for being in an RCU
    read-side critical section in find_task_by_pid_ns().

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    kernel/pid.c:386 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    1 lock held by rc.sysinit/1102:
    #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x40/0x160

    stack backtrace:
    Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
    Call Trace:
    [] lockdep_rcu_dereference+0x94/0xb0
    [] find_task_by_pid_ns+0x6d/0x70
    [] find_task_by_vpid+0x18/0x20
    [] sys_setpgid+0x47/0x160
    [] sysenter_do_call+0x12/0x36

    Commit updated to use a new rcu_lockdep_assert() exported API rather than
    the old internal __do_rcu_dereference().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Tetsuo Handa
     
  • This avoids warnings from missing __rcu annotations
    in the rculist implementation, making it possible to
    use the same lists in both RCU and non-RCU cases.

    We can add rculist annotations later, together with
    lockdep support for rculist, which is missing as well,
    but that may involve changing all the users.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     

11 Aug, 2010

2 commits

  • alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
    inspect the first map->page twice. This is correct, we want to find the
    unused bits < offset in this bitmap block. Add the comment.

    But it doesn't make any sense to stop the find_next_offset() loop when we
    are looking into this map->page for the second time. We have already
    checked the bits >= offset during the first attempt, but it is fine to
    do this again, no matter if we succeed this time or not.

    Remove this hard-to-understand code. It optimizes the very unlikely case
    when we are going to fail, but slows down the more likely case.
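
    A small sketch of the max_scan arithmetic discussed above, assuming
    4 KiB pages and a scan loop that runs for i = 0..max_scan inclusive.
    The helper names are hypothetical; this only illustrates why a nonzero
    starting offset yields one extra visit to the first page.

```c
/* Assumed page geometry: 4 KiB pages, one bit per pid. */
#define SKETCH_BITS_PER_PAGE (4096 * 8)

/* Number of extra loop iterations beyond the first one. */
static int sketch_max_scan(int pid_max, int offset)
{
	int pages = (pid_max + SKETCH_BITS_PER_PAGE - 1) / SKETCH_BITS_PER_PAGE;

	/* Starting at offset 0 covers the first page fully in one visit. */
	return pages - (offset == 0);
}

/* Total bitmap pages visited when looping i = 0..max_scan inclusive. */
static int sketch_page_visits(int pid_max, int offset)
{
	return sketch_max_scan(pid_max, offset) + 1;
}
```

    With a single-page bitmap, a nonzero offset gives two visits: the first
    covers bits >= offset, the second wraps around to the bits < offset.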

    Signed-off-by: Oleg Nesterov
    Cc: Salman Qazi
    Cc: Ingo Molnar
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A program that repeatedly forks and waits is susceptible to having the
    same pid repeated, especially when it competes with another instance of
    the same program. This is really bad for the bash implementation.
    Furthermore, many shell scripts assume that pid numbers will not be
    reused for some length of time.

    Race Description:

    A                                     B

    // pid == offset == n                 // pid == offset == n + 1
    test_and_set_bit(offset, map->page)
                                          test_and_set_bit(offset, map->page);
                                          pid_ns->last_pid = pid;
    pid_ns->last_pid = pid;
                                          // pid == n + 1 is freed (wait())

                                          // Next fork()...
                                          last = pid_ns->last_pid; // == n
                                          pid = last + 1;
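
    One way to close this kind of race is to let last_pid only move
    forward, using a compare-and-swap loop. The following is a user-space
    sketch with C11 atomics under that assumption; the names are
    hypothetical and it is not presented as the exact kernel change.

```c
#include <stdatomic.h>

/*
 * Order 'a' and 'b' relative to 'base', tolerating wrap-around:
 * returns nonzero when 'a' comes before 'b' counting up from 'base'.
 */
static int sketch_pid_before(int base, int a, int b)
{
	return ((unsigned int)a - (unsigned int)base) <
	       ((unsigned int)b - (unsigned int)base);
}

/*
 * Publish 'pid' as the last allocated pid, but never move the shared
 * value backwards. 'base' is the value the allocator started from.
 */
static void sketch_set_last_pid(atomic_int *last_pid, int base, int pid)
{
	int seen = base;

	/* On failure, 'seen' is updated to the value currently stored. */
	while (!atomic_compare_exchange_strong(last_pid, &seen, pid)) {
		/* Another task already stored 'pid' or something later. */
		if (!sketch_pid_before(base, seen, pid))
			break;
	}
}
```

    In the scenario above, B stores n + 1 first; A's later attempt to store
    n sees n + 1, finds it is not before n, and leaves it in place.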

    Code to reproduce it (Running multiple instances is more effective):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // The distance mod 32768 between two pids, where the first pid is expected
    // to be smaller than the second.
    int PidDistance(pid_t first, pid_t second) {
      return (second + 32768 - first) % 32768;
    }

    int main(int argc, char* argv[]) {
      int failed = 0;
      pid_t last_pid = 0;
      int i;
      printf("%zu\n", sizeof(pid_t));
      for (i = 0; i < 10000000; ++i) {
        // Report progress once per 32768 iterations.
        if (i % 32768 == 0)
          printf("Iter: %d\n", i/32768);
        int child_exit_code = i % 256;
        pid_t pid = fork();
        if (pid == -1) {
          fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
          exit(1);
        }
        if (pid == 0) {
          // Child
          exit(child_exit_code);
        } else {
          // Parent
          if (i > 0) {
            int distance = PidDistance(last_pid, pid);
            if (distance == 0 || distance > 30000) {
              fprintf(stderr,
                      "Unexpected pid sequence: previous fork: pid=%d, "
                      "current fork: pid=%d for iteration=%d.\n",
                      last_pid, pid, i);
              failed = 1;
            }
          }
          last_pid = pid;
          int status;
          int reaped = wait(&status);
          if (reaped != pid) {
            fprintf(stderr,
                    "Wait return value: expected pid=%d, "
                    "got %d, iteration %d\n",
                    pid, reaped, i);
            failed = 1;
          } else if (WEXITSTATUS(status) != child_exit_code) {
            fprintf(stderr,
                    "Unexpected exit status %x, iteration %d\n",
                    WEXITSTATUS(status), i);
            failed = 1;
          }
        }
      }
      exit(failed);
    }

    Thanks to Ted Tso for the key ideas of this implementation.

    Signed-off-by: Salman Qazi
    Cc: Ingo Molnar
    Cc: Theodore Ts'o
    Cc: Peter Zijlstra
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salman
     

28 May, 2010

1 commit

  • On a system with a substantial number of processors, the early default
    pid_max of 32k will not be enough. On a system with 1664 CPUs, 25163
    processes are started before the login prompt. It's estimated that with
    2048 CPUs we will pass the 32k limit. With 4096, we'll reach that limit
    very early during the boot cycle, and processes would stall waiting for an
    available pid.

    This patch increases the early maximum number of pids available, and
    increases the minimum number of pids that can be set during runtime.
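
    A sketch of how such scaling could look, assuming both limits grow
    with the number of possible CPUs. The per-CPU constants (1024 and 8)
    and the helper names are assumptions chosen for illustration; the
    message above does not state the actual values.

```c
/* Assumed per-CPU budgets, for illustration only. */
#define SKETCH_PIDS_PER_CPU_DEFAULT 1024
#define SKETCH_PIDS_PER_CPU_MIN     8

static int sketch_min_int(int a, int b) { return a < b ? a : b; }
static int sketch_max_int(int a, int b) { return a > b ? a : b; }

/* Raise the early pid_max with CPU count, capped at the hard limit. */
static int sketch_scaled_pid_max(int pid_max, int pid_max_limit, int nr_cpus)
{
	int wanted = sketch_max_int(pid_max,
				    SKETCH_PIDS_PER_CPU_DEFAULT * nr_cpus);

	return sketch_min_int(pid_max_limit, wanted);
}

/* Raise the smallest value an administrator may set at runtime. */
static int sketch_scaled_pid_max_min(int pid_max_min, int nr_cpus)
{
	return sketch_max_int(pid_max_min, SKETCH_PIDS_PER_CPU_MIN * nr_cpus);
}
```

    Under these assumptions a small machine keeps the 32k default, while a
    2048-CPU machine starts with room for about two million pids.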

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Hedi Berriche
    Signed-off-by: Mike Travis
    Signed-off-by: Robin Holt
    Acked-by: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Pavel Machek
    Cc: Alan Cox
    Cc: Greg KH
    Cc: Rik van Riel
    Cc: John Stoffel
    Cc: Jack Steiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hedi Berriche
     

14 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

07 Mar, 2010

1 commit

  • tasklist_lock does protect the task and its pid, it can't go away. The
    problem is that find_pid_ns() itself is unsafe without rcu lock, it can
    race with copy_process()->free_pid(any_pid).

    Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make
    it possible to call find_task_by_pid_ns() under tasklist safely, but we
    don't do so because we are trying to get rid of the read_lock sites of
    tasklist_lock.

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Mar, 2010

1 commit

  • Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
    a good thing. This patch instead exports a function that uses
    lockdep to check whether tasklist_lock is held.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Christoph Hellwig
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

16 Dec, 2009

2 commits


22 Sep, 2009

1 commit

  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

10 Jul, 2009

1 commit


30 Jun, 2009

1 commit


19 Jun, 2009

1 commit

  • find_task_by_pid_type_ns is only used to implement find_task_by_vpid and
    find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument.
    So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use
    find_task_by_pid_ns to implement find_task_by_vpid.

    While we're at it also remove the exports for find_task_by_pid_ns and
    find_task_by_vpid - we don't have any modular callers left as the only
    modular caller of the old pre-pid-namespace find_task_by_pid (gfs2) was
    switched to pid_task which operates on a struct pid pointer instead of a
    pid_t. Given the confusion about pid_t values vs namespace that's
    generally the better option anyway and I think we're better off restricting
    modules to do it that way.

    Signed-off-by: Christoph Hellwig
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

03 Apr, 2009

2 commits

  • Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.

    task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.

    As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
    task != current, they are unsafe even under rcu lock, we can't trust
    task->group_leader without the special checks.

    And almost every helper has a callsite which needs a fix.

    Also, it is a bit annoying that the implementations of, say,
    task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".

    This patch introduces the new helper, __task_pid_nr_ns(), which is always
    safe to use, and turns all other helpers into the trivial wrappers.

    After this I'll send another patch which converts task_tgid_xxx() as well,
    they are a bit special.

    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can
    add rcu lock/unlock around, but we already have get_task_pid() which can
    be improved to handle the special pids in more reliable manner.

    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Jan, 2009

1 commit

  • Currently task_active_pid_ns is not safe to call after a task becomes a
    zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
    reading the pid namespace from the pid of the task we can trivially solve
    this problem at the cost of one extra memory read in what should be the
    same cacheline as we read the namespace from.

    When moving things around I have made task_active_pid_ns out of line
    because keeping it in pid_namespace.h would require adding includes of
    pid.h and sched.h that I don't think we want.

    This change does make task_active_pid_ns unsafe to call during
    copy_process until we attach a pid on the task_struct which seems to be a
    reasonable trade off.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

06 Jan, 2009

1 commit


26 Jul, 2008

2 commits


19 May, 2008

1 commit

  • Move rcu-protected lists from list.h into a new header file rculist.h.

    This is done because lists are a very widely used primitive structure all
    over the kernel and it's currently impossible to include other header
    files in list.h without creating circular dependencies.

    For example, list.h implements rcu-protected lists and uses rcu_dereference()
    without including rcupdate.h. It actually compiles because the users of
    rcu_dereference() are macros. Other RCU functions could be used too, but
    probably aren't because of this.

    Therefore this patch creates rculist.h, which includes rcupdate.h without
    too many changes/troubles.

    Signed-off-by: Franck Bui-Huu
    Acked-by: Paul E. McKenney
    Acked-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Franck Bui-Huu
     

30 Apr, 2008

4 commits

  • Based on Eric W. Biederman's idea.

    Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
    caller races with setpgrp()/setsid() which does detach_pid() + attach_pid().
    This can happen even if task == current.

    Introduce the new helper, change_pid(), which should be used instead. This way
    the caller always sees the special pid != NULL, either old or new.

    Also change the prototype of attach_pid(), it always returns 0 and nobody
    checks the returned value.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Based on Eric W. Biederman's idea.

    Unless task == current, without tasklist_lock held task_session()/task_pgrp()
    can return NULL if the caller races with de_thread() which switches the group
    leader.

    Change transfer_pid() to not clear old->pids[type].pid for the old leader.
    This means that its .pid can point to "nowhere", but this is already true for
    sub-threads, and the old leader is not group_leader() any longer. IOW, with
    or without this change we can't trust task's special pids unless it is the
    group leader.

    With this change the following code

    rcu_read_lock();
    task = find_task_by_xxx();
    do_something(task_pgrp(task), task_session(task));
    rcu_read_unlock();

    can't race with exec and hit the NULL pid.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There are some places that are known to operate on tasks'
    global pids only:

    * the rest_init() call (called on boot)
    * the kgdb's getthread
    * the create_kthread() (since the kthread is run in init ns)

    So use the find_task_by_pid_ns(..., &init_pid_ns) there
    and schedule the find_task_by_pid for removal.

    [sukadev@us.ibm.com: Fix warning in kernel/pid.c]
    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The callers of free_pidmap() pass 2 members of "struct upid", we can just
    pass "struct upid *" instead. Shaves off 10 bytes from pid.o.

    Also, simplify the alloc_pid's "out_free:" error path a little bit. This
    way it looks more clear which subset of pid->numbers[] we are freeing.

    Signed-off-by: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Feb, 2008

3 commits

  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • pid_vnr returns the user space pid with respect to the pid namespace the
    struct pid was allocated in. What we want before we return a pid to user
    space is the user space pid with respect to the pid namespace of current.

    pid_vnr is a very nice optimization but because it isn't quite what we want
    it is easy to use pid_vnr at times when we aren't certain the struct pid
    was allocated in our pid namespace.

    Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c, the
    parent process reported in the core dumps, and the parent process in
    get_signal_to_deliver.

    So unless the performance impact is huge, having an interface that always
    does what we want, instead of only sometimes, should be much more
    reliable and much less error prone.
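
    To illustrate the distinction, here is a user-space model of looking up
    a pid number relative to a given namespace. The struct layout (one
    number per nesting level, indexed by level) and all sketch_ names are
    assumptions made for this example.

```c
#include <stddef.h>

#define SKETCH_MAX_LEVELS 4

struct sketch_ns {
	unsigned int level;		/* 0 for the initial namespace */
};

struct sketch_upid {
	int nr;				/* pid number seen in 'ns' */
	const struct sketch_ns *ns;
};

struct sketch_pid {
	unsigned int level;		/* level of the allocating namespace */
	struct sketch_upid numbers[SKETCH_MAX_LEVELS];
};

/*
 * Return the number of 'pid' as seen from 'ns', or 0 when the pid is
 * not visible there (deeper namespace, or an unrelated one).
 */
static int sketch_pid_nr_ns(const struct sketch_pid *pid,
			    const struct sketch_ns *ns)
{
	if (pid != NULL && ns->level <= pid->level) {
		const struct sketch_upid *upid = &pid->numbers[ns->level];

		if (upid->ns == ns)
			return upid->nr;
	}
	return 0;
}
```

    Passing the caller's own namespace gives the value that is safe to
    return to user space; passing the allocating namespace models the
    narrower pid_vnr behaviour described above.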

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Just like with the user namespaces, move the namespace management code into
    the separate .c file and mark the (already existing) PID_NS option as "depend
    on NAMESPACES"

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

08 Feb, 2008

1 commit

  • The gl_owner_pid field is used to get the lock owning task by its pid, so
    do this properly, i.e. by using a struct pid pointer and the pid_task()
    function.

    pid_task() is now exported for the gfs2 module.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

15 Nov, 2007

1 commit

  • This is my trivial patch to swat innumerable little bugs with a single
    blow.

    After some intensive review (my apologies for not having gotten to this
    sooner) what we have looks like a good base to build on with the current
    pid namespace code but it is not complete, and it is still much too simple
    to find issues where the kernel does the wrong thing outside of the initial
    pid namespace.

    Until the dust settles and we are certain we have the ABI and the
    implementation is as correct as humanly possible let's keep process ID
    namespaces behind CONFIG_EXPERIMENTAL.

    Allowing us the option of fixing any ABI or other bugs we find as long as
    they are minor.

    Allowing users of the kernel to avoid those bugs simply by ensuring their
    kernel does not have support for multiple pid namespaces.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Adrian Bunk
    Cc: Jeremy Fitzhardinge
    Cc: Kir Kolyshkin
    Cc: Kirill Korotaev
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

20 Oct, 2007

7 commits

  • Since these are expanded into a call to pid_nr_ns() anyway, it's OK to move
    the whole routine out-of-line. This is a cheap way to save ~100 bytes from
    vmlinux. Together with the previous two patches, it saves half-a-kilo from
    the vmlinux.

    Un-inlining other (currently inlined) functions must be done with
    additional performance testing.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
    id, depending on which id - global or virtual - is used.

    The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
    stack to call another function - find_pid_ns(). It turned out, that this
    dereference together with the push itself cause the kernel text size to
    grow too much.

    Move all these out-of-line. Together with the previous patch this saves a
    bit less than 400 bytes from the .text section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since we've switched from using pid->nr to pid->upids->nr some
    fields on struct pid are no longer needed

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The find_task_by_something macros are used to find a task by pid,
    depending on what kind of pid is given - a global or a virtual one. All of
    them are wrappers around the most generic one - find_task_by_pid_type_ns() -
    and just substitute some args for it.

    It turned out, that dereferencing the current->nsproxy->pid_ns construction
    and pushing one more argument on the stack inline cause kernel text size to
    grow.

    This patch moves all this stuff out-of-line into kernel/pid.c. Together
    with the next patch it saves a bit less than 400 bytes from the .text
    section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Terminate all processes in a namespace when the reaper of the namespace is
    exiting. We do this by walking the pidmap of the namespace and sending
    SIGKILL to all processes.
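
    The walk can be pictured as a scan for set bits in the namespace's pid
    bitmap. The sketch below is a user-space model that collects the pids
    into an array instead of sending signals; it assumes pid 1 is the
    reaper itself and is skipped, and all names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static void sketch_set_bit(unsigned long *map, size_t bit)
{
	map[bit / SKETCH_BITS_PER_WORD] |= 1UL << (bit % SKETCH_BITS_PER_WORD);
}

static int sketch_test_bit(const unsigned long *map, size_t bit)
{
	return (int)((map[bit / SKETCH_BITS_PER_WORD] >>
		      (bit % SKETCH_BITS_PER_WORD)) & 1UL);
}

/*
 * Collect every allocated pid above 1 from a bitmap of 'nbits' bits.
 * Returns how many pids were written to 'out' (at most 'out_cap').
 */
static size_t sketch_collect_pids(const unsigned long *map, size_t nbits,
				  int *out, size_t out_cap)
{
	size_t n = 0;

	for (size_t pid = 2; pid < nbits && n < out_cap; pid++) {
		if (sketch_test_bit(map, pid))
			out[n++] = (int)pid;
	}
	return n;
}
```

    A real implementation would send the signal at the point where this
    sketch stores the pid into the output array.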

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • This will help fixing memory leaks due to bad reference counting.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
    create a new struct pid for the new process. Allocate pid_t's for the new
    process in the new pid namespace and all ancestor pid namespaces. Make the
    newly cloned process the session and process group leader.

    Since the active pid namespace is special and expected to be the first entry
    in pid->upid_list, preserve the order of pid namespaces.

    The size of 'struct pid' is dependent on the number of pid namespaces the
    process exists in, so we use multiple 'pid-caches'. Only one pid cache is
    created during system startup and this is used by processes that exist only in
    init_pid_ns.

    When a process clones its pid namespace, we create additional pid caches as
    necessary and use the pid cache to allocate 'struct pids' for that depth.
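
    The size dependence can be sketched with a flexible array member: one
    entry per namespace the process is visible in. This is a user-space
    model with hypothetical names, using calloc() where the kernel would
    use a per-depth cache.

```c
#include <stddef.h>
#include <stdlib.h>

struct sketch_upid {
	int nr;				/* pid number in one namespace */
	unsigned int ns_level;		/* nesting level of that namespace */
};

struct sketch_pid {
	unsigned int nr_ids;		/* namespaces this pid is visible in */
	struct sketch_upid numbers[];	/* one entry per namespace */
};

/* Object size for a pid visible in 'nr_ids' namespaces. */
static size_t sketch_pid_size(unsigned int nr_ids)
{
	return sizeof(struct sketch_pid) +
	       (size_t)nr_ids * sizeof(struct sketch_upid);
}

/* Allocate a zeroed pid object of the right size; caller frees it. */
static struct sketch_pid *sketch_alloc_pid(unsigned int nr_ids)
{
	struct sketch_pid *pid = calloc(1, sketch_pid_size(nr_ids));

	if (pid != NULL)
		pid->nr_ids = nr_ids;
	return pid;
}
```

    Because every depth has its own fixed object size, a separate cache per
    depth lets each allocation come from a pool of equally sized objects.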

    Note, that with this patch the newly created namespace won't work, since the
    rest of the kernel still uses global pids, but this is to be fixed soon. Init
    pid namespace still works.

    [oleg@tv-sign.ru: merge fix]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov