04 Jul, 2013

1 commit

  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid*" argument; it is always
    called after pid_link->pid was already set.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. The list iterator takes three arguments:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only do
    they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
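
    As an illustration only, the three-argument form can be modelled in
    userspace roughly as below. This is a sketch, not the kernel's
    linux/list.h: the structure and macro names mirror the kernel's, but the
    definitions are simplified, rely on the GNU __typeof__ extension, and
    struct item and sum_items() are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's hlist (assumption: not the real header). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a node pointer to its containing entry, or NULL at the end of the list. */
#define hlist_entry_safe(ptr, type, member) \
	((ptr) ? container_of(ptr, type, member) : NULL)

/* Three-argument iterator: the cursor is the entry itself, no extra node cursor. */
#define hlist_for_each_entry(pos, head, member)                                 \
	for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos;                                                               \
	     pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

/* Insert a node at the front of the list. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Hypothetical entry type used only for this example. */
struct item {
	int value;
	struct hlist_node link;
};

/* Walk the list with the same call shape as list_for_each_entry(). */
static int sum_items(struct hlist_head *head)
{
	struct item *it;
	int sum = 0;

	assert(head != NULL);
	hlist_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

    The caller declares only the entry pointer; the node cursor that the old
    four-argument form required is derived from pos->member inside the macro.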

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 May, 2011

1 commit


19 Apr, 2011

1 commit

  • next_pidmap() just quietly accepted whatever 'last' pid was passed in,
    which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).
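
    A rough sketch of the range check, under stated assumptions: the
    constants below are illustrative values chosen for the example (not
    taken from this log), and first_page_to_scan() is a hypothetical helper
    that only models the index arithmetic, not the kernel's next_pidmap().

```c
#include <assert.h>
#include <limits.h>

/* Assumed, illustrative values; the real ones depend on the kernel configuration. */
#define PID_MAX_LIMIT_SKETCH   (4 * 1024 * 1024)
#define BITS_PER_PAGE_SKETCH   (4096 * 8)
#define PIDMAP_ENTRIES_SKETCH  (PID_MAX_LIMIT_SKETCH / BITS_PER_PAGE_SKETCH)

/*
 * Return the index of the pidmap page where a scan for the pid after
 * 'last' would start, or -1 if there is nothing valid to scan.
 */
static int first_page_to_scan(unsigned int last)
{
	unsigned int page;

	/* Reject out-of-range input before doing any index arithmetic. */
	if (last >= PID_MAX_LIMIT_SKETCH)
		return -1;

	/*
	 * last + 1 may equal the limit, so 'page' can be one past the end.
	 * That is harmless as long as the bound check below (the loop
	 * condition in a real scan) is applied before the page is used.
	 */
	page = (last + 1) / BITS_PER_PAGE_SKETCH;
	assert(page <= PIDMAP_ENTRIES_SKETCH);
	return page < PIDMAP_ENTRIES_SKETCH ? (int)page : -1;
}
```

    The point of the sketch is the ordering: the clamp happens first, so the
    later arithmetic can never land far outside the array.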

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • This patchset is a cleanup and a preparation to unshare the pid namespace.
    These prerequisites prepare for Eric's patchset to give a file descriptor
    to a namespace and join an existing namespace.

    This patch:

    It turns out that the existing assignment in copy_process of the
    child_reaper can handle the initial assignment of child_reaper; we just
    need to generalize the test in kernel/fork.c.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 Jan, 2009

1 commit

  • A current problem with the pid namespace is that it is easy to do pid
    related work after exit_task_namespaces which drops the nsproxy pointer.

    However if we are doing pid namespace related work we are always operating
    on some struct pid which retains the pid_namespace pointer of the pid
    namespace it was allocated in.

    So provide ns_of_pid which allows us to find the pid namespace a pid was
    allocated in.
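
    A minimal userspace sketch of the idea, assuming a struct pid that
    records its own nesting level and one (nr, namespace) pair per level.
    The structure layouts here are simplified stand-ins with a fixed-size
    array, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS_SKETCH 4	/* assumption: fixed depth keeps the sketch simple */

struct pid_namespace { unsigned int level; };

struct upid {
	int nr;				/* numerical id within 'ns' */
	struct pid_namespace *ns;	/* namespace this id is valid in */
};

struct pid {
	unsigned int level;		/* level of the namespace it was allocated in */
	struct upid numbers[MAX_LEVELS_SKETCH];
};

/* The namespace a pid was allocated in is the one at its deepest level. */
static struct pid_namespace *ns_of_pid(const struct pid *pid)
{
	if (!pid)
		return NULL;
	assert(pid->level < MAX_LEVELS_SKETCH);
	return pid->numbers[pid->level].ns;
}
```

    Because the answer comes from the struct pid itself, the lookup does not
    go through current->nsproxy at all.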

    Using this we have the needed infrastructure to do pid namespace related
    work at any time we have a struct pid, removing the chance of accidentally
    having a NULL pointer dereference when accessing current->nsproxy.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

04 Dec, 2008

1 commit

  • Impact: macro side-effects fix

    This patch adds parentheses around 'pid' in the do_each_pid_task
    macro to allow callers to pass in more complex parameters.

    e.g. do_each_pid_task(*pid, type, task)
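
    The precedence problem can be shown with a much smaller macro. This is
    an illustrative sketch only: struct box, COUNT_OF() and
    count_via_double_pointer() are hypothetical names, not part of the
    kernel.

```c
#include <assert.h>

struct box { int count; };

/*
 * An unparenthesized version, e.g.
 *     #define COUNT_OF_BAD(b)  b->count
 * would expand COUNT_OF_BAD(*pp) to *pp->count, which parses as
 * *(pp->count) because '->' binds tighter than unary '*'.
 */
#define COUNT_OF(b)  ((b)->count)

/* Passing a dereference expression is safe once the argument is parenthesized. */
static int count_via_double_pointer(struct box **pp)
{
	assert(pp && *pp);
	return COUNT_OF(*pp);	/* expands to ((*pp)->count) */
}
```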

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

21 Aug, 2008

1 commit

  • When a user calls sys_setpriority(PRIO_PGRP ...) on an NPTL style
    multi-LWP process, only the task leader of the process is affected; the
    other sibling LWP threads do not receive the setting. The problem is that
    the iterator used in sys_setpriority() only iterates over one task for
    each process, ignoring all other sibling threads.

    Introduce a new macro do_each_pid_thread / while_each_pid_thread to walk
    each thread of a process. Convert 4 call sites in {set/get}priority and
    ioprio_{set/get}.

    Signed-off-by: Ken Chen
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

26 Jul, 2008

3 commits

  • It seems to me that it was a mistake marking this function as deprecated
    and scheduling it for removal, rather than resolutely removing it after
    the last caller's death.

    Anyway - better late than never.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This one has had only one user so far, kill_proc, which is now removed,
    so drop this (invalid in namespaced world) call too.

    And of course - erase all references to it from comments.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When struct pid is built on a 64 bit platform gcc has to insert padding to
    maintain the correct alignment. Simply reordering its members shrinks
    the memory usage from 88 bytes to 80.
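
    The effect can be reproduced with a toy layout. The two structures below
    are hypothetical stand-ins (not the real struct pid): the same members,
    with the two 4-byte integers either separated by pointer-aligned members
    or placed next to each other. On a typical LP64 target the second layout
    is 8 bytes smaller, but the exact sizes are ABI-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Integers separated by pointers: each int is followed by padding on LP64. */
struct pid_padded {
	int count;
	void *tasks[3];
	void *rcu[2];
	int level;
	void *numbers;
};

/* Same members, with the two ints adjacent so they share one 8-byte slot. */
struct pid_reordered {
	int count;
	int level;
	void *tasks[3];
	void *rcu[2];
	void *numbers;
};

/* Bytes saved purely by reordering; 0 where pointers and ints have equal size. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct pid_reordered) <= sizeof(struct pid_padded));
	return sizeof(struct pid_padded) - sizeof(struct pid_reordered);
}
```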

    I've successfully run with this patch on my desktop AMD64 machine.

    There are no significant kernel size changes to a default config.X86_64
    on the latest git v2.6.26-rc1

    text data bss dec hex filename
    5404828 976760 734280 7115868 6c945c vmlinux
    5404811 976760 734280 7115851 6c944b vmlinux.pid-patch

    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

30 Apr, 2008

2 commits

  • These values represent the nesting level of a namespace and of the pids
    living in it, and they are always non-negative.

    Turning this from int to unsigned int saves some space in pid.c (11 bytes on
    x86 and 64 on ia64) by letting the compiler optimize pid_nr_ns a bit.
    E.g. on ia64 this removes the sign extension calls, which the compiler
    adds for the access to pid->numbers[ns->level].

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Based on Eric W. Biederman's idea.

    Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
    caller races with setpgrp()/setsid() which does detach_pid() + attach_pid().
    This can happen even if task == current.

    Introduce the new helper, change_pid(), which should be used instead. This way
    the caller always sees the special pid != NULL, either old or new.

    Also change the prototype of attach_pid(): it always returns 0 and nobody
    checks the returned value.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Feb, 2008

1 commit


09 Feb, 2008

3 commits

  • There is a window when de_thread() switches the leader and drops
    tasklist_lock. In that window do_each_pid_task(PIDTYPE_PID) finds both new
    and old leaders.

    The problem is pretty much theoretical and probably can be ignored. Currently
    the only users of do_each_pid_task(PIDTYPE_PID) are send_sigio/send_sigurg, so
    they can send the signal to the same process twice.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • pid_vnr returns the user space pid with respect to the pid namespace the
    struct pid was allocated in. What we want before we return a pid to user
    space is the user space pid with respect to the pid namespace of current.

    pid_vnr is a very nice optimization but because it isn't quite what we want
    it is easy to use pid_vnr at times when we aren't certain the struct pid
    was allocated in our pid namespace.

    Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c, the
    parent process reported in the core dumps, and the parent process in
    get_signal_to_deliver.

    So unless the performance impact is huge, having an interface that does
    what we want instead of almost always what we want should be much more
    reliable and much less error prone.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Just like with the user namespaces, move the namespace management code into
    the separate .c file and mark the (already existing) PID_NS option as "depend
    on NAMESPACES"

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

20 Oct, 2007

7 commits

  • The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
    id, depending on which id - global or virtual - is used.

    The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
    stack to call another function - find_pid_ns(). It turned out that this
    dereference together with the push itself causes the kernel text size to
    grow too much.

    Move all these out-of-line. Together with the previous patch this saves a
    bit less than 400 bytes from the .text section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since we've switched from using pid->nr to pid->upids->nr some
    fields on struct pid are no longer needed

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Terminate all processes in a namespace when the reaper of the namespace is
    exiting. We do this by walking the pidmap of the namespace and sending
    SIGKILL to all processes.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • When searching for a task by numerical id one may need to find it using its
    global pid (as it is done now in the kernel) or by its virtual id, e.g. when
    sending a signal to a task from one namespace the sender will specify the
    task's virtual id and we should find the task by this value.

    [akpm@linux-foundation.org: fix gfs2 linkage]
    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When showing a pid to the user or getting the pid's numerical id for
    in-kernel use, the value of this id may differ depending on the namespace.

    This set of helpers is used to get the global pid nr, the virtual (i.e. seen
    by task in its namespace) nr and the nr as it is seen from the specified
    namespace.
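
    A userspace sketch of the "nr as seen from a specified namespace"
    lookup, assuming each pid stores one (nr, namespace) pair per nesting
    level. The types are simplified stand-ins with a fixed-size array, and 0
    is used here to mean "not visible from that namespace".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS_SKETCH 4	/* assumption: fixed depth keeps the sketch simple */

struct pid_namespace { unsigned int level; };

struct upid {
	int nr;				/* id as seen inside 'ns' */
	struct pid_namespace *ns;
};

struct pid {
	unsigned int level;		/* deepest namespace level this pid lives at */
	struct upid numbers[MAX_LEVELS_SKETCH];
};

/* Return the id of 'pid' as seen from 'ns', or 0 if it is not visible there. */
static int pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
	const struct upid *upid;

	assert(ns != NULL);
	if (!pid || ns->level > pid->level)
		return 0;	/* 'ns' is nested deeper than the pid itself */

	upid = &pid->numbers[ns->level];
	/* Same level but a different namespace means a sibling: not visible. */
	return upid->ns == ns ? upid->nr : 0;
}
```

    With this shape, the global nr is simply the entry at level 0, and the
    virtual nr is the lookup with the caller's own namespace.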

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each struct upid element of struct pid has to be initialized properly, i.e.
    its nr must be allocated from the appropriate pidmap and its ns set to the
    appropriate namespace.

    When allocating a new pid, we need to know the namespace this pid will live
    in, so the additional argument is added to alloc_pid().

    On the other hand, the rest of the kernel still uses the pid->nr and
    pid->pid_chain fields, so these ones are still initialized, but this will be
    removed soon.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since a task will be visible from different pid namespaces, each task has
    to be addressed by multiple pids. struct upid stores the information about
    which id refers to which namespace.

    The construction looks like this. Each struct pid carries the reference
    counter and the list of tasks attached to this pid. At its end it has a
    variable length array of struct upid-s. Each struct upid has a numerical id
    (the pid itself) and a pointer to the namespace this ID is valid in, and is
    hashed into a pid_hash for searching the pids.

    The nr and pid_chain fields are kept in struct pid for a while to make the
    kernel still work (no patch initializes the upids yet), but they will be
    removed at the end of this series when we switch to upids completely.
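
    The layout described above can be sketched in userspace with a C99
    flexible array member. This is an assumption-laden model: the field
    names are illustrative, alloc_pid_sketch() is a hypothetical helper, the
    ids are supplied by the caller instead of being taken from a pidmap, and
    the hashing into pid_hash is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

struct pid_namespace {
	unsigned int level;		/* 0 for the initial namespace */
	struct pid_namespace *parent;	/* NULL for the initial namespace */
};

struct upid {
	int nr;				/* numerical id within 'ns' */
	struct pid_namespace *ns;	/* namespace this id is valid in */
};

struct pid {
	int count;			/* reference counter */
	unsigned int level;		/* index of the last valid numbers[] entry */
	struct upid numbers[];		/* one entry per namespace level */
};

/*
 * Allocate a pid living in 'ns' and fill one upid per level, walking from
 * the deepest namespace up to the initial one. nrs[i] is the id to use at
 * level i (an input here; a real allocator would pick it itself).
 */
static struct pid *alloc_pid_sketch(struct pid_namespace *ns, const int *nrs)
{
	struct pid *pid;
	int i;

	assert(ns != NULL && nrs != NULL);
	pid = malloc(sizeof(*pid) + (ns->level + 1) * sizeof(struct upid));
	if (!pid)
		return NULL;

	pid->count = 1;
	pid->level = ns->level;
	for (i = (int)pid->level; i >= 0; i--) {
		assert(ns != NULL);
		pid->numbers[i].nr = nrs[i];
		pid->numbers[i].ns = ns;
		ns = ns->parent;
	}
	return pid;
}
```

    The single allocation grows with the nesting depth, so a pid in the
    initial namespace pays for exactly one struct upid.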

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

11 May, 2007

2 commits

  • Statically initialize a struct pid for the swapper process (pid_t == 0) and
    attach it to init_task. This is needed so task_pid(), task_pgrp() and
    task_session() interfaces work on the swapper process also.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • attach_pid() currently takes a pid_t and then uses find_pid() to find the
    corresponding struct pid. Sometimes we already have the struct pid. We can
    then skip find_pid() if attach_pid() were to take a struct pid parameter.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

13 Feb, 2007

1 commit


09 Dec, 2006

1 commit

  • Add a per pid_namespace child-reaper. This is needed so processes are reaped
    within the same pid space and do not spill over to the parent pid space. It's
    also needed so containers preserve the existing semantic that pid == 1 would
    reap orphaned children.

    This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

03 Oct, 2006

1 commit


02 Oct, 2006

5 commits

  • proc_pid_make_inode:

    ei->pid = get_pid(task_pid(task));

    I think this is not safe. get_pid() can be preempted after checking "pid
    != NULL". Then the task exits, does detach_pid(), and RCU frees the pid.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • I think it is hardly possible to read the current do_each_task_pid(). The
    new version is much simpler and makes the code smaller.

    Only the do_each_task_pid change is tested, the do_each_pid_task isn't.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • As we stop storing pid_t's and move to storing struct pid *, we need a way to
    get the pid_t from the struct pid to report to user space what we have stored.

    Having a clean well defined way to do this is especially important as we move
    to multiple pid spaces, as we may need to report a different value to the
    caller depending on which pid space the caller is in.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • To avoid pid rollover confusion the kernel needs to work with struct pid *
    instead of pid_t. Currently there is not an iterator that walks through all
    of the tasks of a given pid type starting with a struct pid. This prevents us
    from replacing some pid_t instances with struct pid. So this patch adds
    do_each_pid_task which walks through the set of tasks for a given pid type
    starting with a struct pid.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The problem: An opendir, readdir, closedir sequence can fail to report
    process ids that are continually in use throughout the sequence of system
    calls. For this race to trigger the process that proc_pid_readdir stops at
    must exit before readdir is called again.

    This can cause ps to fail to report processes, and it is in violation of
    posix guarantees and normal application expectations with respect to
    readdir.

    Currently there is no way to work around this problem in user space short
    of providing a gargantuan buffer to user space so the directory read all
    happens in one system call.

    This patch implements the normal directory semantics for proc, that
    guarantee that a directory entry that is neither created nor destroyed
    while reading the directory will be returned. For directory entries that
    are either created or destroyed during the readdir you may or may not see
    them. Furthermore you may seek to a directory offset you have previously
    seen.

    These are the guarantees that ext[23] provides and that posix requires, and
    more importantly that user space expects. Plus it is a simple semantic on
    which to implement reliable service. It is just a matter of calling readdir
    a second time if you are wondering if something new has shown up.

    These better semantics are implemented by scanning through the pids in
    numerical order and by making the file offset a pid plus a fixed offset.

    The pid scan happens on the pid bitmap, which when you look at it is
    remarkably efficient for a brute force algorithm. A typical cache line
    is 64 bytes and thus covers space for 64*8 == 512 pids, so there are
    only 64 cache lines for the entire 32K pid space. A typical system
    will have 100 pids or more, so this is actually fewer cache lines than
    we would have to look at to scan a linked list, and the worst case of
    having to scan the entire pid bitmap is pretty reasonable.
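
    The arithmetic and the brute-force scan can be sketched as follows. In
    decimal, a 64-byte cache line covers 64 * 8 = 512 pids (0x200), so a 32K
    pid space fits in 32768 / 512 = 64 cache lines (0x40). The bitmap,
    mark_pid() and next_used_pid() below are hypothetical userspace
    stand-ins that assume 8-bit bytes; they are not the kernel's pidmap code.

```c
#include <assert.h>
#include <limits.h>

#define PID_SPACE                 32768
#define CACHE_LINE_BYTES          64
#define PIDS_PER_CACHE_LINE       (CACHE_LINE_BYTES * CHAR_BIT)
#define CACHE_LINES_FOR_PID_SPACE (PID_SPACE / PIDS_PER_CACHE_LINE)

/* One bit per pid; a set bit means the pid is in use. */
static unsigned char pidmap[PID_SPACE / CHAR_BIT];

static void mark_pid(int nr)
{
	assert(nr >= 0 && nr < PID_SPACE);
	pidmap[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

/* Return the smallest in-use pid >= start, or -1 if there is none. */
static int next_used_pid(int start)
{
	int nr;

	assert(start >= 0);
	for (nr = start; nr < PID_SPACE; nr++)
		if (pidmap[nr / CHAR_BIT] & (1u << (nr % CHAR_BIT)))
			return nr;
	return -1;
}
```

    A readdir-style walk then resumes from "last pid returned + 1", which is
    why a pid that stays alive for the whole walk cannot be skipped.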

    If we need something more efficient we can go to a more efficient data
    structure for indexing the pids, but for now what we have should be
    sufficient.

    In addition this takes no additional locks and is actually less code than
    what we are doing now.

    Also another very subtle bug in this area has been fixed. It is possible
    to catch a task in the middle of de_thread where a thread is assuming the
    identity of its thread group leader. This patch carefully handles that
    case so if we hit it we don't fail to return the pid that is undergoing
    the de_thread dance.

    Thanks to KAMEZAWA Hiroyuki for
    providing the first fix, pointing this out and working on it.

    [oleg@tv-sign.ru: fix it]
    Signed-off-by: Eric W. Biederman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

27 Sep, 2006

1 commit

  • In de_thread we move pids from one process to another, a rather ugly case.
    The function transfer_pid makes it clear what we are doing, and makes the
    action atomic. This is useful if we ever want to atomically traverse the
    process group and session lists in an RCU-safe manner.

    Even ignoring the atomic properties this change should be a win, as
    transfer_pid should be less code to execute than executing both attach_pid
    and detach_pid, and this should make de_thread slightly smaller as only a
    single function call needs to be emitted. The only downside is that the
    code might be slower to execute as the odds are against transfer_pid being
    in cache.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

01 Apr, 2006

1 commit

  • Simplifies the code, reduces the need for 4 pid hash tables, and makes the
    code more capable.

    In the discussions I had with Oleg it was felt that to a large extent the
    cleanup itself justified the work. Struct pid being dynamically allocated
    meant we could create the hash table entry when the pid was allocated and
    free the hash table entry when the pid was freed, instead of playing with
    the hash lists whenever a process would attach to or detach from a
    process.

    For myself the fact that it gave what my previous task_ref patch gave for free
    with simpler code was a big win. The problem is that if you hold a reference
    to struct task_struct you lock in 10K of low memory. If you do that in a user
    controllable way like /proc does, with an unprivileged but hostile user space
    application with typical resource limits of 1000 fds and 100 processes I can
    trigger the OOM killer by consuming all of low memory with task structs, on a
    machine with 1GB of low memory.

    If I instead hold a reference to struct pid which holds a pointer to my
    task_struct, I don't suffer from that problem because struct pid is 2 orders
    of magnitude smaller. In fact struct pid is small enough that most other
    kernel data structures dwarf it, so simply limiting the number of referring
    data structures is enough to prevent exhaustion of low memory.

    This splits the current struct pid into two structures, struct pid and struct
    pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
    struct pid_link is the per process linkage into the hash tables and lives in
    struct task_struct. struct pid is given an independent lifetime, and holds
    pointers to each of the pid types.

    The independent life of struct pid simplifies attach_pid and detach_pid,
    because we are always manipulating the list of pids and not the hash table.
    In addition, giving struct pid an independent life makes the concept much
    more powerful.

    Kernel data structures can now embed a struct pid * instead of a pid_t and
    not suffer from pid wrap around problems or from keeping unnecessarily
    large amounts of memory allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

29 Mar, 2006

2 commits

  • This patch kills PIDTYPE_TGID pid_type thus saving one hash table in
    kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a
    preparation for the further tref/pids rework.

    This patch adds 'struct list_head thread_group' to 'struct task_struct'
    instead.

    We don't detach the group leader from the PIDTYPE_PID namespace until
    another thread inherits its ->pid == ->tgid, so we are safe wrt premature
    free_pidmap(->tgid) call.

    Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
    Should the need arise, we can use find_task_by_pid()->group_leader.

    Signed-off-by: Oleg Nesterov
    Acked-By: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • switch_exec_pids is only called from de_thread by way of exec, and it is
    only called when we are exec'ing from a non thread group leader.

    Currently switch_exec_pids gives the leader the pid of the thread and
    unhashes and rehashes all of the process groups. The leader is already in
    the EXIT_DEAD state so no one cares about its pids. The only concern for
    the leader is that __unhash_process called from release_task will function
    correctly. If we don't touch the leader at all we know that
    __unhash_process will work fine so there is no need to touch the leader.

    For the task becoming the thread group leader, we just need to give it the
    pid of the old thread group leader, add it to the task list, and attach it
    to the session and the process group of the thread group.

    Currently de_thread is also adding the task to the task list which is just
    silly.

    Currently the only user of __detach_pid besides detach_pid is
    switch_exec_pids because of the ugly extra work that was being
    performed.

    So this patch removes switch_exec_pids because it is doing too much, it is
    creating an unnecessary special case in pid.c, doing work duplicated in
    de_thread, and generally obscuring what is going on.

    The necessary work is added to de_thread, and it seems to be a little
    clearer there what is going on.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman