20 Oct, 2007

15 commits

  • Since these are expanded into call to pid_nr_ns() anyway, it's OK to move
    the whole routine out-of-line. This is a cheap way to save ~100 bytes from
    vmlinux. Together with the previous two patches, it saves half-a-kilo from
    the vmlinux.

    Un-inline other (currently inlined) functions must be done with additional
    performance testing.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
    id, depending on whic id - global or virtual - is used.

    The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
    stack to call another function - find_pid_ns(). It turned out, that this
    dereference together with the push itself cause the kernel text size to
    grow too much.

    Move all these out-of-line. Together with the previous patch this saves a
    bit less that 400 bytes from .text section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since we've switched from using pid->nr to pid->upids->nr some
    fields on struct pid are no longer needed

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The find_task_by_something is a set of macros are used to find task by pid
    depending on what kind of pid is proposed - global or virtual one. All of
    them are wrappers above the most generic one - find_task_by_pid_type_ns() -
    and just substitute some args for it.

    It turned out, that dereferencing the current->nsproxy->pid_ns construction
    and pushing one more argument on the stack inline cause kernel text size to
    grow.

    This patch moves all this stuff out-of-line into kernel/pid.c. Together
    with the next patch it saves a bit less than 400 bytes from the .text
    section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Terminate all processes in a namespace when the reaper of the namespace is
    exiting. We do this by walking the pidmap of the namespace and sending
    SIGKILL to all processes.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • This will help fixing memory leaks due to bad reference counting.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
    create a new struct pid for the new process. Allocate pid_t's for the new
    process in the new pid namespace and all ancestor pid namespaces. Make the
    newly cloned process the session and process group leader.

    Since the active pid namespace is special and expected to be the first entry
    in pid->upid_list, preserve the order of pid namespaces.

    The size of 'struct pid' is dependent on the the number of pid namespaces the
    process exists in, so we use multiple pid-caches'. Only one pid cache is
    created during system startup and this used by processes that exist only in
    init_pid_ns.

    When a process clones its pid namespace, we create additional pid caches as
    necessary and use the pid cache to allocate 'struct pids' for that depth.

    Note, that with this patch the newly created namespace won't work, since the
    rest of the kernel still uses global pids, but this is to be fixed soon. Init
    pid namespace still works.

    [oleg@tv-sign.ru: merge fix]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • * remove pid.h from pid_namespaces.h;
    * rework is_(cgroup|global)_init;
    * optimize (get|put)_pid_ns for init_pid_ns;
    * declare task_child_reaper to return actual reaper.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When searching the task by numerical id on may need to find it using global
    pid (as it is done now in kernel) or by its virtual id, e.g. when sending a
    signal to a task from one namespace the sender will specify the task's virtual
    id and we should find the task by this value.

    [akpm@linux-foundation.org: fix gfs2 linkage]
    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When showing pid to user or getting the pid numerical id for in-kernel use the
    value of this id may differ depending on the namespace.

    This set of helpers is used to get the global pid nr, the virtual (i.e. seen
    by task in its namespace) nr and the nr as it is seen from the specified
    namespace.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each struct upid element of struct pid has to be initialized properly, i.e.
    its nr mst be allocated from appropriate pidmap and ns set to appropriate
    namespace.

    When allocating a new pid, we need to know the namespace this pid will live
    in, so the additional argument is added to alloc_pid().

    On the other hand, the rest of the kernel still uses the pid->nr and
    pid->pid_chain fields, so these ones are still initialized, but this will be
    removed soon.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each namespace has a parent and is characterized by its "level". Level is the
    number of the namespace generation. E.g. init namespace has level 0, after
    cloning new one it will have level 1, the next one - 2 and so on and so forth.
    This level is not explicitly limited.

    True hierarchy must have some way to find each namespace's children, but it is
    not used in the patches, so this ability is not added (yet).

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has it's tsk->pid == 1.

    A global init also has it's tsk->pid == 1 and it's active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • With multiple pid namespaces, a process is known by some pid_t in every
    ancestor pid namespace. Every time the process forks, the child process also
    gets a pid_t in every ancestor pid namespace.

    While a process is visible in >=1 pid namespaces, it can see pid_t's in only
    one pid namespace. We call this pid namespace it's "active pid namespace",
    and it is always the youngest pid namespace in which the process is known.

    This patch defines and uses a wrapper to find the active pid namespace of a
    process. The implementation of the wrapper will be changed in when support
    for multiple pid namespaces are added.

    Changelog:
    2.6.22-rc4-mm2-pidns1:
    - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
    task_active_pid_ns() in child_reaper() since task->nsproxy
    can be NULL during task exit (so child_reaper() continues to
    use init_pid_ns).

    to implement child_reaper() since init_pid_ns.child_reaper to
    implement child_reaper() since tsk->nsproxy can be NULL during exit.

    2.6.21-rc6-mm1:
    - Rename task_pid_ns() to task_active_pid_ns() to reflect that a
    process can have multiple pid namespaces.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Add kmem_cache to pid_namespace to allocate pids from.

    Since both implementations expand the struct pid to carry more numerical
    values each namespace should have separate cache to store pids of different
    sizes.

    Each kmem cache is name "pid_", where is the number of numerical ids
    on the pid. Different namespaces with same level of nesting will have same
    caches.

    This patch has two FIXMEs that are to be fixed after we reach the consensus
    about the struct pid itself.

    The first one is that the namespace to free the pid from in free_pid() must be
    taken from pid. Now the init_pid_ns is used.

    The second FIXME is about the cache allocation. When we do know how long the
    object will be then we'll have to calculate this size in create_pid_cachep.
    Right now the sizeof(struct pid) value is used.

    [akpm@linux-foundation.org: coding-style repair]
    Signed-off-by: Pavel Emelianov
    Acked-by: Cedric Le Goater
    Acked-by: Sukadev Bhattiprolu
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

17 Jul, 2007

1 commit

  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int. Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

11 May, 2007

2 commits

  • Statically initialize a struct pid for the swapper process (pid_t == 0) and
    attach it to init_task. This is needed so task_pid(), task_pgrp() and
    task_session() interfaces work on the swapper process also.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • attach_pid() currently takes a pid_t and then uses find_pid() to find the
    corresponding struct pid. Sometimes we already have the struct pid. We can
    then skip find_pid() if attach_pid() were to take a struct pid parameter.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

09 May, 2007

1 commit

  • sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
    just incase.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

08 May, 2007

1 commit

  • This patch provides a new macro

    KMEM_CACHE(, )

    to simplify slab creation. KMEM_CACHE creates a slab with the name of the
    struct, with the size of the struct and with the alignment of the struct.
    Additional slab flags may be specified if necessary.

    Example

    struct test_slab {
    int a,b,c;
    struct list_head;
    } __cacheline_aligned_in_smp;

    test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

    will create a new slab named "test_slab" of the size sizeof(struct
    test_slab) and aligned to the alignment of test slab. If it fails then we
    panic.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

31 Jan, 2007

1 commit

  • This is based on a patch by Eric W. Biederman, who pointed out that pid
    namespaces are still fake, and we only have one ever active.

    So for the time being, we can modify any code which could access
    tsk->nsproxy->pid_ns during task exit to just use &init_pid_ns instead,
    and move the exit_task_namespaces call in do_exit() back above
    exit_notify(), so that an exiting nfs server has a valid tsk->sighand to
    work with.

    Long term, pulling pid_ns out of nsproxy might be the cleanest solution.

    Signed-off-by: Eric W. Biederman

    [ Eric's patch fixed to take care of free_pid() too ]

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

09 Dec, 2006

4 commits

  • Add a per pid_namespace child-reaper. This is needed so processes are reaped
    within the same pid space and do not spill over to the parent pid space. Its
    also needed so containers preserve existing semantic that pid == 1 would reap
    orphaned children.

    This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Add the pid namespace framework to the nsproxy object. The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.

    There is no configuration option to activate or deactivate this feature
    because this not relevant for the moment.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Rename struct pspace to struct pid_namespace for consistency with other
    namespaces (uts_namespace and ipc_namespace). Also rename
    include/linux/pspace.h to include/linux/pid_namespace.h and variables from
    pspace to pid_ns.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

08 Dec, 2006

1 commit

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Oct, 2006

8 commits

  • proc_pid_make_inode:

    ei->pid = get_pid(task_pid(task));

    I think this is not safe. get_pid() can be preempted after checking "pid
    != NULL". Then the task exits, does detach_pid(), and RCU frees the pid.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This updates my proc: readdir race fix (take 3) patch
    to account for the changes made by: Sukadev Bhattiprolu
    to introduce struct pspace.

    Signed-off-by: Eric W. Biederman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Define a per-container pid space object. And create one instance of this
    object, init_pspace, to define the entire pid space. Subsequent patches
    will provide/use interfaces to create/destroy pid spaces.

    Its a subset/rework of Eric Biederman's patch
    http://lkml.org/lkml/2006/2/6/285 .

    Signed-off-by: Eric Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Move struct pidmap and PIDMAP_ENTRIES to a new file, include/linux/pspace.h
    where it will be used in subsequent patches to define pid spaces.

    Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285

    [akpm@osdl.org: cleanups]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Use struct pidmap instead of pidmap_t.

    This updates my proc: readdir race fix (take 3) patch
    to account for the changes made by: Sukadev Bhattiprolu
    to kill pidmap_t.

    Signed-off-by: Eric W. Biederman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Use struct pidmap instead of pidmap_t.

    Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/271.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • pids aren't something that drivers should care about. However there are a lot
    of helper layers in the kernel that do care, and are built as modules. Before
    I can convert them to using struct pid instead of pid_t I need to export the
    appropriate symbols so they can continue to be built.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The problem: An opendir, readdir, closedir sequence can fail to report
    process ids that are continually in use throughout the sequence of system
    calls. For this race to trigger the process that proc_pid_readdir stops at
    must exit before readdir is called again.

    This can cause ps to fail to report processes, and it is in violation of
    posix guarantees and normal application expectations with respect to
    readdir.

    Currently there is no way to work around this problem in user space short
    of providing a gargantuan buffer to user space so the directory read all
    happens in on system call.

    This patch implements the normal directory semantics for proc, that
    guarantee that a directory entry that is neither created nor destroyed
    while reading the directory entry will be returned. For directory that are
    either created or destroyed during the readdir you may or may not see them.
    Furthermore you may seek to a directory offset you have previously seen.

    These are the guarantee that ext[23] provides and that posix requires, and
    more importantly that user space expects. Plus it is a simple semantic to
    implement reliable service. It is just a matter of calling readdir a
    second time if you are wondering if something new has show up.

    These better semantics are implemented by scanning through the pids in
    numerical order and by making the file offset a pid plus a fixed offset.

    The pid scan happens on the pid bitmap, which when you look at it is
    remarkably efficient for a brute force algorithm. Given that a typical
    cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There
    are only 40 cache lines for the entire 32K pid space. A typical system
    will have 100 pids or more so this is actually fewer cache lines we have to
    look at to scan a linked list, and the worst case of having to scan the
    entire pid bitmap is pretty reasonable.

    If we need something more efficient we can go to a more efficient data
    structure for indexing the pids, but for now what we have should be
    sufficient.

    In addition this takes no additional locks and is actually less code than
    what we are doing now.

    Also another very subtle bug in this area has been fixed. It is possible
    to catch a task in the middle of de_thread where a thread is assuming the
    thread of it's thread group leader. This patch carefully handles that case
    so if we hit it we don't fail to return the pid, that is undergoing the
    de_thread dance.

    Thanks to KAMEZAWA Hiroyuki for
    providing the first fix, pointing this out and working on it.

    [oleg@tv-sign.ru: fix it]
    Signed-off-by: Eric W. Biederman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

27 Sep, 2006

2 commits

  • With the patches flying between Oleg and myself somehow this temporary
    debug code got left in pid.c. It was never intended to make it to the
    stable kernel.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • In de_thread we move pids from one process to another, a rather ugly case.
    The function transfer_pid makes it clear what we are doing, and makes the
    action atomic. This is useful we ever want to atomically traverse the
    process group and session lists, in a rcu safe manner.

    Even if the atomic properties this change should be a win as transfer_pid
    should be less code to execute than executing both attach_pid and
    detach_pid, and this should make de_thread slightly smaller as only a
    single function call needs to be emitted. The only downside is that the
    code might be slower to execute as the odds are against transfer_pid being
    in cache.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

04 Jul, 2006

1 commit


01 Apr, 2006

1 commit

  • Simplifies the code, reduces the need for 4 pid hash tables, and makes the
    code more capable.

    In the discussions I had with Oleg it was felt that to a large extent the
    cleanup itself justified the work. With struct pid being dynamically
    allocated meant we could create the hash table entry when the pid was
    allocated and free the hash table entry when the pid was freed. Instead of
    playing with the hash lists when ever a process would attach or detach to a
    process.

    For myself the fact that it gave what my previous task_ref patch gave for free
    with simpler code was a big win. The problem is that if you hold a reference
    to struct task_struct you lock in 10K of low memory. If you do that in a user
    controllable way like /proc does, with an unprivileged but hostile user space
    application with typical resource limits of 1000 fds and 100 processes I can
    trigger the OOM killer by consuming all of low memory with task structs, on a
    machine wight 1GB of low memory.

    If I instead hold a reference to struct pid which holds a pointer to my
    task_struct, I don't suffer from that problem because struct pid is 2 orders
    of magnitude smaller. In fact struct pid is small enough that most other
    kernel data structures dwarf it, so simply limiting the number of referring
    data structures is enough to prevent exhaustion of low memory.

    This splits the current struct pid into two structures, struct pid and struct
    pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
    struct pid_link is the per process linkage into the hash tables and lives in
    struct task_struct. struct pid is given an indepedent lifetime, and holds
    pointers to each of the pid types.

    The independent life of struct pid simplifies attach_pid, and detach_pid,
    because we are always manipulating the list of pids and not the hash table.
    In addition in giving struct pid an indpendent life it makes the concept much
    more powerful.

    Kernel data structures can now embed a struct pid * instead of a pid_t and
    not suffer from pid wrap around problems or from keeping unnecessarily
    large amounts of memory allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

29 Mar, 2006

2 commits

  • fork_idle() does unhash_process() just after copy_process(). Contrary,
    boot_cpu's idle thread explicitely registers itself for each pid_type with nr
    = 0.

    copy_process() already checks p->pid != 0 before process_counts++, I think we
    can just skip attach_pid() calls and job control inits for idle threads and
    kill unhash_process(). We don't need to cleanup ->proc_dentry in fork_idle()
    because with this patch idle threads are never hashed in
    kernel/pid.c:pid_hash[].

    We don't need to hash pid == 0 in pidmap_init(). free_pidmap() is never
    called with pid == 0 arg, so it will never be reused. So it is still possible
    to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV.

    However with this patch we don't hash pid == 0 for PIDTYPE_PID case. We still
    have have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and
    kernel threads which don't call daemonize().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • switch_exec_pids is only called from de_thread by way of exec, and it is
    only called when we are exec'ing from a non thread group leader.

    Currently switch_exec_pids gives the leader the pid of the thread and
    unhashes and rehashes all of the process groups. The leader is already in
    the EXIT_DEAD state so no one cares about it's pids. The only concern for
    the leader is that __unhash_process called from release_task will function
    correctly. If we don't touch the leader at all we know that
    __unhash_process will work fine so there is no need to touch the leader.

    For the task becomming the thread group leader, we just need to give it the
    pid of the old thread group leader, add it to the task list, and attach it
    to the session and the process group of the thread group.

    Currently de_thread is also adding the task to the task list which is just
    silly.

    Currently the only leader of __detach_pid besides detach_pid is
    switch_exec_pids because of the ugly extra work that was being
    performed.

    So this patch removes switch_exec_pids because it is doing too much, it is
    creating an unnecessary special case in pid.c, duing work duplicated in
    de_thread, and generally obscuring what it is going on.

    The necessary work is added to de_thread, and it seems to be a little
    clearer there what is going on.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman