29 Mar, 2007

1 commit

  • In commit 0475ac0845f9295bc5f69af45f58dff2c104c8d1 when converting the
    orphaned process group handling to use struct pid I made a small
    mistake. I accidentally replaced an == with a !=.

    Besides just being a dumb thing to do apparently this has a bad side
    effect. The improper orphaned process group detection causes kwin to
    die after a suspend/resume cycle.

    I'm amazed this patch has been around as long as it has without anyone
    else noticing something funny going on.

    And the following people deserve credit for spotting and helping
    to reproduce this.

    Thanks to: Sid Boyce
    Thanks to: "Michael Wu"

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

13 Feb, 2007

4 commits

  • Every call to is_orphaned_pgrp passed in process_group(current) which is racy
    with respect to another thread changing our process group. It didn't bite us
    because we were dealing with integers and the worse we would get would be a
    stale answer.

    In switching the checks to use struct pid to be a little more efficient and
    prepare the way for pid namespaces this race became apparent.

    So I simplified the calls to the more specialized is_current_pgrp_orphaned so
    I didn't have to worry about making logic changes to avoid the race.

    Signed-off-by: Eric W. Biederman
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Modify has_stopped_jobs and will_become_orphan_pgrp to use struct pid based
    process groups. This reduces the number of hash tables looks ups and paves
    the way for multiple pid spaces.

    Signed-off-by: Eric W. Biederman
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • To properly implement a pid namespace I need to deal exclusively in terms of
    struct pid, because pid_t values become ambiguous.

    To this end session_of_pgrp is transformed to take and return a struct pid
    pointer. To avoid the need to worry about reference counting I now require my
    caller to hold the appropriate locks. Leaving callers repsonsible for
    increasing the reference count if they need access to the result outside of
    the locks.

    Since session_of_pgrp currently only has one caller and that caller simply
    uses only test the result for equality with another process group, the locking
    change means I don't actually have to acquire the tasklist_lock at all.

    tiocspgrp is also modified to take and release the lock. The logic there is a
    little more complicated but nothing I won't need when I convert pgrp of a tty
    to a struct pid pointer.

    Signed-off-by: Eric W. Biederman
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • close_files() can sometimes take long enough to trigger the soft lockup
    detector.

    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

12 Feb, 2007

1 commit

  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     

31 Jan, 2007

3 commits

  • This is based on a patch by Eric W. Biederman, who pointed out that pid
    namespaces are still fake, and we only have one ever active.

    So for the time being, we can modify any code which could access
    tsk->nsproxy->pid_ns during task exit to just use &init_pid_ns instead,
    and move the exit_task_namespaces call in do_exit() back above
    exit_notify(), so that an exiting nfs server has a valid tsk->sighand to
    work with.

    Long term, pulling pid_ns out of nsproxy might be the cleanest solution.

    Signed-off-by: Eric W. Biederman

    [ Eric's patch fixed to take care of free_pid() too ]

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in
    preparation for a better and simpler fix proposed by Eric Biederman
    (and fixed up by Serge Hallyn)

    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Fix exit race by splitting the nsproxy putting into two pieces. First
    piece reduces the nsproxy refcount. If we dropped the last reference, then
    it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else
    it returns NULL. The second piece of exiting task namespaces sets
    tsk->nsproxy to NULL, and drops the references to other namespaces and
    frees the nsproxy only if an nsproxy was passed in.

    A little awkward and should probably be reworked, but hopefully it fixes
    the NFS oops.

    Signed-off-by: Serge E. Hallyn
    Cc: Herbert Poetzl
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Cc: Daniel Hokka Zakrisson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

01 Jan, 2007

1 commit

  • Commit b2b2cbc4b2a2f389442549399a993a8306420baf introduced a user-
    visible change: ->pdeath_signal is sent only when the entire thread
    group exits.

    While this change is imho good, it may break things. So restore the
    old behaviour for now.

    Signed-off-by: Oleg Nesterov
    To: Albert Cahalan
    Cc: Eric W. Biederman
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Qi Yong
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

23 Dec, 2006

2 commits

  • This patch fixes the case when we reparent to a different thread in the
    same thread group. This modifies the code so that we do not send
    signals and do not change the signal to send to SIGCHLD unless we have
    change the thread group of our parents. It also suppresses sending
    pdeath_sig in this cas as well since the result of geppid doesn't
    change.

    Thanks to Oleg for spotting my bug of only fixing this for non-ptraced
    tasks.

    Signed-off-by: Eric W. Biederman
    Cc: Mike Galbraith
    Cc: Albert Cahalan
    Cc: Andrew Morton
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Coywolf Qi Hunt
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Christoph Hellwig has expressed concerns that the recent fdtable changes
    expose the details of the RCU methodology used to release no-longer-used
    fdtable structures to the rest of the kernel. The trivial patch below
    addresses these concerns by introducing the appropriate free_fdtable()
    calls, which simply wrap the release RCU usage. Since free_fdtable() is a
    one-liner, it makes sense to promote it to an inline helper.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

11 Dec, 2006

2 commits

  • An fdtable can either be embedded inside a files_struct or standalone (after
    being expanded). When an fdtable is being discarded after all RCU references
    to it have expired, we must either free it directly, in the standalone case,
    or free the files_struct it is contained within, in the embedded case.

    Currently the free_files field controls this behavior, but we can get rid of
    it entirely, as all the necessary information is already recorded. We can
    distinguish embedded and standalone fdtables using max_fds, and if it is
    embedded we can divine the relevant files_struct using container_of().

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     
  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

7 commits

  • All members of the process group have the same sid and it can't be == 0.

    NOTE: this code (and a similar one in sys_setpgid) was needed because it
    was possibe to have ->session == 0. It's not possible any longer since

    [PATCH] pidhash: don't use zero pids
    Commit: c7c6464117a02b0d54feb4ebeca4db70fa493678

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add a per pid_namespace child-reaper. This is needed so processes are reaped
    within the same pid space and do not spill over to the parent pid space. Its
    also needed so containers preserve existing semantic that pid == 1 would reap
    orphaned children.

    This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developped for the containers : pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Add an anonymous union and ((deprecated)) to catch direct usage of the
    session field.

    [akpm@osdl.org: fix various missed conversions]
    [jdike@addtoit.com: fix UML bug]
    Signed-off-by: Jeff Dike
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Replace occurences of task->signal->session by a new process_session() helper
    routine.

    It will be useful for pid namespaces to abstract the session pid number.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Make set_special_pids() static, the only caller is daemonize().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Fix the locking of signal->tty.

    Use ->sighand->siglock to protect ->signal->tty; this lock is already used
    by most other members of ->signal/->sighand. And unless we are 'current'
    or the tasklist_lock is held we need ->siglock to access ->signal anyway.

    (NOTE: sys_unshare() is broken wrt ->sighand locking rules)

    Note that tty_mutex is held over tty destruction, so while holding
    tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys
    are governed by their open file handles. This leaves some holes for tty
    access from signal->tty (or any other non file related tty access).

    It solves the tty SLAB scribbles we were seeing.

    (NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
    be examined by someone familiar with the security framework, I think
    it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
    invocations)

    [schwidefsky@de.ibm.com: 3270 fix]
    [akpm@osdl.org: various post-viro fixes]
    Signed-off-by: Peter Zijlstra
    Acked-by: Alan Cox
    Cc: Oleg Nesterov
    Cc: Prarit Bhargava
    Cc: Chris Wright
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Jan Kara
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

08 Dec, 2006

1 commit

  • do_exit:
    taskstats_exit_alloc()
    ...
    taskstats_exit_send()
    taskstats_exit_free()

    I think this is not good, let it be a single function exported to the core
    kernel, taskstats_exit(), which does alloc + send + free itself.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Shailabh Nagar
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Oct, 2006

1 commit

  • taskstats_tgid_free() is called on copy_process's error path. This is wrong.

    IF (clone_flags & CLONE_THREAD)
    We should not clear ->signal->taskstats, current uses it,
    it probably has a valid accumulated info.
    ELSE
    taskstats_tgid_init() set ->signal->taskstats = NULL,
    there is nothing to free.

    Move the callsite to __exit_signal(). We don't need any locking, entire
    thread group is exiting, nobody should have a reference to soon to be
    released ->signal.

    Signed-off-by: Oleg Nesterov
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Oct, 2006

3 commits

  • exit_task_namespaces() has replaced the former exit_namespace(). It
    invalidates task->nsproxy and associated namespaces. This is an issue for
    the (futur) pid namespace which is required to be valid in exit_notify().

    This patch moves exit_task_namespaces() after exit_notify() to keep nsproxy
    valid.

    Signed-off-by: Cedric Le Goater
    Cc: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • This moves the mount namespace into the nsproxy. The mount namespace count
    now refers to the number of nsproxies point to it, rather than the number of
    tasks. As a result, the unshare_namespace() function in kernel/fork.c no
    longer checks whether it is being shared.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch adds a nsproxy structure to the task struct. Later patches will
    move the fs namespace pointer into this structure, and introduce a new utsname
    namespace into the nsproxy.

    The vserver and openvz functionality, then, would be implemented in large part
    by virtualizing/isolating more and more resources into namespaces, each
    contained in the nsproxy.

    [akpm@osdl.org: build fix]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

01 Oct, 2006

2 commits


30 Sep, 2006

8 commits

  • I am not sure about this patch, I am asking Ingo to take a decision.

    task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
    it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
    live only in ->exit_state as documented in sched.h.

    Note that this state is not visible to user-space, get_task_state() masks off
    unsuitable states.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • After the previous change (->flags & PF_DEAD) (->state == EXIT_DEAD), we
    don't need PF_DEAD any longer.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
    to ensure that the exiting task will be deactivated. Note that this EXIT_DEAD
    is in fact a "random" value, we can use any bit except normal TASK_XXX values.

    It is better to set this state in do_exit() along with PF_DEAD flag and remove
    that check in schedule().

    We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
    can not change task's ->state: the 'state' argument of try_to_wake_up() can't
    have EXIT_DEAD bit. And in case when try_to_wake_up() sees a stale value of
    ->state == TASK_RUNNING it will do nothing.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove open-coded has_rt_policy(), no changes in kernel/exit.o

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If we are going to BUG() not panic() here then we should cover the case of
    the BUG being compiled out

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • This check has been obsolete since the introduction of TASK_TRACED. Now
    TASK_STOPPED always means job control stop.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This is an updated version of Eric Biederman's is_init() patch.
    (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
    replaces a few more instances of ->pid == 1 with is_init().

    Further, is_init() checks pid and thus removes dependency on Eric's other
    patches for now.

    Eric's original description:

    There are a lot of places in the kernel where we test for init
    because we give it special properties. Most significantly init
    must not die. This results in code all over the kernel test
    ->pid == 1.

    Introduce is_init to capture this case.

    With multiple pid spaces for all of the cases affected we are
    looking for only the first process on the system, not some other
    process that has pid == 1.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc:
    Acked-by: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Fixed race on put_files_struct on exec with proc. Restoring files on
    current on error path may lead to proc having a pointer to already kfree-d
    files_struct.

    ->files changing at exit.c and khtread.c are safe as exit_files() makes all
    things under lock.

    Found during OpenVZ stress testing.

    [akpm@osdl.org: add export]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

03 Sep, 2006

1 commit

  • It is not possible to find a sub-thread in ->children/->ptrace_children
    lists, ptrace_attach() does not allow to attach to sub-threads.

    Even if it was possible to ptrace the task from the same thread group,
    we can't allow to release ->group_leader while there are others (ptracer)
    threads in the same group.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Sep, 2006

1 commit

  • Cleanup allocation and freeing of tsk->delays used by delay accounting.
    This solves two problems reported for delay accounting:

    1. oops in __delayacct_blkio_ticks
    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1844.html

    Currently tsk->delays is getting freed too early in task exit which can
    cause a NULL tsk->delays to get accessed via reading of /proc//stats.
    The patch fixes this problem by freeing tsk->delays closer to when
    task_struct itself is freed up. As a result, it also eliminates the use of
    tsk->delays_lock which was only being used (inadequately) to safeguard
    access to tsk->delays while a task was exiting.

    2. Possible memory leak in kernel/delayacct.c
    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1389.html

    The patch cleans up tsk->delays allocations after a bad fork which was
    missing earlier.

    The patch has been tested to fix the problems listed above and stress
    tested with rapid calls to delay accounting's taskstats command interface
    (which is the other path that can access the same data, besides the /proc
    interface causing the oops above).

    Signed-off-by: Shailabh Nagar
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     

15 Jul, 2006

2 commits

  • On systems with a large number of cpus, with even a modest rate of tasks
    exiting per cpu, the volume of taskstats data sent on thread exit can
    overflow a userspace listener's buffers.

    One approach to avoiding overflow is to allow listeners to get data for a
    limited and specific set of cpus. By scaling the number of listeners
    and/or the cpus they monitor, userspace can handle the statistical data
    overload more gracefully.

    In this patch, each listener registers to listen to a specific set of cpus
    by specifying a cpumask. The interest is recorded per-cpu. When a task
    exits on a cpu, its taskstats data is unicast to each listener interested
    in that cpu.

    Thanks to Andrew Morton for pointing out the various scalability and
    general concerns of previous attempts and for suggesting this design.

    [akpm@osdl.org: build fix]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • Send per-tgid data only once during exit of a thread group instead of once
    with each member thread exit.

    Currently, when a thread exits, besides its per-tid data, the per-tgid data
    of its thread group is also sent out, if its thread group is non-empty.
    The per-tgid data sent consists of the sum of per-tid stats for all
    *remaining* threads of the thread group.

    This patch modifies this sending in two ways:

    - the per-tgid data is sent only when the last thread of a thread group
    exits. This cuts down heavily on the overhead of sending/receiving
    per-tgid data, especially when other exploiters of the taskstats
    interface aren't interested in per-tgid stats

    - the semantics of the per-tgid data sent are changed. Instead of being
    the sum of per-tid data for remaining threads, the value now sent is the
    true total accumalated statistics for all threads that are/were part of
    the thread group.

    The patch also addresses a minor issue where failure of one accounting
    subsystem to fill in the taskstats structure was causing the send of
    taskstats to not be sent at all.

    The patch has been tested for stability and run cerberus for over 4 hours
    on an SMP.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar