30 Nov, 2007

2 commits

  • In wait_task_stopped() exit_code already contains the right value for the
    si_status member of siginfo, and this is simply set in the non WNOWAIT
    case.

    If you call waitid() with a stopped or traced process, you'll get the signal
    in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
    same time, you'll get the signal << 8 | 0x7f

    Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
    and add 0x7f if we were returning it in the user status field and that
    isn't used for any function that permits WNOWAIT.

    Signed-off-by: Scott James Remnant
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott James Remnant
     
  • wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
    we can read an already freed memory. Use the cached pid_t value.

    Signed-off-by: Oleg Nesterov
    Looks-good-to: Roland McGrath
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Nov, 2007

1 commit

  • The original meaning of the old test (p->state > TASK_STOPPED) was
    "not dead", since it was before TASK_TRACED existed and before the
    state/exit_state split. It was a wrong correction in commit
    14bf01bb0599c89fc7f426d20353b76e12555308 to make this test for
    TASK_TRACED instead. It should have been changed when TASK_TRACED
    was introducted and again when exit_state was introduced.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Acked-by: Scott James Remnant
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

20 Oct, 2007

16 commits

  • Save ~650 bytes here.

    add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658)
    function old new delta
    __copy_fs_struct - 202 +202
    __put_fs_struct - 112 +112
    __exit_fs - 58 +58
    __exit_files - 58 +58
    exit_files 58 2 -56
    put_fs_struct 112 5 -107
    exit_fs 161 2 -159
    sys_unshare 774 590 -184
    copy_process 4031 3840 -191
    do_exit 1791 1597 -194
    copy_fs_struct 202 5 -197

    No difference in lmbench lat_proc tests on 2-way Opteron 246.
    Smaaaal degradation on UP P4 (within errors).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Arjan van de Ven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The task_struct->pid member is going to be deprecated, so start
    using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
    the kernel.

    The first thing to start with is the pid, printed to dmesg - in
    this case we may safely use task_pid_nr(). Besides, printks produce
    more (much more) than a half of all the explicit pid usage.

    [akpm@linux-foundation.org: git-drm went and changed lots of stuff]
    Signed-off-by: Pavel Emelyanov
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The pgrp field is not used widely around the kernel so it is now marked as
    deprecated with appropriate comment.

    The initialization of INIT_SIGNALS is trimmed because
    a) they are set to 0 automatically;
    b) gcc cannot properly initialize two anonymous (the second one
    is the one with the session) unions. In this particular case
    to make it compile we'd have to add some field initialized
    right before the .pgrp.

    This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
    (from Cedric), but for the pgrp field.

    Some progress report:

    We have to deprecate the pid, tgid, session and pgrp fields on struct
    task_struct and struct signal_struct. The session and pgrp are already
    deprecated. The tgid value is close to being such - the worst known usage
    in in fs/locks.c and audit code. The pid field deprecation is mainly
    blocked by numerous printk-s around the kernel that print the tsk->pid to
    log.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is the largest patch in the set. Make all (I hope) the places where
    the pid is shown to or get from user operate on the virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
    or the pid's global nr, obtained with pid_nr() call;
    - when seeking the task from kernel code with the stored id one
    should use find_task_by_pid() call that works with global pids;
    - when showing pid's numerical value to the user the virtual one
    should be used, but however when one shows task's pid outside this
    task's namespace the global one is to be used;
    - when getting the pid from userspace one need to consider this as
    the virtual one and use appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Terminate all processes in a namespace when the reaper of the namespace is
    exiting. We do this by walking the pidmap of the namespace and sending
    SIGKILL to all processes.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • The first part is trivial - we just make the proc_flush_task() to operate on
    arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
    it.

    The other change is more tricky: I moved the proc_flush_task() call in
    release_task() higher to address the following problem.

    When flushing task from many proc trees we need to know the set of ids (not
    just one pid) to find the dentries' names to flush. Thus we need to pass the
    task's pid to proc_flush_task() as struct pid is the only object that can
    provide all the pid numbers. But after __exit_signal() task has detached all
    his pids and this information is lost.

    This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
    tree and keep them in hash (since pids are still alive before __exit_signal())
    till the next shrink, but since proc_flush_task() does not provide a 100%
    guarantee that the dentries will be flushed, this is OK to do so.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Make task release its namespaces after it has reparented all his children to
    child_reaper, but before it notifies its parent about its death.

    The reason to release namespaces after reparenting is that when task exits it
    may send a signal to its parent (SIGCHLD), but if the parent has already
    exited its namespaces there will be no way to decide what pid to dever to him
    - parent can be from different namespace.

    The reason to release namespace before notifying the parent it that when task
    sends a SIGCHLD to parent it can call wait() on this taks and release it. But
    releasing the mnt namespace implies dropping of all the mounts in the mnt
    namespace and NFS expects the task to have valid sighand pointer.

    Thanks to Oleg for pointing out some races that can apear and helping with
    patches and fixes.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • A pid namespace is a "view" of a particular set of tasks on the system. They
    work in a similar way to filesystem namespaces. A file (or a process) can be
    accessed in multiple namespaces, but it may have a different name in each. In
    a filesystem, this name might be /etc/passwd in one namespace, but
    /chroot/etc/passwd in another.

    For processes, a process may have pid 1234 in one namespace, but be pid 1 in
    another. This allows new pid namespaces to have basically arbitrary pids, and
    not have to worry about what pids exist in other namespaces. This is
    essential for checkpoint/restart where a restarted process's pid might collide
    with an existing process on the system's pid.

    In this particular implementation, pid namespaces have a parent-child
    relationship, just like processes. A process in a pid namespace may see all
    of the processes in the same namespace, as well as all of the processes in all
    of the namespaces which are children of its namespace. Processes may not,
    however, see others which are in their parent's namespace, but not in their
    own. The same goes for sibling namespaces.

    The know issue to be solved in the nearest future is signal handling in the
    namespace boundary. That is, currently the namespace's init is treated like
    an ordinary task that can be killed from within an namespace. Ideally, the
    signal handling by the namespace's init should have two sides: when signaling
    the init from its namespace, the init should look like a real init task, i.e.
    receive only those signals, that is explicitly wants to; when signaling the
    init from one of the parent namespaces, init should look like an ordinary
    task, i.e. receive any signal, only taking the general permissions into
    account.

    The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and
    we eventually came to almost the same implementation, which differed in some
    details. This set is based on Pavel's patches, but it includes comments and
    patches that from Sukadev.

    Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made
    valuable advises on how to make this set cleaner.

    This patch:

    We have to call exit_task_namespaces() only after the exiting task has
    reparented all his children and is sure that no other threads will reparent
    theirs for it. Why this is needed is explained in appropriate patch. This
    one only reworks the forget_original_parent() so that after calling this a
    task cannot be/become parent of any other task.

    We check PF_EXITING instead of ->exit_state while choosing the new parent.
    Note that tasklits_lock acts as a barrier, everyone who takes tasklist after
    us (when forget_original_parent() drops it) must see PF_EXITING.

    The other changes are just cleanups. They just move some code from
    exit_notify to forget_original_parent(). It is a bit silly to declare
    ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
    forget_original_parent(), unlock-lock-unlock tasklist, and then use
    ptrace_dead.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Signed-off-by: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Walker
     
  • kernel/exit.c: Convert list_for_each(_safe) to
    list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
    and do_wait()

    Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • When someone wants to deal with some other taks's namespaces it has to lock
    the task and then to get the desired namespace if the one exists. This is
    slow on read-only paths and may be impossible in some cases.

    E.g. Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.

    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is rather rare operation and we
    can sacrifice its speed to solve the issues above.

    The access to other task namespaces is proposed to be performed
    like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
    / *
    * work with the namespaces here
    * e.g. get the reference on one of them
    * /
    } / *
    * NULL task_nsproxy() means that this task is
    * almost dead (zombie)
    * /
    rcu_read_unlock();

    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.

    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Serge Hallyn
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has it's tsk->pid == 1.

    A global init also has it's tsk->pid == 1 and it's active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Rename the child_reaper() function to task_child_reaper() to be similar to
    other task_* functions and to distinguish the function from 'struct
    pid_namspace.child_reaper'.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • The set of functions process_session, task_session, process_group and
    task_pgrp is confusing, as the names can be mixed with each other when looking
    at the code for a long time.

    The proposals are to
    * equip the functions that return the integer with _nr suffix to
    represent that fact,
    * and to make all functions work with task (not process) by making
    the common prefix of the same name.

    For monotony the routines signal_session() and set_signal_session() are
    replaced with task_session_nr() and set_task_session(), especially since they
    are only used with the explicit task->signal dereference.

    Signed-off-by: Pavel Emelianov
    Acked-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • Remove the filesystem support logic from the cpusets system and makes cpusets
    a cgroup subsystem

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This adds the necessary hooks to the fork() and exit() paths to ensure
    that new children inherit their parent's cgroup assignments, and that
    exiting processes release reference counts on their cgroups.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

10 commits

  • robust_list, compat_robust_list, pi_state_list, pi_state_cache are
    really used if futexes are on.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks
    if an rt-prio execer shares the same cpu with ->group_leader. Change the code
    to use ->group_exit_task/notify_count mechanics.

    This patch certainly uglifies the code, perhaps someone can suggest something
    better.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The child was found on ->children list under tasklist_lock, it must have a
    valid ->signal. __exit_signal() both removes the task from parent->children
    and clears ->signal "atomically" under write_lock(tasklist).

    Remove unneeded checks.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The "p->exit_signal == -1 && p->ptrace == 0" check and the comment are
    bogus. We already did exactly the same check in eligible_child(), we did
    not drop tasklist_lock since then, and both variables need
    write_lock(tasklist) to be changed.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ->siglock provides enough protection to iterate over the thread group.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Two threads, T1 and T2. T2 ptraces P, and P is not a child of ptracer's
    thread group. P exits and goes to TASK_ZOMBIE.

    T1 does wait_task_zombie(P):

    P->exit_state = TASK_DEAD;
    ...
    read_unlock(&tasklist_lock);

    T2 does exit(), takes tasklist,
    forget_original_parent() does
    __ptrace_unlink(P) but doesn't
    call do_notify_parent(P) because
    p->exit_state == EXIT_DEAD.

    Now, P is not visible to our process: __ptrace_unlink() removed it from
    ->children. We should send notification to P->parent and release P if and
    only if SIGCHLD is ignored.

    And we have 3 bugs:

    1. P->parent does do_wait() and gets -ECHILD (P is on ->parent->children,
    but its state is TASK_DEAD).

    2. // wait_task_zombie() continues

    if (put_user(...)) {
    // TODO: is this safe?
    p->exit_state = EXIT_ZOMBIE;
    return;
    }

    we return without notification/release, task_struct leaked.

    Solution: ignore -EFAULT and proceed. It is an application's bug if
    we can't fill infop/stat_addr (in case of VM_FAULT_OOM we have much
    more problems).

    3. // wait_task_zombie() continues

    if (p->real_parent != p->parent) {
    // Not taken, it was untraced'ed
    ...
    }

    release_task(p);

    we released the task which we shouldn't.

    Solution: check ->real_parent != ->parent before, under tasklist_lock,
    but use ptrace_unlink() instead of __ptrace_unlink() to check ->ptrace.

    This patch hopefully solves 2 and 3, the 1st bug will be fixed later, we need
    some cleanups in forget_original_parent/reparent_thread.

    However, the first race is very unlikely and not critical, so I hope it makes
    sense to fix 1 and 2 for now.

    4. Small cleanup: don't "restore" EXIT_ZOMBIE unless we know we are not going
    to realease the child.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A zombie must have a valid ->signal, we are going to release it and
    __exit_signal() starts with BUG_ON(!sig).

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • With or without this patch, multi-threaded init's are not fully supported,
    but do_exit() is completely wrong. This becomes a real problem when we
    support pid namespaces.

    1. do_exit() panics when the main thread of /sbin/init exits. It should not
    until the whole thread group exits. Move the code below, under the
    "if (group_dead)" check.

    Note: this means that forget_original_parent() can use an already dead
    child_reaper()'s task_struct. This is OK for /sbin/init because

    - do_wait() from alive sub-thread still can reap a zombie, we iterate
    over all sub-thread's ->children lists

    - do_notify_parent() will wakeup some alive sub-thread because it sends
    the group-wide signal

    However, we should remove choose_new_parent()->BUG_ON(reaper->exit_state)
    for this.

    2. We are playing games with ->nsproxy->pid_ns. This code is bogus today, and
    it has to be changed anyway when we really support pid namespaces, just
    remove it.

    Signed-off-by: Oleg Nesterov
    Roland McGrath
    Cc: "Eric W. Biederman"
    Cc: Sukadev Bhattiprolu
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It is a bit annoying that do_exit() takes ->pi_lock to set PF_EXITING. All
    we need is to synchronize with lookup_pi_state() which saw this task
    without PF_EXITING under ->pi_lock.

    Change do_exit() to use spin_unlock_wait().

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch cleans up duplicate includes in
    kernel/

    Signed-off-by: Jesper Juhl
    Acked-by: Paul E. McKenney
    Reviewed-by: Satyam Sharma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     

15 Oct, 2007

1 commit


21 Sep, 2007

1 commit

  • This simplifies signalfd code, by avoiding it to remain attached to the
    sighand during its lifetime.

    In this way, the signalfd remain attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows to remove
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting time ago.

    The external effect of this, is that a thread can extract only its own
    private signals and the group ones. I think this is an acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch w/out signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

31 Aug, 2007

1 commit

  • taskstats.ac_exitcode is assigned to task_struct.exit_code in bacct_add_tsk()
    through the following kernel function calls:

    do_exit()
    taskstats_exit()
    fill_pid()
    bacct_add_tsk()

    The problem is that in do_exit(), task_struct.exit_code is set to 'code' only
    after taskstats_exit() has been called. So we need to move the assignment
    before taskstats_exit().

    Signed-off-by: Jonathan Lim
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Lim
     

04 Aug, 2007

1 commit

  • There is a couple of subtle checks which were needed to handle ptracing from
    the same thread group. This was deprecated a long ago, imho this code just
    complicates the understanding.

    And, the "->parent->signal->flags & SIGNAL_GROUP_EXIT" check in exit_notify()
    is not right. SIGNAL_GROUP_EXIT can mean exec(), not exit_group(). This means
    ptracer can lose a ptraced zombie on exec(). Minor problem, but still the bug.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 Jul, 2007

1 commit

  • Kernel threads should not have TIF_FREEZE set when user space processes are
    being frozen, since otherwise some of them might be frozen prematurely.
    To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
    unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
    check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
    when user space processes are to be frozen.

    Namely, when user space processes are being frozen, we only should set
    TIF_FREEZE for tasks that have p->mm different from NULL and don't have
    PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to
    prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
    p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we
    need to prevent the following scenario from happening:

    * daemonize() is called by a task spawned from a user space code path
    * freezer checks if the task has p->mm set and the result is positive
    * task enters exit_mm() and clears its TIF_FREEZE
    * freezer sets TIF_FREEZE for the task
    * task calls try_to_freeze() and goes to the refrigerator, which is wrong at
    that point

    This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
    p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
    out that TIF_FREEZE should not be set).

    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Cc: Nigel Cunningham
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

18 Jul, 2007

1 commit

  • Currently, the freezer treats all tasks as freezable, except for the kernel
    threads that explicitly set the PF_NOFREEZE flag for themselves. This
    approach is problematic, since it requires every kernel thread to either
    set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
    care for the freezing of tasks at all.

    It seems better to only require the kernel threads that want to or need to
    be frozen to use some freezer-related code and to remove any
    freezer-related code from the other (nonfreezable) kernel threads, which is
    done in this patch.

    The patch causes all kernel threads to be nonfreezable by default (ie. to
    have PF_NOFREEZE set by default) and introduces the set_freezable()
    function that should be called by the freezable kernel threads in order to
    unset PF_NOFREEZE. It also makes all of the currently freezable kernel
    threads call set_freezable(), so it shouldn't cause any (intentional)
    change of behaviour to appear. Additionally, it updates documentation to
    describe the freezing of tasks more accurately.

    [akpm@linux-foundation.org: build fixes]
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Cc: Pavel Machek
    Cc: Oleg Nesterov
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

17 Jul, 2007

2 commits

  • Add TTY input auditing, used to audit system administrator's actions. This is
    required by various security standards such as DCID 6/3 and PCI to provide
    non-repudiation of administrator's actions and to allow a review of past
    actions if the administrator seems to overstep their duties or if the system
    becomes misconfigured for unknown reasons. These requirements do not make it
    necessary to audit TTY output as well.

    Compared to an user-space keylogger, this approach records TTY input using the
    audit subsystem, correlated with other audit events, and it is completely
    transparent to the user-space application (e.g. the console ioctls still
    work).

    TTY input auditing works on a higher level than auditing all system calls
    within the session, which would produce an overwhelming amount of mostly
    useless audit events.

    Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs
    by process with the attribute is sent to the audit subsystem by the kernel.
    The audit netlink interface is extended to allow modifying the audit_tty
    attribute, and to allow sending explanatory audit events from user-space (for
    example, a shell might send an event containing the final command, after the
    interactive command-line editing and history expansion is performed, which
    might be difficult to decipher from the TTY input alone).

    Because the "audit_tty" attribute is inherited across fork (), it would be set
    e.g. for sshd restarted within an audited session. To prevent this, the
    audit_tty attribute is cleared when a process with no open TTY file
    descriptors (e.g. after daemon startup) opens a TTY.

    See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a
    more detailed rationale document for an older version of this patch.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Miloslav Trmac
    Cc: Al Viro
    Cc: Alan Cox
    Cc: Paul Fulghum
    Cc: Casey Schaufler
    Cc: Steve Grubb
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miloslav Trmac
     
  • Add generic exit-time stack-depth checking to CONFIG_DEBUG_STACK_USAGE.

    This also adds UML support.

    Tested on UML and i386.

    [akpm@linux-foundation.org: cleanups, speedups, tweaks]
    Signed-off-by: Jeff Dike
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     

10 Jul, 2007

3 commits