17 May, 2007

1 commit


12 May, 2007

1 commit


11 May, 2007

3 commits

  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor is created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows changing the sigmask of an existing signalfd
    without going through a close/create cycle (Linus' idea). Use "ufd" == -1 if
    you want a brand new signalfd file.

    The "mask" parameter specifies the set of signals that we are interested in.
    The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    call will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can be
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    into the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0 in case the sighand structure to which the signalfd was attached
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is smaller than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is shown below; the valid fields depend on the
    (->code & __SI_MASK) value, in the same way as for a struct siginfo:

    struct signalfd_siginfo {
            __u32 signo;    /* si_signo */
            __s32 err;      /* si_errno */
            __s32 code;     /* si_code */
            __u32 pid;      /* si_pid */
            __u32 uid;      /* si_uid */
            __s32 fd;       /* si_fd */
            __u32 tid;      /* si_tid */
            __u32 band;     /* si_band */
            __u32 overrun;  /* si_overrun */
            __u32 trapno;   /* si_trapno */
            __s32 status;   /* si_status */
            __s32 svint;    /* si_int */
            __u64 svptr;    /* si_ptr */
            __u64 utime;    /* si_utime */
            __u64 stime;    /* si_stime */
            __u64 addr;     /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • attach_pid() currently takes a pid_t and then uses find_pid() to find the
    corresponding struct pid. Sometimes we already have the struct pid. We could
    then skip find_pid() if attach_pid() took a struct pid parameter instead.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Hi,

    I have been working on some code that detects abnormal events based on audit
    system events. One kind of event that we currently have no visibility for is
    when a program terminates due to segfault - which should never happen on a
    production machine. And if it did, you'd want to investigate it. Attached is a
    patch that collects these events and sends them into the audit system.

    Signed-off-by: Steve Grubb
    Signed-off-by: Al Viro

    Steve Grubb
     

09 May, 2007

2 commits

  • When a binary format is unregistered and re-registered, register_binfmt
    fails with -EBUSY. The reason is that unregister_binfmt does not set
    fmt->next to NULL, and seeing (fmt->next != NULL), register_binfmt fails
    with -EBUSY.

    One can work around this by explicitly setting fmt->next to NULL after
    unregistering, but that is kind of unclean (one should use only the
    interfaces, and not the internal members, shouldn't one?)

    The attached one-liner fixes it.
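
The failure mode can be modelled in user space. The sketch below is not the
kernel code: struct fmt, register_fmt() and unregister_fmt() are simplified
stand-ins invented for illustration, and the clear_next flag exists only to
compare the behaviour with and without the fix.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for struct linux_binfmt: only the list link matters. */
struct fmt {
    struct fmt *next;
};

static struct fmt *formats;     /* head of the singly linked format list */

/* Mirrors the check described above: a non-NULL ->next means "busy". */
int register_fmt(struct fmt *f)
{
    struct fmt **tmp = &formats;

    if (!f)
        return -EINVAL;
    if (f->next)
        return -EBUSY;
    while (*tmp) {
        if (*tmp == f)
            return -EBUSY;
        tmp = &(*tmp)->next;
    }
    f->next = formats;
    formats = f;
    return 0;
}

/* clear_next == 0 models the old behaviour, clear_next == 1 the fix. */
int unregister_fmt(struct fmt *f, int clear_next)
{
    struct fmt **tmp = &formats;

    while (*tmp) {
        if (*tmp == f) {
            *tmp = f->next;
            if (clear_next)
                f->next = NULL;     /* the one-line fix */
            return 0;
        }
        tmp = &(*tmp)->next;
    }
    return -EINVAL;
}
```

With two formats registered, unregistering the most recent one without
clearing ->next leaves a stale pointer behind, so registering it again returns
-EBUSY; with the pointer cleared, re-registration succeeds.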

    Signed-off-by: Kalash Nainwal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    kalash nainwal
     
  • Petr Tesarik discovered a problem in remove_arg_zero(). He writes:

    When a script is loaded, load_script() replaces argv[0] with the
    name of the interpreter and the filename passed to the exec syscall.
    However, there is no guarantee that the length of the interpreter
    name plus the length of the filename is greater than the length of
    the original argv[0]. If the difference happens to cross a page boundary,
    setup_arg_pages() will call put_dirty_page() [aka install_arg_page()]
    with an address outside the VMA.

    Therefore, remove_arg_zero() must free all pages which would be unused
    after the argument is removed.

    So, rewrite the remove_arg_zero function without gotos, with a few comments,
    and with the commonly used explicit index/offset. This fixes the problem
    and makes it easier to understand as well.

    [a.p.zijlstra@chello.nl: add comment]
    Signed-off-by: Nick Piggin
    Cc: Petr Tesarik
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Apr, 2007

1 commit

  • The patch checks for "|" in the pattern, not in the output, and doesn't
    append a pid to a piped name (as it is a program name, not a file).

    Also fixes a very very obscure security corner case. If you happen to have
    decided on a core pattern that starts with the program name then the user
    can run a program called "|myevilhack" as it stands. I doubt anyone does
    this.
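
To see why the place of the check matters, here is a small user-space sketch.
It is an illustration only: expand_pattern() handles just the %e specifier and
is not the kernel's core name formatting code, and all function names are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Expand only "%e" (executable name); every other character is copied. */
void expand_pattern(const char *pattern, const char *exe,
                    char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return;
    while (*pattern && used + 1 < outsz) {
        if (pattern[0] == '%' && pattern[1] == 'e') {
            size_t room = outsz - used;
            int n = snprintf(out + used, room, "%s", exe);

            if (n < 0)
                break;
            /* snprintf() stores at most room - 1 characters. */
            used += ((size_t)n < room) ? (size_t)n : room - 1;
            pattern += 2;
        } else {
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
}

/* Old-style check: looks at the expanded name, which the user can influence. */
int is_pipe_by_output(const char *pattern, const char *exe)
{
    char name[128];

    expand_pattern(pattern, exe, name, sizeof(name));
    return name[0] == '|';
}

/* Check on the administrator-controlled pattern itself. */
int is_pipe_by_pattern(const char *pattern)
{
    return pattern[0] == '|';
}
```

With the pattern "%e.core" and a program named "|myevilhack", the output-based
check would wrongly treat the core name as a pipe, while the pattern-based
check does not.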

    Signed-off-by: Alan Cox
    Confirmed-by: Christopher S. Aker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

12 Feb, 2007

1 commit

  • Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
    corresponding "kmem_cache_zalloc()" call.
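
A user-space analogy of the transformation, using malloc()/memset() versus
calloc() in place of the kernel slab calls. The struct and function names are
made up for the example; the kernel API itself is not used here.

```c
#include <stdlib.h>
#include <string.h>

struct obj {
    int id;
    char name[16];
};

/* Before: allocate, then clear in a separate step. */
struct obj *alloc_then_clear(void)
{
    struct obj *p = malloc(sizeof(*p));

    if (p)
        memset(p, 0, sizeof(*p));
    return p;
}

/* After: a single call that returns zeroed memory. */
struct obj *alloc_zeroed(void)
{
    return calloc(1, sizeof(struct obj));
}

/* Returns 1 if every byte of the object is zero. */
int is_zeroed(const struct obj *p)
{
    static const struct obj zero;

    return memcmp(p, &zero, sizeof(zero)) == 0;
}
```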

    Signed-off-by: Robert P. J. Day
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: Roland McGrath
    Cc: James Bottomley
    Cc: Greg KH
    Acked-by: Joel Becker
    Cc: Steven Whitehouse
    Cc: Jan Kara
    Cc: Michael Halcrow
    Cc: "David S. Miller"
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     

11 Dec, 2006

1 commit

  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and should keep the capacities of the fdarray and the fdsets equal.
    This patch removes fdtable->max_fdset. As an added bonus, most of the
    supporting code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

2 commits

  • Add a per pid_namespace child-reaper. This is needed so processes are reaped
    within the same pid space and do not spill over to the parent pid space. It's
    also needed so that containers preserve the existing semantics in which
    pid == 1 reaps orphaned children.

    This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.
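
A compact sketch of the idea, with stub types standing in for the real kernel
structures. The empty-looking struct dentry and struct vfsmount, and the
helper old_and_new_names_agree(), are assumptions made for this example; the
two macros show how old-style accessors can keep compiling during the
transition.

```c
/* Stub types standing in for the real kernel structures. */
struct dentry { int unused; };
struct vfsmount { int unused; };

/* The pair that used to be two independent pointers in struct file. */
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

struct file {
    struct path f_path;
};

/* Transitional aliases: old member names map onto the embedded path. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt

/* Returns 1 if the old and new spellings refer to the same members. */
int old_and_new_names_agree(void)
{
    struct dentry d = { 0 };
    struct vfsmount m = { 0 };
    struct file f;

    f.f_path.dentry = &d;
    f.f_path.mnt = &m;

    return f.f_dentry == &d && f.f_vfsmnt == &m;
}
```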

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

08 Dec, 2006

2 commits

  • On Sat, Dec 02, 2006 at 11:47:44PM +0300, Alexey Dobriyan wrote:
    > David Binderman compiled 2.6.19 with icc and grepped for "was set but never
    > used". Many warnings are on
    > http://coderock.org/kj/unused-2.6.19-fs

    Heh, the very first line:
    fs/exec.c(1465): remark #593: variable "flag" was set but never used

    fs/exec.c:
    1477 /*
    1478 * We cannot trust fsuid as being the "true" uid of the
    1479 * process nor do we know its entire history. We only know it
    1480 * was tainted so we dump it as root in mode 2.
    1481 */
    1482 if (mm->dumpable == 2) { /* Setuid core dump mode */
    1483 flag = O_EXCL; /* Stop rewrite attacks */
    1484 current->fsuid = 0; /* Dump root private */
    1485 }

    And then filp_open follows with "flag" totally ignored.

    (akpm: this restores the code to Alan's original version. Andi's "Support
    piping into commands in /proc/sys/kernel/core_pattern" (cset d025c9db) broke
    it).

    Cc: Alan Cox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Oct, 2006

1 commit

  • Replace references to system_utsname with the per-process uts namespace
    where appropriate. This includes things like uname.

    Changes: Per Eric Biederman's comments, use the per-process uts namespace
    for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c

    [jdike@addtoit.com: UML fix]
    [clg@fr.ibm.com: cleanup]
    [akpm@osdl.org: build fix]
    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Cedric Le Goater
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

01 Oct, 2006

2 commits

  • Using the infrastructure created in previous patches, implement support for
    piping core dumps into programs.

    This is done by overloading the existing core_pattern sysctl
    with a new syntax:

    |program

    When the first character of the pattern is a '|' the kernel will instead
    treat the rest of the pattern as a command to run. The core dump will be
    written to the standard input of that program instead of to a file.

    This is useful for having automatic core dump analysis without filling up
    disks. The program can do some simple analysis and save only a summary of
    the core dump.

    The core dump process will run with the privileges and in the name space of
    the process that caused the core dump.

    I also increased the core pattern size to 128 bytes so that longer command
    lines fit.

    Most of the changes come from allowing core dumps without seeks. They are
    fairly straightforward though.

    One small incompatibility is that if someone previously had a core pattern
    that started with '|' they will suddenly get new behaviour. I think that's
    unlikely to be a real problem though.

    Additional background:

    > Very nice, do you happen to have a program that can accept this kind of
    > input for crash dumps? I'm guessing that the embedded people will
    > really want this functionality.

    I had a cheesy demo/prototype. Basically it wrote the dump to a file again,
    ran gdb on it to get a backtrace and wrote the summary to a shared directory.
    Then there was a simple CGI script to generate a "top 10" crashes HTML
    listing.

    Unfortunately this still had the disadvantage of needing full disk space for
    a dump, even though it was deleted afterwards (in fact it was worse because
    holes didn't work over the pipe, so if you have a holey address map it would
    require more space).

    Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
    cores (at least it worked with zsh's =(cat core) syntax), so it would
    likely be possible to do it without temporary space with a simple wrapper that
    calls it in the right way. I ran out of time before doing that though.

    The demo prototype scripts weren't very good. If there is really interest I
    can dig them out (they are currently on a laptop disk on the desk with the
    laptop itself being in service), but I would recommend rewriting them for any
    serious application of this and fixing the disk space problem.

    Also to be really useful it should probably find a way to automatically fetch
    the debuginfos (I cheated and just installed them in advance). If nobody else
    does it I can probably do the rewrite myself again at some point.

    My hope at some point was that desktops would support it in their builtin
    crash reporters, but at least the KDE people I talked to seemed to be happy
    with their user space only solution.

    Alan sayeth:

    I don't believe that piping as such is necessarily the right model, but
    the ability to intercept and process core dumps from user space is asked
    for by many enterprise users as well. They want to know about, capture,
    analyse and process core dumps, often centrally and in automated form.

    [akpm@osdl.org: loff_t != unsigned long]
    Signed-off-by: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • There were a few accounting data/macros that are used in CSA but are #ifdef'ed
    inside CONFIG_BSD_PROCESS_ACCT. This patch changes those ifdefs from
    CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT. A few defines are moved from
    kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
    include/linux/tsacct_kern.h.

    Signed-off-by: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Jes Sorensen
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jay Lan
     

30 Sep, 2006

1 commit

  • Fixed race on put_files_struct on exec with proc. Restoring files on
    current on error path may lead to proc having a pointer to already kfree-d
    files_struct.

    The ->files changes in exit.c and kthread.c are safe, as exit_files() does
    everything under the lock.

    Found during OpenVZ stress testing.

    [akpm@osdl.org: add export]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

27 Sep, 2006

2 commits

  • Ingo Oeser pointed out that because current expands to an inline function
    it is more space efficient and somewhat faster to simply keep a cached copy
    of current in another variable. This patch implements that for the
    de_thread function.

    (akpm: saves nearly 100 bytes of text on x86)

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • In de_thread we move pids from one process to another, a rather ugly case.
    The function transfer_pid makes it clear what we are doing, and makes the
    action atomic. This is useful if we ever want to atomically traverse the
    process group and session lists in an RCU-safe manner.

    Even without the atomic properties this change should be a win, as
    transfer_pid should be less code to execute than executing both attach_pid
    and detach_pid, and this should make de_thread slightly smaller as only a
    single function call needs to be emitted. The only downside is that the
    code might be slower to execute as the odds are against transfer_pid being
    in cache.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

28 Aug, 2006

1 commit

  • This fixes the locking error noticed by lockdep:

    =============================================
    [ INFO: possible recursive locking detected ]
    ---------------------------------------------
    init/1 is trying to acquire lock:
    (&sighand->siglock){....}, at: [] flush_old_exec+0x3ae/0x859

    but task is already holding lock:
    (&sighand->siglock){....}, at: [] flush_old_exec+0x39e/0x859

    other info that might help us debug this:
    2 locks held by init/1:
    #0: (tasklist_lock){..--}, at: [] flush_old_exec+0x38e/0x859
    #1: (&sighand->siglock){....}, at: [] flush_old_exec+0x39e/0x859

    stack backtrace:
    [] show_trace_log_lvl+0x54/0xfd
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] __lock_acquire+0x773/0x997
    [] lock_acquire+0x4b/0x6c
    [] _spin_lock+0x19/0x28
    [] flush_old_exec+0x3ae/0x859
    [] load_elf_binary+0x4aa/0x1628
    [] search_binary_handler+0xa7/0x24e
    [] do_execve+0x15b/0x1f9
    [] sys_execve+0x29/0x4d
    [] syscall_call+0x7/0xb

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

25 Aug, 2006

2 commits


01 Jul, 2006

1 commit


27 Jun, 2006

8 commits

  • This patch optimizes zap_threads() for the case when there are no ->mm
    users except the current's thread group. In that case we can avoid the
    'for_each_process()' loop.

    It also adds a useful invariant: SIGNAL_GROUP_EXIT (if checked under
    ->siglock) always implies that all threads (except maybe current) have a
    pending SIGKILL.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This is a preparation for the next patch. No functional changes.
    Basically, this patch moves '->flags & SIGNAL_GROUP_EXIT' check into
    zap_threads(), and 'complete(vfork_done)' into coredump_wait outside of
    ->mmap_sem protected area.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch removes tasklist_lock from zap_threads().
    This is safe wrt:

    do_exit:
    The caller holds mm->mmap_sem. This means that a task which
    shares the same ->mm can't pass exit_mm(), so it can't be
    unhashed from init_task.tasks or ->thread_group lists.

    fork:
    None of the sub-threads can fork after zap_process(leader). All
    processes which were created before this point should be
    visible to zap_threads() because copy_process() adds the new
    process to the tail of the init_task.tasks list, and ->siglock
    lock/unlock provides a memory barrier.

    de_thread:
    It does list_replace_rcu(&leader->tasks, &current->tasks).
    So zap_threads() will see either old or new leader, it does
    not matter. However, it can change p->sighand, so we should
    use lock_task_sighand() in zap_process().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to
    the thread group. This means that a TASK_TRACED task

    1. Will be awakened by signal_wake_up(1)

    2. Can't sleep again via ptrace_notify()

    3. Can't go to do_signal_stop() after return
    from ptrace_stop() in get_signal_to_deliver()

    So we can remove all ptrace related stuff from coredump path.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • With this patch a thread group is killed atomically under ->siglock. This is
    faster because we can use sigaddset() instead of force_sig_info() and this is
    used in further patches.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • zap_threads() iterates over all threads to find those which share
    current->mm. All threads in the thread group share the same ->mm, so we can
    skip the entire thread group if it has another ->mm.

    This patch shifts the killing of a thread group into the newly added
    zap_process() function. This looks like an unnecessary complication, but it
    is used in further patches.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We should keep the value of old_leader->tasks.next in de_thread, otherwise
    we can't do for_each_process/do_each_thread without tasklist_lock held.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • To keep the dcache from filling up with dead /proc entries we flush them on
    process exit. However over the years that code has gotten hairy with a
    dentry_pointer and a lock in task_struct and misdocumented as a correctness
    feature.

    I have rewritten this code to look and see if we have a corresponding entry in
    the dcache and if so flush it on process exit. This removes the extra fields
    in the task_struct and allows me to trivially handle the case of a
    /proc//task/ entry as well as the current /proc/ entries.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

23 Jun, 2006

1 commit

  • This patch removes the steal_locks() function.

    steal_locks() doesn't work correctly with any filesystem that does its own
    lock management, including NFS, CIFS, etc.

    In addition it has weird semantics on local filesystems in case tasks
    sharing file-descriptor tables are doing POSIX locking operations in
    parallel to execve().

    The steal_locks() function has an effect on applications doing:

    clone(CLONE_FILES)
    /* in child */
    lock
    execve
    lock

    POSIX locks acquired before execve (by "child", "parent" or any further
    task sharing files_struct) will after the execve be owned exclusively by
    "child".

    According to Chris Wright some LSB/LTP kind of suite triggers without the
    stealing behavior, but there's no known real-world application that would
    also fail.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

    - have multiple threads during exec (LinuxThreads doesn't kill other
    threads, the app may do it with pthread_kill_other_threads_np())
    - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

    - use clone(CLONE_FILES)
    - rely on POSIX locks being inherited across exec

    The above scenarios are unlikely, but possible.

    If the patch is vetoed, there's a plan B, that involves mostly keeping the
    weird stealing semantics, but changing the way lock ownership is handled so
    that network and local filesystems work consistently.

    That would add more complexity though, so this solution seems to be
    preferred by most people.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: Matthew Wilcox
    Cc: Chris Wright
    Cc: Christoph Hellwig
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

20 Jun, 2006

1 commit


20 Apr, 2006

1 commit

  • While we can currently walk through thread groups, process groups, and
    sessions with just the rcu_read_lock, this opens the door to walking the
    entire task list.

    We already have all of the other RCU guarantees so there is no cost in
    doing this. This should be enough so that proc can stop taking the
    tasklist lock during readdir.

    prev_task was killed because it has no users, and using it will miss new
    tasks when doing an rcu traversal.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

14 Apr, 2006

1 commit

  • This is two distinct changes.
    - Not changing our real parents.
    - Not changing our ptrace parents.

    Not changing our real parents is trivially correct because both tasks
    have the same real parents as they are part of a thread group. Now that
    we demote the leader to a thread there is no longer any reason to change
    its parentage.

    Not changing our ptrace parents is a user visible change if someone
    looks hard enough. I don't think user space applications will care or
    even notice.

    In the practical and I think common case a debugger will have attached
    to all of the threads using the same ptrace flags. From my quick skim
    of strace and gdb that appears to be the case. Which if true means
    debuggers will not notice a change.

    Before this point we have already generated a ptrace event in do_exit
    that reports the leader's pid has died, so de_thread is visible to a
    debugger. This means attempting to hide this case by copying flags
    around appears excessive.

    By not doing anything it avoids all of the weird locking issues between
    de_thread and ptrace attach, and removes one case from consideration for
    fixing the ptrace locking.

    This only addresses Oleg's first concern with ptrace_attach, that of the
    problems caused by reparenting. Oleg's second concern is essentially a
    race between ptrace_attach and release_task that causes an oops when we
    get to force_sig_specific. There is nothing special about de_thread
    with respect to that race.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

11 Apr, 2006

2 commits

  • The only record we have of the real-time age of a process, regardless of
    execs it's done, is start_time. When a non-leader thread execs, the
    original start_time of the process is lost. Things looking at the
    real-time age of the process are fooled, for example the process accounting
    record when the process finally dies. This change makes the oldest
    start_time stick around with the process after a non-leader exec. This way
    the association between PID and start_time is kept constant, which seems
    correct to me.

    Signed-off-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Oleg Nesterov spotted two interesting bugs with the current de_thread
    code. The simplest is a long standing double decrement of
    __get_cpu_var(process_counts) in __unhash_process, caused by
    two processes exiting when only one was created.

    The other is that since we no longer detach from the thread_group list
    it is possible for do_each_thread when run under the tasklist_lock to
    see the same task_struct twice. Once on the task list as a
    thread_group_leader, and once on the thread list of another
    thread.

    The double appearance in do_each_thread can cause a double increment
    of mm_core_waiters in zap_threads resulting in problems later on in
    coredump_wait.

    To remedy those two problems this patch takes the simple approach
    of changing the old thread group leader into a child thread.
    The only routine in release_task that cares is __unhash_process,
    and it can be trivially seen that we handle cleaning up a
    thread group leader properly.

    Since de_thread doesn't change the pid of the exiting leader process
    and instead shares it with the new leader process, I change
    thread_group_leader to recognize group leadership based on the
    group_leader field and not based on pids. This should also be
    slightly cheaper than the existing thread_group_leader macro.

    I performed a quick audit and I couldn't see any user of
    thread_group_leader that cared about the difference.
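
The difference between the two leadership tests can be sketched in user space.
The struct task below is a cut-down stand-in for task_struct, and both helper
names are invented for this illustration; only the two comparisons reflect
what the message describes.

```c
/* Cut-down stand-in for task_struct with only the fields that matter here. */
struct task {
    int pid;
    int tgid;
    struct task *group_leader;
};

/* Pid-based test: leader if the task's pid equals the thread group id. */
int leader_by_pid(const struct task *p)
{
    return p->pid == p->tgid;
}

/* Pointer-based test: leader if the task is its own group_leader. */
int leader_by_pointer(const struct task *p)
{
    return p == p->group_leader;
}

/*
 * Models the state after de_thread: the old leader keeps its pid while the
 * exec'ing thread takes over leadership and shares that pid.
 * Returns 1 if only the pointer-based test tells the two tasks apart.
 */
int pointer_test_distinguishes_leaders(void)
{
    struct task new_leader = { 100, 100, 0 };
    struct task old_leader = { 100, 100, 0 };

    new_leader.group_leader = &new_leader;
    old_leader.group_leader = &new_leader;

    return leader_by_pid(&old_leader) == 1 &&
           leader_by_pid(&new_leader) == 1 &&
           leader_by_pointer(&old_leader) == 0 &&
           leader_by_pointer(&new_leader) == 1;
}
```

The pointer comparison also avoids loading and comparing two integer fields,
which is in line with the "slightly cheaper" remark above.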

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

01 Apr, 2006

1 commit


29 Mar, 2006

1 commit

  • This patch borrows Hugh's clever 'struct anon_vma' trick.

    Without tasklist_lock held we can't trust task->sighand until we locked it
    and re-checked that it is still the same.

    But this means we don't need to defer 'kmem_cache_free(sighand)'. We can
    return the memory to slab immediately; all we need is to be sure that
    sighand->siglock can't disappear inside an RCU-protected section.

    To do so we need to initialize ->siglock inside ctor function,
    SLAB_DESTROY_BY_RCU does the rest.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov