03 Apr, 2006

1 commit

  • It doesn't make the splice itself necessarily nonblocking (because the
    actual file descriptors that are spliced from/to may block unless they
    have the O_NONBLOCK flag set), but it makes the splice pipe operations
    nonblocking.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
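
A minimal userspace sketch of the distinction drawn above, assuming the flag is the one glibc exposes today as SPLICE_F_NONBLOCK. The helper name drain_pipe_nonblock is hypothetical. Only the pipe side is made nonblocking here; out_fd can still block unless it was opened with O_NONBLOCK.

```c
#define _GNU_SOURCE            /* splice() is a Linux-specific glibc extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from the read end of a pipe
 * into out_fd without sleeping on the pipe. Returns the number of bytes
 * moved, 0 if the pipe had nothing to give (or hit EOF), -1 on other errors.
 */
static ssize_t drain_pipe_nonblock(int pipe_rd, int out_fd, size_t len)
{
    ssize_t n;

    assert(pipe_rd >= 0 && out_fd >= 0);
    n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_NONBLOCK);
    if (n < 0 && errno == EAGAIN)
        return 0;   /* empty pipe: a blocking splice would have slept here */
    return n;
}
```

With an empty pipe whose write end is still open, the call returns at once instead of waiting for data.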
     

01 Apr, 2006

24 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
    [NET]: Allow skb headroom to be overridden
    [TCP]: Kill unused extern decl for tcp_v4_hash_connecting()
    [NET]: add SO_RCVBUF comment
    [NET]: Deinline some larger functions from netdevice.h
    [DCCP]: Use NULL for pointers, comfort sparse.
    [DECNET]: Fix refcount

    Linus Torvalds
     
  • As announced, lookup_hash() can now become static.

    Signed-off-by: Adrian Bunk
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The monochrome->color expansion routine that handles bitmaps which have
    (widths % 8) != 0 (slow_imageblit) produces corrupt characters in big-endian.
    This is caused by a bogus bit test in slow_imageblit().

    Fix.

    This patch may deserve to go to the stable tree. The code has already been
    well tested in little-endian machines. It's only in big-endian where there is
    uncertainty and Herbert confirmed that this is the correct way to go.

    It should not introduce regressions.

    Signed-off-by: Antonino Daplas
    Acked-by: Herbert Poetzl
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Antonino A. Daplas
     
  • Backlight class attributes are currently easy to implement incorrectly.
    Moving certain handling into the backlight core prevents this whilst at the
    same time makes the drivers simpler and consistent. The following changes are
    included:

    The brightness attribute only sets and reads the brightness variable in the
    backlight_properties structure.

    The power attribute only sets and reads the power variable in the
    backlight_properties structure.

    Any framebuffer blanking events change a variable fb_blank in the
    backlight_properties structure.

    The backlight driver has only two functions to implement. One function is
    called when any of the above properties change (to update the backlight
    brightness), the second is called to return the current backlight brightness
    value. A new attribute "actual_brightness" is added to return this brightness
    as determined by the driver having combined all the above factors (and any
    driver/device specific factors).

    Additionally, the backlight core takes care of checking the maximum brightness
    is not exceeded and of turning off the backlight before device removal.

    The corgi backlight driver is updated to reflect these changes.

    Signed-off-by: Richard Purdie
    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Purdie
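
The split between core and driver described above can be modelled in plain C. This is a simplified sketch, not the kernel code: the field and callback names only loosely follow the commit text, and core_set_brightness, demo_update_status, demo_get_brightness and hw_level are hypothetical.

```c
#include <assert.h>

struct backlight_device;

/* Simplified model of the properties structure described in the commit. */
struct backlight_properties {
    int (*update_status)(struct backlight_device *);   /* called when any property changes */
    int (*get_brightness)(struct backlight_device *);  /* reports the level actually in effect */
    int brightness;
    int max_brightness;
    int power;      /* 0 means "on" in this sketch */
    int fb_blank;   /* 0 means "unblanked" in this sketch */
};

struct backlight_device {
    struct backlight_properties *props;
    int hw_level;   /* hypothetical stand-in for a hardware register */
};

/* Core side: range-check, store the variable, then notify the driver. */
static int core_set_brightness(struct backlight_device *bd, int value)
{
    assert(bd && bd->props);
    if (value < 0 || value > bd->props->max_brightness)
        return -1;
    bd->props->brightness = value;
    return bd->props->update_status(bd);
}

/* Driver side: combine all factors into the level that is programmed. */
static int demo_update_status(struct backlight_device *bd)
{
    const struct backlight_properties *p = bd->props;

    bd->hw_level = (p->power != 0 || p->fb_blank != 0) ? 0 : p->brightness;
    return 0;
}

/* Driver side: what "actual_brightness" would report. */
static int demo_get_brightness(struct backlight_device *bd)
{
    return bd->hw_level;
}
```

The driver never parses attribute input or checks limits itself; it only turns the stored properties into a hardware level and reports that level back.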
     
  • It is very common to hash a dentry and then to call lookup. If we take fs
    specific hash functions into account the full hash logic can get ugly.
    Further full_name_hash as an inline function is almost 100 bytes on x86 so
    having a non-inline choice in some cases can measurably decrease code size.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Simplifies the code, reduces the need for 4 pid hash tables, and makes the
    code more capable.

    In the discussions I had with Oleg it was felt that to a large extent the
    cleanup itself justified the work. Having struct pid dynamically
    allocated meant we could create the hash table entry when the pid was
    allocated and free the hash table entry when the pid was freed, instead of
    playing with the hash lists whenever a process would attach or detach to a
    process.

    For myself the fact that it gave what my previous task_ref patch gave for free
    with simpler code was a big win. The problem is that if you hold a reference
    to struct task_struct you lock in 10K of low memory. If you do that in a user
    controllable way like /proc does, with an unprivileged but hostile user space
    application with typical resource limits of 1000 fds and 100 processes I can
    trigger the OOM killer by consuming all of low memory with task structs, on a
    machine with 1GB of low memory.

    If I instead hold a reference to struct pid which holds a pointer to my
    task_struct, I don't suffer from that problem because struct pid is 2 orders
    of magnitude smaller. In fact struct pid is small enough that most other
    kernel data structures dwarf it, so simply limiting the number of referring
    data structures is enough to prevent exhaustion of low memory.

    This splits the current struct pid into two structures, struct pid and struct
    pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
    struct pid_link is the per process linkage into the hash tables and lives in
    struct task_struct. struct pid is given an independent lifetime, and holds
    pointers to each of the pid types.

    The independent life of struct pid simplifies attach_pid, and detach_pid,
    because we are always manipulating the list of pids and not the hash table.
    In addition, giving struct pid an independent life makes the concept much
    more powerful.

    Kernel data structures can now embed a struct pid * instead of a pid_t and
    not suffer from pid wrap around problems or from keeping unnecessarily
    large amounts of memory allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
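
A userspace sketch of the split described above, under loud simplifications: there is no hash table or locking, the reference count is a plain int, and each struct pid tracks a single task per type rather than a list. The enum values and every helper body are illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct task;

/* Small object with its own lifetime; one per numeric pid. */
struct pid {
    int count;                        /* reference count (not atomic in this sketch) */
    int nr;                           /* numeric pid value */
    struct task *tasks[PIDTYPE_MAX];  /* simplified: one task per type */
};

/* Per-task linkage; lives inside the task structure. */
struct pid_link {
    struct pid *pid;
};

struct task {
    struct pid_link pids[PIDTYPE_MAX];
};

static struct pid *alloc_pid(int nr)
{
    struct pid *pid = calloc(1, sizeof(*pid));

    if (pid) {
        pid->count = 1;
        pid->nr = nr;
    }
    return pid;
}

static struct pid *get_pid(struct pid *pid)
{
    if (pid)
        pid->count++;
    return pid;
}

/* Returns 1 if this call dropped the last reference and freed the object. */
static int put_pid(struct pid *pid)
{
    if (pid && --pid->count == 0) {
        free(pid);
        return 1;
    }
    return 0;
}

static void attach_pid(struct task *t, enum pid_type type, struct pid *pid)
{
    t->pids[type].pid = pid;
    pid->tasks[type] = t;
}

static void detach_pid(struct task *t, enum pid_type type)
{
    struct pid *pid = t->pids[type].pid;

    assert(pid != NULL);
    pid->tasks[type] = NULL;
    t->pids[type].pid = NULL;
}

/* A holder of a struct pid * sees NULL once the task is gone. */
static struct task *pid_task(struct pid *pid, enum pid_type type)
{
    return pid ? pid->tasks[type] : NULL;
}
```

A long-lived holder keeps only the small struct pid alive. After the task detaches, pid_task() returns NULL instead of pinning the task or matching a recycled pid number.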
     
  • A big problem with rcu protected data structures that are also reference
    counted is that you must jump through several hoops to increase the reference
    count. I think someone finally implemented atomic_inc_not_zero(&count) to
    automate the common case. Unfortunately this means you must special case the
    rcu access case.

    When data structures are only visible via rcu in a manner that is not
    determined by the reference count on the object (i.e. tasks are visible until
    their zombies are reaped) there is a much simpler technique we can employ.
    Simply delaying the decrement of the reference count until the rcu interval is
    over.

    What that means is that the proc code that looks up a task and later
    wants to sleep can now do:

    rcu_read_lock();
    task = find_task_by_pid(some_pid);
    if (task) {
            get_task_struct(task);
    }
    rcu_read_unlock();

    The effect on the rest of the kernel is that put_task_struct becomes cheaper
    and immediate, and in the case where the task has been reaped it frees the
    task immediately instead of unnecessarily waiting until the rcu interval is
    over.

    Cleanup of task_struct does not happen when its reference count drops to
    zero, instead cleanup happens when release_task is called. Tasks can only
    be looked up via rcu before release_task is called. All rcu protected
    members of task_struct are freed by release_task.

    Therefore we can move call_rcu from put_task_struct into release_task. And
    we can modify release_task to not immediately release the reference count
    but instead have it call put_task_struct from the function it gives to
    call_rcu.

    The end result:

    - get_task_struct is safe in an rcu context where we have just looked
    up the task.

    - put_task_struct() simplifies into its old pre rcu self.

    This reorganization also makes put_task_struct uncallable from modules as
    it is not exported but it does not appear to be called from any modules so
    this should not be an issue, and is trivially fixed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • This just got nuked in mainline. Bring it back because Eric's patches use it.

    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • To increase the strength of SCHED_BATCH as a scheduling hint we can
    activate batch tasks on the expired array since by definition they are
    latency insensitive tasks.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • The activated flag in task_struct is used to track different sleep types and
    its usage is somewhat obfuscated. Convert the variable to an enum with more
    descriptive names without altering the function.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • Currently, count_active_tasks() calls both nr_running() &
    nr_uninterruptible(). Each of these functions does a "for_each_cpu" & reads
    values from the runqueue of each cpu. Although this is not a lot of
    instructions, each runqueue may be located on a different node. Depending on
    the architecture, a unique TLB entry may be required to access each
    runqueue.

    Since there may be more runqueues than cpu TLB entries, a scan of all
    runqueues can thrash the TLB. Each memory reference incurs a TLB miss &
    refill.

    In addition, the runqueue cacheline that contains nr_running &
    nr_uninterruptible may be evicted from the cache between the two passes.
    This causes unnecessary cache misses.

    Combining nr_running() & nr_uninterruptible() into a single function
    substantially reduces the TLB & cache misses on large systems. This should
    have no measurable effect on smaller systems.

    On a 128p IA64 system running a memory stress workload, the new function
    reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

    Signed-off-by: Jack Steiner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
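
The idea can be sketched in plain C as follows. The struct and function names here are hypothetical, and the runqueue is reduced to the two counters involved. The single-pass version touches each runqueue once and reads both values while the cacheline is hot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified per-CPU runqueue. */
struct runqueue {
    unsigned long nr_running;
    long nr_uninterruptible;   /* can be transiently negative on a single CPU */
};

/* Two-pass approach: each helper walks every runqueue on its own. */
static unsigned long sum_running(const struct runqueue *rq, size_t ncpus)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < ncpus; i++)
        sum += rq[i].nr_running;
    return sum;
}

static unsigned long sum_uninterruptible(const struct runqueue *rq, size_t ncpus)
{
    long sum = 0;

    for (size_t i = 0; i < ncpus; i++)
        sum += rq[i].nr_uninterruptible;
    return sum < 0 ? 0 : (unsigned long)sum;
}

/* Single-pass approach: one walk, both counters read per runqueue. */
static unsigned long sum_active(const struct runqueue *rq, size_t ncpus)
{
    unsigned long running = 0;
    long uninterruptible = 0;

    assert(rq != NULL || ncpus == 0);
    for (size_t i = 0; i < ncpus; i++) {
        running += rq[i].nr_running;
        uninterruptible += rq[i].nr_uninterruptible;
    }
    if (uninterruptible < 0)
        uninterruptible = 0;
    return running + (unsigned long)uninterruptible;
}
```

Both approaches return the same total. The difference is only in how many times each runqueue is visited.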
     
  • The removal of the data field in the hrtimer structure enforces the
    embedding of the timer into another data structure. nanosleep now uses a
    private implementation of the most commonly used timer callback function
    (simple task wakeup).

    In order to avoid the reimplementation of such functionality all over the
    place a generic hrtimer_sleeper functionality is created.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
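
A sketch of the embedding pattern this relies on, using simplified stand-in types rather than the real struct hrtimer and task_struct. The names struct timer, struct sleeper, sleeper_wakeup and sleeper_init are hypothetical. The callback receives only the timer and recovers its container with container_of, so no separate data pointer is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types. */
struct task {
    int woken;
};

struct timer {
    int (*function)(struct timer *);   /* callback gets only the timer */
};

/* The sleeper embeds the timer and carries the task to wake. */
struct sleeper {
    struct timer timer;
    struct task *task;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic wakeup callback: find the sleeper from the embedded timer. */
static int sleeper_wakeup(struct timer *t)
{
    struct sleeper *sl = container_of(t, struct sleeper, timer);
    struct task *task = sl->task;

    sl->task = NULL;          /* marks the sleeper as fired */
    if (task)
        task->woken = 1;
    return 0;                 /* hypothetical "do not restart" value */
}

static void sleeper_init(struct sleeper *sl, struct task *task)
{
    assert(sl != NULL);
    sl->timer.function = sleeper_wakeup;
    sl->task = task;
}
```

Any code that needs "sleep until the timer fires" can reuse the one callback instead of writing its own.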
     
  • Add an LED trigger for IDE disk activity to the ide-disk driver.

    Signed-off-by: Richard Purdie
    Acked-by: Bartlomiej Zolnierkiewicz
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Purdie
     
  • Add support for LED triggers to the LED subsystem. "Triggers" are events
    which change the state of an LED. Two kinds of trigger are available, simple
    ones which can be added to existing code with minimum disruption and complex
    ones for implementing new or more complex functionality.

    Signed-off-by: Richard Purdie
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Purdie
     
  • Add the foundations of a new LEDs subsystem. This patch adds a class which
    presents LED devices within sysfs and allows their brightness to be
    controlled.

    Signed-off-by: Richard Purdie
    Cc: Russell King
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Purdie
     
  • Add TIOCL_GETKMSGREDIRECT needed by the userland suspend tool to get the
    current value of kmsg_redirect from the kernel so that it can save it and
    restore it after resume.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is that we can
    move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: it's notable that we can sync an fd which wasn't opened for writing.
    (Same with fsync() and fdatasync()).

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
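
For illustration, a userspace sketch of calling the syscall through the sync_file_range() wrapper that glibc provides on Linux. The helper names flush_range, start_writeout and write_and_flush are hypothetical. The flag combination in flush_range is one way to get "write this range and wait for it" in a single call.

```c
#define _GNU_SOURCE          /* sync_file_range() is a Linux-specific glibc extension */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write out a byte range and wait for it; 0 on success, -1 with errno set. */
static int flush_range(int fd, off_t offset, off_t nbytes)
{
    assert(fd >= 0);
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Only start writeout of the range; do not wait for completion. */
static int start_writeout(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Write a buffer at the current offset, then flush just that range. */
static int write_and_flush(int fd, const void *buf, size_t len)
{
    off_t start = lseek(fd, 0, SEEK_CUR);

    if (start < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    return flush_range(fd, start, (off_t)len);
}
```

Note that this only covers file data in the given range. It is not a replacement for fsync() where metadata durability matters.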
     
  • Matt Domsch noticed a startup race with the IPMI kernel thread: it was
    possible (though extraordinarily unlikely) that a message could come in
    before the upper layer was ready to handle it. This patch splits the
    startup processing of an IPMI interface into two parts, one to get ready
    and one to actually start the processes to receive messages from the
    interface.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Corey Minyard
    Cc: Matt Domsch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • Make baby-simple the code for /proc/devices. Based on the proven design
    for /proc/interrupts.

    This also fixes the early-termination regression 2.6.16 introduced, as
    demonstrated by:

    # dd if=/proc/devices bs=1
    Character devices:
    1 mem
    27+0 records in
    27+0 records out

    This should also work (but is untested) when /proc/devices >4096 bytes,
    which I believe is what the original 2.6.16 rewrite fixed.

    [akpm@osdl.org: cleanups, simplifications]
    Signed-off-by: Joe Korty
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Korty
     
  • Commit a4a6198b80cf82eb8160603c98da218d1bd5e104:
    [PATCH] tvec_bases too large for per-cpu data

    introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at
    compile time. This means we can kill __init_timer_base and move
    timer_base_s's content into tvec_t_base_s.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • find_trylock_page() is an odd interface in that it doesn't take a reference
    like the others. Now that XFS no longer uses it, and its last remaining
    caller actually wants an elevated refcount, opencode that callsite and
    schedule find_trylock_page() for removal.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix migrate_pages_to() definition.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - chips/sharp.c: make two needlessly global functions static

    - move some declarations to the header file where they belong

    Signed-off-by: Adrian Bunk
    Acked-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

31 Mar, 2006

4 commits

  • Previously we added NET_IP_ALIGN so an architecture can override the
    padding done to align headers. The next step is to allow the skb
    headroom to be overridden.

    We currently always reserve 16 bytes to grow into, meaning all DMAs
    start 16 bytes into a cacheline. On ppc64 we really want DMA writes to
    start on a cacheline boundary, so we increase that headroom to one
    cacheline.

    Signed-off-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Anton Blanchard
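
A small sketch of the override pattern and of why the headroom matters for DMA alignment. The macro name SKB_HEADROOM_PAD and the helper are hypothetical, and the buffer is assumed to start on a cacheline boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical name: an architecture header would define this before
 * this point to override the default 16 bytes of headroom.
 */
#ifndef SKB_HEADROOM_PAD
#define SKB_HEADROOM_PAD 16
#endif

/*
 * For a buffer that starts on a cacheline boundary, return the offset
 * within a cacheline at which DMA begins after reserving the headroom.
 */
static size_t dma_offset_in_cacheline(size_t headroom, size_t cacheline)
{
    assert(cacheline != 0);
    return headroom % cacheline;
}
```

With the default 16 bytes and an assumed 128-byte cacheline, DMA starts 16 bytes into the line. With the headroom raised to one full cacheline, the offset becomes 0.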
     
  • * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    [PATCH] sata_mv: three bug fixes
    [PATCH] libata: ata_dev_init_params() fixes
    [PATCH] libata: Fix interesting use of "extern" and also some bracketing
    [PATCH] libata: Simplex and other mode filtering logic
    [PATCH] libata - ATA is both ATA and CFA
    [PATCH] libata: Add ->set_mode hook for odd drivers
    [PATCH] libata: BMDMA handling updates
    [PATCH] libata: kill trailing whitespace
    [PATCH] libata: add FIXME above ata_dev_xfermask()
    [PATCH] libata: cosmetic changes in ata_bus_softreset()
    [PATCH] libata: kill E.D.D.

    Linus Torvalds
     
  • This enables the caller to migrate pages from one address space page
    cache to another. In buzz word marketing, you can do zero-copy file
    copies!

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • This adds support for the sys_splice system call. Using a pipe as a
    transport, it can connect to files or sockets (latter as output only).

    From the splice.c comments:

    "splice": joining two ropes together by interweaving their strands.

    This is the "extended pipe" functionality, where a pipe is used as
    an arbitrary in-memory buffer. Think of a pipe as a small kernel
    buffer that you can use to transfer data from one end to the other.

    The traditional unix read/write is extended with a "splice()" operation
    that transfers data buffers to or from a pipe buffer.

    Named by Larry McVoy, original implementation from Linus, extended by
    Jens to support splicing to files and fixing the initial implementation
    bugs.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
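
A userspace sketch of the "pipe as transport" idea, using the splice() wrapper that glibc provides on Linux today. The helper name splice_copy and the chunk size are assumptions. Data goes file -> pipe -> file without passing through a userspace buffer.

```c
#define _GNU_SOURCE            /* splice() is a Linux-specific glibc extension */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SPLICE_CHUNK 65536     /* assumed pipe capacity; short transfers are handled */

/*
 * Copy up to len bytes from in_fd to out_fd through a pipe.
 * Returns the number of bytes copied (short on EOF), or -1 on error.
 */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    size_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return -1;

    while (total < len) {
        size_t want = len - total < SPLICE_CHUNK ? len - total : SPLICE_CHUNK;
        /* file -> pipe */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, want, 0);

        if (n < 0)
            goto fail;
        if (n == 0)
            break;              /* end of input */
        /* pipe -> file: drain exactly what was just queued */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n, 0);

            if (m <= 0)
                goto fail;
            n -= m;
            total += (size_t)m;
        }
    }
    close(p[0]);
    close(p[1]);
    return (ssize_t)total;

fail:
    close(p[0]);
    close(p[1]);
    return -1;
}
```

The pipe acts purely as the in-kernel buffer between the two descriptors. Whether pages are actually moved rather than copied depends on the kernel and the filesystems involved.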
     

30 Mar, 2006

6 commits

  • Add a field to the host_set called 'flags' (was host_set_flags changed
    to suit Jeff)
    Add a simplex_claimed field so we can remember who owns the DMA channel
    Add a ->mode_filter() hook to allow drivers to filter modes
    Add docs for mode_filter and set_mode
    Filter according to simplex state
    Filter cable in core

    This provides the needed framework to support all the mode rules found
    in the PATA world. The simplex filter deals with 'to spec' simplex DMA
    systems found in older chips. The cable filter avoids duplicating the
    same rules in each chip driver with PATA. Finally the mode filter is
    necessary because drive/chip combinations have errata that forbid
    certain modes with some drives or types of ATA object.

    Drive speed setup remains per channel for now and the filters now use
    the framework Tejun put into place which cleans them up a lot from the
    older libata-pata patches.

    Signed-off-by: Alan Cox
    Signed-off-by: Jeff Garzik

    Alan Cox
     
  • Some hardware doesn't want the usual mode setup logic running. This
    allows the hardware driver to replace it for special cases in the least
    invasive way possible.

    Signed-off-by: Alan Cox
    Signed-off-by: Jeff Garzik

    Alan Cox
     
  • This is the minimal patch set to enable the current code to be used with
    a controller following SFF (ie any PATA and early SATA controllers)
    safely without crashes if there is no BMDMA area or if BMDMA is not
    assigned by the BIOS for some reason.

    Simplex status is recorded but not acted upon in this change, this isn't
    a problem with the current drivers as none of them are for simplex
    hardware. A following diff will deal with that.

    The flags in the probe structure remain ->host_set_flags although Jeff
    asked me to rename them, simply because the rename would break the usual
    Linux rule that old code should break when there are changes, not
    compile and run and then blow up/eat your computer/etc. Renaming this
    later is a trivial exercise once a better name is chosen.

    Signed-off-by: Jeff Garzik

    Alan Cox
     
  • On an allyesconfig'ured kernel:

    Size  Uses  Wasted  Name and definition
    ===== ====  ======  ================================================
       95  162   12075  netif_wake_queue     include/linux/netdevice.h
      129   86    9265  dev_kfree_skb_any    include/linux/netdevice.h
      127   56    5885  netif_device_attach  include/linux/netdevice.h
       73   86    4505  dev_kfree_skb_irq    include/linux/netdevice.h
       46   60    1534  netif_device_detach  include/linux/netdevice.h
      119   16    1485  __netif_rx_schedule  include/linux/netdevice.h
      143    5     492  netif_rx_schedule    include/linux/netdevice.h
       81    7     366  netif_schedule       include/linux/netdevice.h

    netif_wake_queue is big because __netif_schedule is a big inline:

    static inline void __netif_schedule(struct net_device *dev)
    {
            if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
                    unsigned long flags;
                    struct softnet_data *sd;

                    local_irq_save(flags);
                    sd = &__get_cpu_var(softnet_data);
                    dev->next_sched = sd->output_queue;
                    sd->output_queue = dev;
                    raise_softirq_irqoff(NET_TX_SOFTIRQ);
                    local_irq_restore(flags);
            }
    }

    static inline void netif_wake_queue(struct net_device *dev)
    {
    #ifdef CONFIG_NETPOLL_TRAP
            if (netpoll_trap())
                    return;
    #endif
            if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state))
                    __netif_schedule(dev);
    }

    By de-inlining __netif_schedule we are saving a lot of text
    at each callsite of netif_wake_queue and netif_schedule.
    __netif_rx_schedule is also big, and it makes more sense to keep
    both of them out of line.

    Patch also deinlines dev_kfree_skb_any. We can deinline dev_kfree_skb_irq
    instead... oh well.

    netif_device_attach/detach are not hot paths, we can deinline them too.

    Signed-off-by: Denis Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Denis Vlasenko
     
  • Jeff Garzik
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (67 commits)
    [PATCH] powerpc: Remove oprofile spinlock backtrace code
    [PATCH] powerpc: Add oprofile calltrace support to all powerpc cpus
    [PATCH] powerpc: Add oprofile calltrace support
    [PATCH] for_each_possible_cpu: ppc
    [PATCH] for_each_possible_cpu: powerpc
    [PATCH] lock PTE before updating it in 440/BookE page fault handler
    [PATCH] powerpc: Kill _machine and hard-coded platform numbers
    ppc: Fix compile error in arch/ppc/lib/strcase.c
    [PATCH] git-powerpc: WARN was a dumb idea
    [PATCH] powerpc: a couple of trivial compile warning fixes
    powerpc: remove OCP references
    powerpc: Make uImage default build output for MPC8540 ADS
    powerpc: move math-emu over to arch/powerpc
    powerpc: use memparse() for mem= command line parsing
    ppc: fix strncasecmp prototype
    [PATCH] powerpc: make ISA floppies work again
    [PATCH] powerpc: Fix some initcall return values
    [PATCH] powerpc: Workaround for pSeries RTAS bug
    [PATCH] spufs: fix __init/__exit annotations
    [PATCH] powerpc: add hvc backend for rtas
    ...

    Linus Torvalds
     

29 Mar, 2006

5 commits

  • Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This
    makes the exit path more understandable and allows us to do
    cleanup_sighand() outside of ->siglock protected section.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch kills PIDTYPE_TGID pid_type thus saving one hash table in
    kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a
    preparation for the further tref/pids rework.

    This patch adds 'struct list_head thread_group' to 'struct task_struct'
    instead.

    We don't detach group leader from PIDTYPE_PID namespace until another
    thread inherits its ->pid == ->tgid, so we are safe wrt premature
    free_pidmap(->tgid) call.

    Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
    Should the need arise, we can use find_task_by_pid()->group_leader.

    Signed-off-by: Oleg Nesterov
    Acked-By: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __exit_signal() is private to release_task() now. I think it is better to
    make it static in kernel/exit.c and export flush_sigqueue() instead - this
    function is much more simple and straightforward.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to
    copy_sighand().

    This matches copy_signal/cleanup_signal naming, and I think it is easier to
    follow.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __exit_signal() does important cleanups atomically under ->siglock. It is
    also called from copy_process's error path. This is not good, for example we
    can't move __unhash_process() under ->siglock for that reason.

    We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under
    'bad_fork_cleanup_sighand:' label. For copy_process() case it is sufficient
    to just backout copy_signal(), nothing more.

    Again, nobody can see this task yet. For CLONE_THREAD case we just decrement
    signal->count, otherwise nobody can see this ->signal and we can free it
    lockless.

    This patch assumes it is safe to do exit_thread_group_keys() without
    tasklist_lock.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: David Howells
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov