01 Oct, 2006

6 commits

  • On systems running with virtual cpus there is optimization potential in
    regard to spinlocks and rw-locks. If the virtual cpu that holds a lock is
    known to the cpu that wants to acquire the same lock, it is beneficial to
    yield the timeslice of the waiting virtual cpu in favour of the one that
    holds the lock (a directed yield).

    With CONFIG_PREEMPT="n" this can be implemented by the architecture without
    common code changes. Powerpc already does this.

    With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock,
    _raw_read_trylock and _raw_write_trylock in kernel/spinlock.c. If the lock
    cannot be taken, cpu_relax is called. A directed yield is not possible
    because cpu_relax doesn't know anything about the lock. To be able to
    yield in favour of the current lock holder, variants of cpu_relax for
    spinlocks and rw-locks are needed. The new _raw_spin_relax,
    _raw_read_relax and _raw_write_relax primitives differ from cpu_relax
    in that they take an argument: a pointer to the lock structure.
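
    A userspace sketch of the shape of this change (the spinlock is modeled
    with a C11 atomic flag, and the relax hook only counts its invocations -
    the directed-yield policy itself is architecture code, so nothing here is
    the actual kernel implementation):

```c
#include <stdatomic.h>

/* Minimal model of the CONFIG_PREEMPT-style lock loop from
 * kernel/spinlock.c.  Unlike cpu_relax(), the new _raw_spin_relax()
 * receives the contended lock, so an architecture could look up the
 * holding virtual cpu and yield to it. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

static spinlock_t test_lock = { ATOMIC_FLAG_INIT };
static int relax_calls;

static int _raw_spin_trylock(spinlock_t *lock)
{
    return !atomic_flag_test_and_set(&lock->locked);
}

/* lock-aware relax hook: a directed yield would go here */
static void _raw_spin_relax(spinlock_t *lock)
{
    (void)lock;          /* arch code would inspect the holder */
    relax_calls++;
}

static void spin_lock(spinlock_t *lock)
{
    while (!_raw_spin_trylock(lock))
        _raw_spin_relax(lock);   /* was: cpu_relax() */
}

static void spin_unlock(spinlock_t *lock)
{
    atomic_flag_clear(&lock->locked);
}
```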

    Signed-off-by: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
    file.

    Signed-off-by: Cal Peake
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cal Peake
     
  • Add infrastructure to track "maximum allowable latency" for power saving
    policies.

    The reason for adding this infrastructure is that power management in the
    idle loop needs to make a tradeoff between latency and power savings
    (deeper power save modes have a longer latency to running code again). The
    code that makes this tradeoff today uses a rather simple algorithm;
    however, this is not good enough: there are devices and use cases that
    require a lower latency than the deeper power-saving states provide. An
    example is audio playback; another is the ipw2100 wireless driver, which
    right now has a very direct and ugly acpi hook to disable some higher
    power states randomly when it gets certain types of errors.

    The proposed solution is to have an interface where drivers can

    * announce the maximum latency (in microseconds) that they can deal with
    * modify this latency
    * give up their constraint

    and a function where the code that decides on power saving strategy can
    query the current global desired maximum.
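
    The driver-facing side can be modeled in userspace like this (the log
    does not spell out the kernel function names, so the names, the
    handle-based registration, and the fixed-size table below are all
    illustrative assumptions, not the real interface):

```c
#include <limits.h>

#define MAX_CONSTRAINTS 16

static int latency_us[MAX_CONSTRAINTS];
static int used[MAX_CONSTRAINTS];

/* announce the maximum tolerable latency (microseconds);
 * returns a handle, or -1 if the table is full */
static int set_acceptable_latency(int usec)
{
    for (int h = 0; h < MAX_CONSTRAINTS; h++) {
        if (!used[h]) {
            used[h] = 1;
            latency_us[h] = usec;
            return h;
        }
    }
    return -1;
}

/* modify an existing constraint */
static void modify_acceptable_latency(int h, int usec)
{
    latency_us[h] = usec;
}

/* give up the constraint */
static void remove_acceptable_latency(int h)
{
    used[h] = 0;
}

/* query for the power-saving policy code: the tightest
 * (smallest) latency any registered driver can tolerate */
static int system_latency_constraint(void)
{
    int min = INT_MAX;

    for (int h = 0; h < MAX_CONSTRAINTS; h++)
        if (used[h] && latency_us[h] < min)
            min = latency_us[h];
    return min;
}
```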

    This patch has a user on each side: on the consumer side, ACPI is patched
    to use this; on the producer side, the ipw2100 driver is patched.

    A generic maximum latency of 2 timer ticks is also registered (any more
    and accurate time tracking is lost).

    While the existing users of the patch are x86 specific, the infrastructure
    is not. I'd like to ask the arch maintainers of other architectures if the
    infrastructure is generic enough for their use (assuming the architecture
    has such a tradeoff as a concept at all), and the sound/multimedia driver
    owners to look at the driver facing API to see if this is something they
    can use.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Acked-by: Jesse Barnes
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Make it possible to disable the block layer. Not all embedded devices
    require it; some can make do with just JFFS2, NFS, ramfs, etc. - none of
    which require the block layer to be present.

    This patch does the following:

    (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
    support.

    (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
    an item that uses the block layer. This includes:

    (*) Block I/O tracing.

    (*) Disk partition code.

    (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

    (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
    block layer to do scheduling. Some drivers that use SCSI facilities -
    such as USB storage - end up disabled indirectly from this.

    (*) Various block-based device drivers, such as IDE and the old CDROM
    drivers.

    (*) MTD blockdev handling and FTL.

    (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
    taking a leaf out of JFFS2's book.

    (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
    linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
    however, still used in places, and so is still available.

    (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
    parts of linux/fs.h.

    (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

    (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

    (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
    is not enabled.

    (*) fs/no-block.c is created to hold out-of-line stubs and things that are
    required when CONFIG_BLOCK is not set:

    (*) Default blockdev file operations (to give error ENODEV on opening).

    (*) Makes some /proc changes:

    (*) /proc/devices does not list any blockdevs.

    (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

    (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

    (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
    given a command other than Q_SYNC or if a special device is specified.

    (*) In init/do_mounts.c, no reference is made to the blockdev routines if
    CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

    (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
    error ENOSYS by way of cond_syscall if so).

    (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
    CONFIG_BLOCK is not set, since they can't then happen.
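
    The stub pattern the series relies on can be illustrated with a minimal
    sketch (blkdev_open here is a made-up name standing in for a blockdev
    entry point; the real stubs live in fs/no-block.c and cover the default
    blockdev file operations):

```c
#include <errno.h>

/* Callers compile against the same prototype either way; with the
 * block layer configured out, the operation collapses to a stub
 * that fails with ENODEV, as fs/no-block.c does for opens. */
/* #define CONFIG_BLOCK */

#ifdef CONFIG_BLOCK
static int blkdev_open(void)
{
    return 0;               /* the real blockdev open path */
}
#else
static int blkdev_open(void)
{
    return -ENODEV;         /* no-block stub */
}
#endif
```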

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     
  • Create a new header file, fs/internal.h, for common definitions local to the
    sources in the fs/ directory.

    Move extern definitions that should be in header files from fs/*.c to
    fs/internal.h or other main header files where they span directories.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     
  • Remove the duplicate declaration of exit_io_context() from linux/sched.h.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     

30 Sep, 2006

34 commits

  • Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Use prototypes in headers
    Don't define panic_on_unrecovered_nmi for all architectures

    Cc: dzickus@redhat.com

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Fix obscure race condition in kernel/cpuset.c attach_task() code.

    There is basically zero chance of anyone accidentally being harmed by this
    race.

    It requires a special 'micro-stress' load and special timing-loop hacks
    in the kernel to hit it in less than an hour, and even then you'd have to
    hit it hundreds or thousands of times, followed by some unusual and
    senseless cpuset configuration requests, including removing the top
    cpuset, to cause any visible harmful effects.

    One could, with perhaps a few days or weeks of such effort, get the
    reference count on the top cpuset below zero, and manage to crash the
    kernel by asking to remove the top cpuset.

    I found it by code inspection.

    The race was introduced when 'the_top_cpuset_hack' was introduced, and one
    piece of code was not updated. An old check for a possibly null task
    cpuset pointer needed to be changed to a check for a task marked
    PF_EXITING. The pointer can't be null anymore, thanks to
    the_top_cpuset_hack (documented in kernel/cpuset.c). But the task could
    have gone into PF_EXITING state after it was found in the task_list scan.

    If a task is PF_EXITING in this code, it is possible that its task->cpuset
    pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
    than because the top_cpuset was that task's last valid cpuset. In that
    case, the wrong cpuset reference counter would be decremented.

    The fix is trivial. Instead of failing the system call if the task's
    cpuset pointer is null here, fail it if the task is in PF_EXITING state.

    The code for 'the_top_cpuset_hack' that changes an exiting task's cpuset
    to the top_cpuset is done without locking, so it could happen at any time.
    But it is done during exit handling, after the PF_EXITING flag is set. So
    if we verify that a task is still not PF_EXITING after we copy out its
    cpuset pointer (into 'oldcs', below), we know that 'oldcs' is not one of
    these hack references to the top_cpuset.
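
    A toy model of the corrected check (may_attach is a made-up name standing
    in for the relevant fragment of attach_task(); the real code does this
    under task_lock() and returns -ESRCH):

```c
#define PF_EXITING 0x00000004   /* flag value from linux/sched.h */

struct task {
    unsigned int flags;
};

/* Reject exiting tasks before copying out their cpuset pointer:
 * a PF_EXITING task may already point at the top_cpuset via
 * the_top_cpuset_hack, so decrementing the count on that pointer
 * would hit the wrong cpuset. */
static int may_attach(const struct task *tsk)
{
    if (tsk->flags & PF_EXITING)
        return -1;              /* the kernel returns -ESRCH here */
    return 0;
}
```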

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
    the locking self-test:

    ------------>
    | Locking API testsuite:
    ----------------------------------------------------------------------
                                   | spin |wlock |rlock |mutex | wsem | rsem |
      --------------------------------------------------------------------
                     A-A deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 A-B-B-A deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
             A-B-B-C-C-A deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
             A-B-C-A-B-C deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
         A-B-B-C-C-D-D-A deadlock:   ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
         A-B-C-D-B-D-D-A deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
         A-B-C-D-B-C-D-A deadlock:   ok  |  ok  |  ok  |  ok  |  ok  |FAILED|

    After much debugging it turned out to be caused by accidental chain-hash
    key collisions. The current hash is:

        #define iterate_chain_key(key1, key2) \
            (((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
             ((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
             (key2))

    where MAX_LOCKDEP_KEYS_BITS is 11. This hash is pretty good as it shifts
    by 5 bits in every iteration, where every new ID 'mixed' into the hash
    can have up to 11 bits. But because there was a 6-bit overlap between
    subsequent IDs, and their high bits tended to be similar, there was a
    chance of accidental chain-hash collisions for a low number of locks
    held.

    The solution is to shift by 11 bits:

        #define iterate_chain_key(key1, key2) \
            (((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
             ((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
             (key2))

    This keeps the hash perfect up to 5 locks held, but even above that the
    hash is still good because 11 is relatively prime to the total of 64
    bits, so a complete wrap-around only occurs after 64 held locks (which
    doesn't happen in Linux). Even beyond 5 locks held, the entropy of the 5
    IDs mixed into the hash is already good enough that the overlap doesn't
    generate a colliding hash ID.

    With this change the false positives went away.
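
    A small userspace check of the two mixing functions (the constants come
    from the commit; the chain IDs fed in below are arbitrary):

```c
#include <stdint.h>

#define MAX_LOCKDEP_KEYS_BITS 11

/* the old hash: 5-bit shift per mixed-in lock-class ID */
static uint64_t chain_key_old(uint64_t key1, uint64_t key2)
{
    return (key1 << (MAX_LOCKDEP_KEYS_BITS / 2)) ^
           (key1 >> (64 - MAX_LOCKDEP_KEYS_BITS / 2)) ^
           key2;
}

/* the fixed hash: full 11-bit shift, so up to 5 IDs occupy
 * disjoint bit ranges and short chains hash perfectly */
static uint64_t chain_key_new(uint64_t key1, uint64_t key2)
{
    return (key1 << MAX_LOCKDEP_KEYS_BITS) ^
           (key1 >> (64 - MAX_LOCKDEP_KEYS_BITS)) ^
           key2;
}
```

    With the 11-bit shift the order of held locks is fully reflected in the
    key for short chains, e.g. mixing IDs (3, 5) and (5, 3) can no longer
    collide by bit overlap alone.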

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add tty locking around the audit and accounting code.

    The whole current->signal-> locking is all deeply strange but it's for
    someone else to sort out. Add rather than replace the lock for acct.c

    Signed-off-by: Alan Cox
    Acked-by: Arjan van de Ven
    Cc: Al Viro
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • I had to look back: this code was extracted from the module.c code in 2005.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • I've been using systemtap for some debugging and I noticed that it can't
    probe a lot of modules. It turns out the reason is kind of silly: the
    sections directory of /sys/module is limited to 32-byte filenames, and
    many of the actual sections are a bit longer than that.

    [akpm@osdl.org: rewrite to use dynamic allocation]
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian S. Nelson
     
  • The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
    it could remove a CPU or Node from the top cpuset, while leaving it still
    in some child cpusets.

    One basic rule of cpusets is that each cpuset's cpus and mems are subsets
    of its parent's. The cpuset hot unplug code violated this rule.

    So the cpuset hotunplug handler must walk down the tree, removing any
    removed CPU or Node from all cpusets.

    However, it is not allowed to make a cpuset's cpus or mems become empty.
    They can only transition from empty to non-empty, not back.

    So if the last CPU or Node would be removed from a cpuset by the above
    walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
    that still has something online, and copy its CPU or Memory placement.

    Signed-off-by: Paul Jackson
    Cc: Nathan Lynch
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Change the list of memory nodes allowed to tasks in the top (root) cpuset
    to dynamically track which memory nodes are online, using a call to a
    cpuset hook from the memory hotplug code. Make this top 'mems' file
    read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support memory hotplug, then these tasks cannot make
    use of memory nodes that are added after system boot, because the memory
    nodes are not allowed in the top cpuset. This is a surprising regression
    over earlier kernels that didn't have cpusets enabled.

    One key motivation for this change is to remain consistent with the
    behaviour for the top_cpuset's 'cpus', which is also read-only, and which
    automatically tracks the cpu_online_map.

    This change also has the minor benefit that it fixes a long-standing,
    little-noticed, minor bug in cpusets. The cpuset performance tweak to
    short-circuit the cpuset_zone_allowed() check on systems with just a
    single cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that
    simply changing the 'mems' of the top_cpuset had no effect, even though
    the change (the write system call) appeared to succeed. With the
    following change, that write to the 'mems' file fails with -EACCES, and
    the 'mems' file stubbornly refuses to be changed via user-space writes.
    Thus no one should be misled into thinking they've changed the
    top_cpuset's 'mems' when in fact they haven't.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'mems' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of node_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged memory nodes
    allowed by their cpuset.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Paul Jackson
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • I am not sure about this patch; I am asking Ingo to make a decision.

    task_struct->state == EXIT_DEAD is a very special case; to avoid confusion
    it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
    live only in ->exit_state, as documented in sched.h.

    Note that this state is not visible to user-space, get_task_state() masks off
    unsuitable states.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • After the previous change, (->flags & PF_DEAD) implies
    (->state == EXIT_DEAD), so we don't need PF_DEAD any longer.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • schedule() checks PF_DEAD on every context switch and sets ->state =
    EXIT_DEAD to ensure that the exiting task will be deactivated. Note that
    this EXIT_DEAD is in fact a "random" value; we could use any bit except
    the normal TASK_XXX values.

    It is better to set this state in do_exit() along with the PF_DEAD flag,
    and to remove that check from schedule().

    We are safe wrt concurrent try_to_wake_up() (for example from ptrace or
    tkill): it cannot change the task's ->state, because the 'state' argument
    of try_to_wake_up() can't have the EXIT_DEAD bit. And in the case where
    try_to_wake_up() sees a stale value of ->state == TASK_RUNNING, it will
    do nothing.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use RCU locks for find_task_by_pid().

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It is ok to do find_task_by_pid() + get_task_struct() under
    rcu_read_lock(); we can drop tasklist_lock.

    Note that testing of ->exit_state is racy with or without tasklist anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • copy_process:

        // holds tasklist_lock + ->siglock
        /*
         * inherit ioprio
         */
        p->ioprio = current->ioprio;

    Why? ->ioprio was already copied in dup_task_struct(). I guess this is
    needed to ensure that the child can't escape
    sys_ioprio_set(IOPRIO_WHO_{PGRP,USER}), yes?

    In that case we don't need ->siglock held, and the comment should be
    updated.

    Signed-off-by: Oleg Nesterov
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove open-coded has_rt_policy(), no changes in kernel/exit.o

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • I am not sure this patch is correct: I can't understand what the current
    code does, and I don't know what it was supposed to do.

    The comment says:

        /*
         * can't change policy, except between SCHED_NORMAL
         * and SCHED_BATCH:
         */

    The code:

        if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
             (policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&

    But this is equivalent to:

        if ((is_rt_policy(policy) && has_rt_policy(p)) &&

    which means something different. We can't _decrease_ the current
    ->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0).

    Probably, it was supposed to be:

        if (!(policy == SCHED_NORMAL && p->policy == SCHED_BATCH) &&
            !(policy == SCHED_BATCH && p->policy == SCHED_NORMAL))

    This matches the comment, but is strange: it doesn't allow one to _drop_
    realtime priority when rlim[RLIMIT_RTPRIO] == 0.

    I think the right check would be:

        /* can't set/change rt policy */
        if (is_rt_policy(policy) &&
            policy != p->policy &&
            !rlim_rtprio)
                return -EPERM;
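
    The claimed equivalence can be confirmed by exhaustively enumerating the
    four scheduling policies of that era (the helper names follow the
    commit, but this is a standalone model, not kernel code):

```c
/* Policy values as in linux/sched.h at the time */
#define SCHED_NORMAL 0
#define SCHED_FIFO   1
#define SCHED_RR     2
#define SCHED_BATCH  3

static int is_rt_policy(int policy)
{
    return policy == SCHED_FIFO || policy == SCHED_RR;
}

/* the check exactly as coded in sched_setscheduler() */
static int check_as_coded(int policy, int old_policy)
{
    return (policy != SCHED_NORMAL && old_policy != SCHED_BATCH) &&
           (policy != SCHED_BATCH && old_policy != SCHED_NORMAL);
}

/* the commit's claim: that check is really an rt-vs-rt test */
static int check_claimed(int policy, int old_policy)
{
    return is_rt_policy(policy) && is_rt_policy(old_policy);
}

/* exhaustively compare the two over every pair of policies */
static int checks_agree(void)
{
    for (int p = SCHED_NORMAL; p <= SCHED_BATCH; p++)
        for (int o = SCHED_NORMAL; o <= SCHED_BATCH; o++)
            if (check_as_coded(p, o) != check_claimed(p, o))
                return 0;
    return 1;
}
```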

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Imho, makes the code a bit easier to read.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use rcu locks instead. sched_setscheduler() now takes ->siglock
    before reading ->signal->rlim[].

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Get rid of an extraneous printk in kernel_restart().

    Signed-off-by: Cal Peake
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cal Peake
     
  • If ____call_usermodehelper fails, we're not interested in the child
    process' exit value, but the real error, so let's stop wait_for_helper from
    overwriting it in that case.

    Issue discovered by Benedikt Böhm while working on a Linux-VServer usermode
    helper.

    Signed-off-by: Björn Steinbrink
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Björn Steinbrink
     
  • If we are going to BUG(), not panic(), here, then we should cover the
    case of BUG() being compiled out.

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • This fixes a couple of compiler warnings, and adds paranoia checks as well.

    Signed-off-by: Roland McGrath
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Pass ticks to do_timer() and update_times(), and adjust x86_64 and s390
    timer interrupt handler with this change.

    Currently update_times() calculates ticks as "jiffies - wall_jiffies",
    but callers of do_timer() should know how many ticks to update. Passing
    ticks gets rid of this redundant calculation. There is also another
    redundancy, pointed out by Martin Schwidefsky.

    This cleanup makes the barrier added by
    5aee405c662ca644980c184774277fc6d0769a84 needless, so this patch removes
    it.

    As a bonus, this cleanup makes it easy to remove wall_jiffies, since now
    wall_jiffies is always synced with jiffies. (This patch does not actually
    remove wall_jiffies; that would be another cleanup patch.)

    Signed-off-by: Atsushi Nemoto
    Cc: Martin Schwidefsky
    Cc: "Eric W. Biederman"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Acked-by: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Acked-by: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Acked-by: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • This tightens up __dequeue_signal a little. It also avoids doing
    recalc_sigpending twice in a row, instead doing it once in dequeue_signal.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This check has been obsolete since the introduction of TASK_TRACED. Now
    TASK_STOPPED always means job control stop.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • When a posix_cpu_nsleep() sleep is interrupted by a signal more than
    twice, it incorrectly reports the sleep time remaining to the user,
    because posix_cpu_nsleep() doesn't report back to the user when it is
    called from the restart function, due to wrong flags handling.

    This patch, which applies after the previous one, moves the sleep logic
    from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags
    handling appropriately.

    Signed-off-by: Toyo Abe
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toyo Abe
     
  • The clock_nanosleep() function does not return the time remaining when the
    sleep is interrupted by a signal.

    This patch creates a new call out, compat_clock_nanosleep_restart(), which
    handles returning the remaining time after a sleep is interrupted. This
    patch revives clock_nanosleep_restart(). It is now accessed via the new
    call out. The compat_clock_nanosleep_restart() is used for compatibility
    access.

    Since this is implemented in compatibility mode the normal path is
    virtually unaffected - no real performance impact.

    Signed-off-by: Toyo Abe
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toyo Abe
     
  • Spawning ksoftirqd, migration, or watchdog, and calling init_timers_cpu(),
    may fail when memory is low. If that happens in initcalls, a kernel NULL
    pointer dereference happens later. This patch makes the crash happen
    immediately in such cases. That seems a bit better than getting a kernel
    NULL pointer dereference later.

    Cc: Ingo Molnar
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Both __kfifo_put() and __kfifo_get() have header comments stating that if
    there is but one concurrent reader and one concurrent writer, locking is not
    necessary. This is almost the case, but a couple of memory barriers are
    needed. Another option would be to change the header comments to remove
    the bit about locking not being needed, and to change those callers who
    currently don't use locking to add the required locking. The attachment
    analyzes this approach, but the patch below seems simpler.
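
    The ordering the patch enforces can be sketched in userspace with C11
    release/acquire operations standing in for the smp_wmb()/smp_rmb() pairs
    (a simplified single-producer/single-consumer analogue of
    __kfifo_put()/__kfifo_get(), not the kernel code itself):

```c
#include <stdatomic.h>

#define SIZE 8   /* power of two, as the real kfifo requires */

static unsigned char buf[SIZE];
static unsigned char scratch[SIZE];
static _Atomic unsigned int in, out;   /* free-running indices */

static unsigned int fifo_put(const unsigned char *src, unsigned int len)
{
    unsigned int i = atomic_load_explicit(&in, memory_order_relaxed);
    unsigned int o = atomic_load_explicit(&out, memory_order_acquire);
    unsigned int space = SIZE - (i - o);

    if (len > space)
        len = space;
    for (unsigned int k = 0; k < len; k++)
        buf[(i + k) & (SIZE - 1)] = src[k];
    /* release pairs with the consumer's acquire: the data stores
     * above must become visible before the new 'in' index does */
    atomic_store_explicit(&in, i + len, memory_order_release);
    return len;
}

static unsigned int fifo_get(unsigned char *dst, unsigned int len)
{
    unsigned int o = atomic_load_explicit(&out, memory_order_relaxed);
    unsigned int i = atomic_load_explicit(&in, memory_order_acquire);
    unsigned int avail = i - o;

    if (len > avail)
        len = avail;
    for (unsigned int k = 0; k < len; k++)
        dst[k] = buf[(o + k) & (SIZE - 1)];
    /* release: the reads above must complete before the slots are
     * handed back to the producer */
    atomic_store_explicit(&out, o + len, memory_order_release);
    return len;
}
```

    Without the ordering on the index updates, the consumer could observe the
    new 'in' before the data stores, which is exactly the hazard the added
    barriers close.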

    Signed-off-by: Paul E. McKenney
    Cc: Stelian Pop
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Let's do the same thing we do for oopses - print out the version in the
    report. It's an extra line of output, though. We could tack it on to the
    end of the INFO: lines, but that screws up Ingo's pretty output.

    Signed-off-by: Dave Jones
    Cc: Ingo Molnar
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • This is an updated version of Eric Biederman's is_init() patch.
    (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
    replaces a few more instances of ->pid == 1 with is_init().

    Further, is_init() checks pid and thus removes dependency on Eric's other
    patches for now.

    Eric's original description:

    There are a lot of places in the kernel where we test for init
    because we give it special properties. Most significantly, init
    must not die. This results in code all over the kernel testing
    ->pid == 1.

    Introduce is_init to capture this case.

    With multiple pid spaces for all of the cases affected we are
    looking for only the first process on the system, not some other
    process that has pid == 1.
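
    A minimal sketch of the helper being introduced (the real version takes
    a struct task_struct * and tests tsk->pid; the trimmed struct here is
    only for illustration):

```c
/* Captures the "is this the first process on the system?" test
 * in one place instead of open-coded ->pid == 1 checks. */
struct task {
    int pid;
};

static inline int is_init(const struct task *tsk)
{
    return tsk->pid == 1;
}
```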

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc:
    Acked-by: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Fix a race on put_files_struct() on exec with proc. Restoring files on
    current on the error path may leave proc with a pointer to an already
    kfree-d files_struct.

    The ->files changes in exit.c and kthread.c are safe, as exit_files()
    does everything under the lock.

    Found during OpenVZ stress testing.

    [akpm@osdl.org: add export]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Fix "variable defined but not used" compiler warning in unwind.c when
    CONFIG_MODULES is not set.

    Signed-off-by: Chuck Ebbert
    Cc: Jan Beulich
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuck Ebbert