17 Oct, 2007

40 commits

  • Control the trigger limit for softlockup warnings. This is useful for
    debugging softlockups: by lowering softlockup_thresh, possible
    softlockups can be identified earlier.

    This patch:
    1. Adds a sysctl, softlockup_thresh, with valid values of 1-60s
    (a higher value helps avoid false positives)
    2. Changes the softlockup printk to print the CPU's softlockup time
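
    A minimal sketch of what such a sysctl table entry could look like,
    modeled on the 2.6.23-era kernel/sysctl.c conventions (the exact entry
    in the patch may differ):

    static int softlockup_thresh = 10; /* warn after 10 seconds by default */
    static int one = 1;
    static int sixty = 60;

    /* entry in the kernel sysctl table, read/written via
       /proc/sys/kernel/softlockup_thresh: */
    {
        .ctl_name     = CTL_UNNUMBERED,
        .procname     = "softlockup_thresh",
        .data         = &softlockup_thresh,
        .maxlen       = sizeof(int),
        .mode         = 0644,
        .proc_handler = &proc_dointvec_minmax,
        .strategy     = &sysctl_intvec,
        .extra1       = &one,          /* clamp writes to the 1..60 range */
        .extra2       = &sixty,
    },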

    [akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • kernel/softlockup.c grew a few style uncleanlinesses in the past few
    months; clean that up. No functional changes:

        text    data     bss     dec     hex  filename
        1126      76       4    1206     4b6  softlockup.o.before
        1129      76       4    1209     4b9  softlockup.o.after

    ( the 3-byte .text increase is due to a string appended to one of
    the printk messages. )

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Improve the debuggability of kernel lockups by enhancing the debug
    output of the softlockup detector: print the task that causes the lockup
    and try to print a more intelligent backtrace.

    The old format was:

    BUG: soft lockup detected on CPU#1!
    [] show_trace_log_lvl+0x19/0x2e
    [] show_trace+0x12/0x14
    [] dump_stack+0x14/0x16
    [] softlockup_tick+0xbe/0xd0
    [] run_local_timers+0x12/0x14
    [] update_process_times+0x3e/0x63
    [] tick_sched_timer+0x7c/0xc0
    [] hrtimer_interrupt+0x135/0x1ba
    [] smp_apic_timer_interrupt+0x6e/0x80
    [] apic_timer_interrupt+0x33/0x38
    [] syscall_call+0x7/0xb
    =======================

    The new format is:

    BUG: soft lockup detected on CPU#1! [prctl:2363]

    Pid: 2363, comm: prctl
    EIP: 0060:[] CPU: 1
    EIP is at sys_prctl+0x24/0x18c
    EFLAGS: 00000213 Not tainted (2.6.22-cfs-v20 #26)
    EAX: 00000001 EBX: 000003e7 ECX: 00000001 EDX: f6df0000
    ESI: 000003e7 EDI: 000003e7 EBP: f6df0fb0 DS: 007b ES: 007b FS: 00d8
    CR0: 8005003b CR2: 4d8c3340 CR3: 3731d000 CR4: 000006d0
    [] show_trace_log_lvl+0x19/0x2e
    [] show_trace+0x12/0x14
    [] show_regs+0x1ab/0x1b3
    [] softlockup_tick+0xef/0x108
    [] run_local_timers+0x12/0x14
    [] update_process_times+0x3e/0x63
    [] tick_sched_timer+0x7c/0xc0
    [] hrtimer_interrupt+0x135/0x1ba
    [] smp_apic_timer_interrupt+0x6e/0x80
    [] apic_timer_interrupt+0x33/0x38
    [] syscall_call+0x7/0xb
    =======================

    Note that in the old format we only knew that some system call locked
    up; we didn't know _which_. With the new format we know that it's at a
    specific place in sys_prctl() (which is where I created an artificial
    kernel lockup to test the new format).

    This is also useful if the lockup happens in user-space - the user-space
    EIP (and other registers) will be printed too. (Such a lockup would
    suggest either that the task was running at SCHED_FIFO:99 and looping
    for more than 10 seconds, or that the softlockup detector produced a
    false positive.)

    The task name is printed first, too, just in case we don't manage to
    print a useful backtrace.
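
    The reporting path that produces the output above presumably boils down
    to something like this (a sketch reconstructed from the new format; the
    get_irq_regs() plumbing is what the commit below provides):

    struct pt_regs *regs = get_irq_regs();

    printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
            this_cpu, current->comm, current->pid);
    if (regs)
            show_regs(regs);    /* full register dump, incl. user-space EIP */
    else
            dump_stack();       /* fall back to a plain backtrace */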

    [satyam@infradead.org: fix warning]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Satyam Sharma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • The softlockup detector would like to use get_irq_regs(), so generalize
    its availability to every Linux architecture.

    (It is fine for an architecture to always return NULL from
    get_irq_regs(), which is what it does by default.)
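
    The generic implementation (cf. include/asm-generic/irq_regs.h) that
    this makes available everywhere is roughly the following; an
    architecture that never calls set_irq_regs() simply hands back NULL:

    DECLARE_PER_CPU(struct pt_regs *, __irq_regs);

    static inline struct pt_regs *get_irq_regs(void)
    {
            return __get_cpu_var(__irq_regs);
    }

    /* called on interrupt entry/exit to save/restore the interrupted
       context: */
    static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
    {
            struct pt_regs *old_regs, **pp_regs = &__get_cpu_var(__irq_regs);

            old_regs = *pp_regs;
            *pp_regs = new_regs;
            return old_regs;
    }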

    Signed-off-by: Ingo Molnar
    Cc: Ian Molton
    Cc: Kumar Gala
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Mikael Starvik
    Cc: Miles Bader
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • This Xen-related commit:

    commit 966812dc98e6a7fcdf759cbfa0efab77500a8868
    Author: Jeremy Fitzhardinge
    Date: Tue May 8 00:28:02 2007 -0700

    Ignore stolen time in the softlockup watchdog

    broke the softlockup watchdog so that it never reports any lockups. (!)

    print_timestamp defaults to 0, which makes the following condition
    always true:

    if (print_timestamp < (touch_timestamp + 1) ||

    so in essence we'll never report soft lockups.

    Apparently the functionality of the soft lockup watchdog was never
    actually tested with that patch applied ...

    Signed-off-by: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • sched_clock() is not a reliable time-source, use cpu_clock() instead.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Remove the rmb() from mce_log(), since the immunized version of
    rcu_dereference() makes it unnecessary.

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Turns out that compiler writers are a bit more aggressive about
    optimizing than one might expect. This patch prevents a number of such
    optimizations from messing up rcu_dereference(). This is not merely a
    theoretical problem, as evidenced by the rmb() in mce_log().
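
    The "immunized" rcu_dereference() works by forcing a single volatile
    load of the pointer, so the compiler can neither refetch nor speculate
    it; roughly (a sketch of the approach, names as merged):

    /* force the compiler to load x exactly once: */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    #define rcu_dereference(p)  ({ \
                                typeof(p) _________p1 = ACCESS_ONCE(p); \
                                smp_read_barrier_depends(); \
                                (_________p1); \
                                })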

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Acked-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • list_del() can hardly fail, so checking its return value is pointless
    (and the current code always returns 0).

    Nobody really cared about that return value anyway.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Switch the singly-linked binfmt formats list to the usual list_heads.
    This leads to one-liners in register_binfmt() and unregister_binfmt(),
    as sketched below. The downside is one more pointer in struct
    linux_binfmt, but that is not a problem, since the set of registered
    binfmts on a typical box is very small (ELF plus whatever your distro
    enabled for you).

    Test-booted, played with executable .txt files, modprobe/rmmod binfmt_misc.
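
    The resulting one-liners are presumably along these lines (a sketch,
    keeping the existing binfmt_lock locking):

    int register_binfmt(struct linux_binfmt *fmt)
    {
            if (!fmt)
                    return -EINVAL;
            write_lock(&binfmt_lock);
            list_add(&fmt->lh, &formats);  /* lh: the new list_head member */
            write_unlock(&binfmt_lock);
            return 0;
    }

    void unregister_binfmt(struct linux_binfmt *fmt)
    {
            write_lock(&binfmt_lock);
            list_del(&fmt->lh);
            write_unlock(&binfmt_lock);
    }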

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Remove the following no-longer-used functions:
    - bitmap.c: reiserfs_claim_blocks_to_be_allocated()
    - bitmap.c: reiserfs_release_claimed_blocks()
    - bitmap.c: reiserfs_can_fit_pages()

    Make the following functions static:
    - inode.c: restart_transaction()
    - journal.c: reiserfs_async_progress_wait()

    Signed-off-by: Adrian Bunk
    Acked-by: Vladimir V. Saveliev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This is a writeback-internal marker, but we're propagating it all the
    way back to userspace!

    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • zone->lock is quite an "inner" lock and mostly constrained to page alloc as
    well, so like slab locks, it probably isn't something that is critically
    important to document here. However unlike slab locks, zone lock could be
    used more widely in future, and page_alloc.c might possibly have more
    business to do tricky things with pagecache than does slab. So... I don't
    think it hurts to document it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Introduces a new zone flag interface for testing and setting flags:

    int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

    Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone()
    is called, this flag is tested and set before starting zone reclaim. Zone
    reclaim starts in __alloc_pages() when a zone's watermark fails and the
    system is in zone_reclaim_mode. If the zone is already in reclaim, there's
    no need to start again, so it is simply considered full for that
    allocation attempt; see the sketch below.

    There is a change of behavior with regard to concurrent zone shrinking. It is
    now possible for try_to_free_pages() or kswapd to already be shrinking a
    particular zone when __alloc_pages() starts zone reclaim. In this case, it is
    possible for two concurrent threads to invoke shrink_zone() for a single zone.

    This change still forbids a zone from being in zone reclaim twice (which
    was always the behavior), but allows for concurrent try_to_free_pages()
    or kswapd shrinking when starting zone reclaim.
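
    A sketch of the new helper and its use when entering zone reclaim
    (assuming the zone->flags word introduced elsewhere in this series):

    static inline int zone_test_and_set_flag(struct zone *zone,
                                             zone_flags_t flag)
    {
            return test_and_set_bit(flag, &zone->flags);
    }

    /* in zone_reclaim(): claim the zone, or consider it full for this
       allocation attempt if someone else is already reclaiming it: */
    if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
            return 0;
    ret = __zone_reclaim(zone, gfp_mask, order);
    zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);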

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if
    the lock can't be acquired; it will be available soon enough once the zonelist
    scanning is done. All other threads waiting for the OOM killer are also
    contingent on the exiting task being able to acquire the lock in
    clear_zonelist_oom() so it doesn't make sense to put it to sleep.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Preprocess include/linux/oom.h before exporting it to userspace.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Alexey Dobriyan
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It's not necessary to include all of linux/sched.h in linux/oom.h.
    Instead, simply include forward declarations for the relevant structs
    and include linux/types.h for gfp_t.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Acked-by: Alexey Dobriyan
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since no task descriptor's 'cpuset' field is dereferenced in the execution of
    the OOM killer anymore, it is no longer necessary to take callback_mutex.

    [akpm@linux-foundation.org: restore cpuset_lock for other patches]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better
    to simply test for intersection between the tasks' mems_allowed in their
    task descriptors. This does not require taking callback_mutex since it
    is only used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.
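
    In badness() this presumably reduces to a simple scoring hint along
    these lines (the exact divisor is illustrative):

    /*
     * If p's nodes don't overlap ours, it may still help to kill p because
     * p may have allocated or mapped memory on this node before; it is
     * just less likely.
     */
    if (!cpuset_mems_allowed_intersects(current, p))
            points /= 8;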

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Suppresses the extraneous stack and memory dump when a parallel OOM kill
    has been detected. There's no need to fill the ring buffer with this
    information if it's already been printed and the condition that
    triggered the previous OOM killer has not yet been alleviated.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill
    the OOM-triggering task instead of scanning through the tasklist to find a
    memory-hogging target. This is helpful for systems with an insanely large
    number of tasks where scanning the tasklist significantly degrades
    performance.
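
    The check presumably sits at the top of the tasklist-scanning path in
    out_of_memory(), along these lines (a sketch; the exact message and
    oom_kill_process() signature are assumptions):

    if (sysctl_oom_kill_allocating_task) {
            oom_kill_process(current, gfp_mask, order, 0,
                             "Out of memory (oom_kill_allocating_task)");
            return;
    }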

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • A final allocation attempt with a very high watermark needs to be made
    before invoking out_of_memory(). OOM killer serialization needs to occur
    before this final attempt; otherwise a task attempting to OOM-lock all
    zones in its zonelist may spin and acquire the lock unnecessarily after
    the OOM condition has already been alleviated.

    If the final allocation does succeed, the zonelist is simply OOM-unlocked and
    __alloc_pages() returns the page. Otherwise, the OOM killer is invoked.

    If the task cannot acquire OOM-locks on all zones in its zonelist, it is put
    to sleep and the allocation is retried when it gets rescheduled. One of its
    zones is already marked as being in the OOM killer so it'll hopefully be
    getting some free memory soon, at least enough to satisfy a high watermark
    allocation attempt. This prevents needlessly killing a task when the OOM
    condition would have already been alleviated if it had simply been given
    enough time.
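
    Put together, the reworked tail of __alloc_pages() could look roughly
    like this (a sketch using the helpers from this series; the exact flag
    names are assumptions):

    /* serialize: give up and retry later if another OOM kill is underway */
    if (!try_set_zone_oom(zonelist)) {
            schedule_timeout_uninterruptible(1);
            goto restart;
    }

    /* last attempt, with a very high watermark, before killing anything: */
    page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
                                  zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
    if (page) {
            clear_zonelist_oom(zonelist);
            goto got_pg;
    }

    out_of_memory(zonelist, gfp_mask, order);
    clear_zonelist_oom(zonelist);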

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • OOM killer synchronization should be done with zone granularity so that memory
    policy and cpuset allocations may have their corresponding zones locked and
    allow parallel kills for other OOM conditions that may exist elsewhere in the
    system. DMA allocations can be targeted at the zone level, which would
    not be possible if locking were done at the node level or globally.

    Synchronization is done with a variation of "trylocks". The goal is to
    put the current task to sleep and restart the failed allocation attempt
    later if the trylock fails. Otherwise, the OOM killer is invoked.

    Each zone in the zonelist that __alloc_pages() was called with is checked for
    the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present,
    the "trylock" to serialize the OOM killer fails and returns zero. Otherwise,
    all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
    returns non-zero.
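
    A sketch of the trylock, taken under a global spinlock (here called
    zone_scan_lock, an assumed name) so that two racing trylocks cannot
    both succeed:

    int try_set_zone_oom(struct zonelist *zonelist)
    {
            struct zone **z = zonelist->zones;
            int ret = 1;

            spin_lock(&zone_scan_lock);
            do {
                    if (zone_is_oom_locked(*z)) {
                            ret = 0;        /* someone beat us to it */
                            goto out;
                    }
            } while (*(++z) != NULL);

            /* mark every zone in the zonelist as OOM-locked: */
            z = zonelist->zones;
            do {
                    zone_set_flag(*z, ZONE_OOM_LOCKED);
            } while (*(++z) != NULL);
    out:
            spin_unlock(&zone_scan_lock);
            return ret;
    }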

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Convert the int all_unreclaimable member of struct zone to an unsigned
    long flags field. This can now hold several different zone flags, so the
    separate all_unreclaimable and reclaim_in_progress members can be
    removed and converted to per-zone flags.

    Flags are set and cleared as follows:

    zone_set_flag(struct zone *zone, zone_flags_t flag)
    zone_clear_flag(struct zone *zone, zone_flags_t flag)

    Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
    which have the same semantics as the old zone->all_unreclaimable and
    zone->reclaim_in_progress, respectively. Also converts all current users that
    set or clear either flag to use the new interface.

    Helper functions are defined to test the flags:

    int zone_is_all_unreclaimable(const struct zone *zone)
    int zone_is_reclaim_locked(const struct zone *zone)

    All flag operators are of the atomic variety because there are currently
    readers that do not take zone->lock.
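
    A sketch of the flag type and helpers (matching the semantics described
    above):

    typedef enum {
            ZONE_ALL_UNRECLAIMABLE, /* was zone->all_unreclaimable */
            ZONE_RECLAIM_LOCKED,    /* was zone->reclaim_in_progress */
    } zone_flags_t;

    static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
    {
            set_bit(flag, &zone->flags);
    }

    static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
    {
            clear_bit(flag, &zone->flags);
    }

    static inline int zone_is_all_unreclaimable(const struct zone *zone)
    {
            return test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->flags);
    }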

    [akpm@linux-foundation.org: add needed include]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The OOM killer's CONSTRAINT definitions are really more appropriate in an
    enum, so define them in include/linux/oom.h.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Move the OOM killer's extern function prototypes to include/linux/oom.h and
    include it where necessary.

    [clg@fr.ibm.com: build fix]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Slab constructors currently have a flags parameter that is never used.
    And the order of the arguments is the opposite of other slab functions:
    the object pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel
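
    For a hypothetical cache with a constructor, the conversion looks like:

    /* before: */
    static void foo_ctor(void *object, struct kmem_cache *s,
                         unsigned long flags)
    {
            struct foo *f = object;

            spin_lock_init(&f->lock);
    }

    /* after (unused flags dropped, cache pointer first): */
    static void foo_ctor(struct kmem_cache *s, void *object)
    {
            struct foo *f = object;

            spin_lock_init(&f->lock);
    }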

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move irq handling out of new_slab() into __slab_alloc(). That is useful
    for Mathieu's cmpxchg_local patchset and also allows us to remove the
    crude local_irq_off in early_kmem_cache_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Based on ideas of Andrew:
    http://marc.info/?l=linux-kernel&m=102912915020543&w=2

    Scale the bdi dirty limit inversely with the task's dirty rate. This
    makes heavy writers have a lower dirty limit than the occasional writer.

    Andrea proposed something similar:
    http://lwn.net/Articles/152277/

    The main disadvantage of his patch is that he uses an unrelated quantity
    to measure time, which leaves him with a workload-dependent tunable.
    Other than that, the two approaches appear quite similar.
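
    For this patch, the resulting per-task limit is then roughly (a sketch
    of the idea; the exact fraction withheld from heavy dirtiers is an
    assumption):

        task_dirty_limit = dirty_limit - (dirty_limit / 8) * p_task

    where p_task is the task's recent share of all dirtying activity, so a
    task doing all the dirtying sees its limit lowered by up to 12.5%.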

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Scale writeback cache per backing device, proportional to its writeout speed.

    By decoupling the BDI dirty thresholds, a number of problems we
    currently have will go away, namely:

    - mutual interference starvation (for any number of BDIs);
    - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

    It might be that all dirty pages are for a single BDI while other BDIs are
    idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
    dirty pages outstanding and make progress.

    A global threshold also creates a deadlock for stacked BDIs; when A writes to
    B, and A generates enough dirty pages to get throttled, B will never start
    writeback until the dirty pages go away. Again, by giving each BDI its own
    'independent' dirty limit, this problem is avoided.

    So the problem is to determine how to distribute the total dirty limit
    across the BDIs fairly and efficiently. A BDI that has a large dirty
    limit but does not have any dirty pages outstanding is a waste.

    What is done is to keep a floating proportion between the BDIs based on
    writeback completions. This way faster/more active devices get a larger
    share than slower/idle devices.

    [akpm@linux-foundation.org: fix warnings]
    [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Given a set of objects, floating proportions aims to efficiently give
    the proportional 'activity' of a single item as compared to the whole
    set, where 'activity' is a measure of a temporal property of the items.

    It is efficient in that it need not inspect any other items of the set
    in order to provide the answer. It does not even need to know how many
    other items there are.

    It has one parameter, and that is the period of 'time' over which the
    'activity' is measured.
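
    Concretely, with x_j(-i) denoting the number of events item j generated
    during the i-th previous period and x(-i) the total over all items, the
    proportion attributed to j is roughly

        p_j = (\sum_i x_j(-i) / 2^i) / (\sum_i x(-i) / 2^i)

    i.e. the ratio of two exponentially decayed event counts. In the
    implementation this falls out of simply halving all counts once per
    period, which is why no other item needs to be inspected to evaluate
    p_j for a single item.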

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI writeback pages.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide scalable per backing_dev_info statistics counters.
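
    A sketch of what this layers on top of percpu_counter (the names here
    follow what was merged, but treat them as assumptions):

    enum bdi_stat_item {
            BDI_RECLAIMABLE,
            BDI_WRITEBACK,
            NR_BDI_STAT_ITEMS
    };

    /* struct backing_dev_info grows:
     *      struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
     */

    static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
                                      enum bdi_stat_item item)
    {
            __percpu_counter_add(&bdi->bdi_stat[item], 1, BDI_STAT_BATCH);
    }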

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide BDI constructor/destructor hooks.

    [akpm@linux-foundation.org: compile fix]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide a way to tell lockdep about percpu_counters that are supposed
    to be used from IRQ-safe contexts.
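
    Presumably an init variant that gives the counter's spinlock its own
    lockdep class, along these lines (a sketch; the macro name is an
    assumption):

    #define percpu_counter_init_irq(fbc, value)                 \
            do {                                                \
                    static struct lock_class_key __key;         \
                                                                \
                    percpu_counter_init(fbc, value);            \
                    lockdep_set_class(&(fbc)->lock, &__key);    \
            } while (0)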

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • alloc_percpu() can fail; propagate that error.
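
    I.e. percpu_counter_init() now returns an error instead of silently
    continuing with a NULL per-cpu array (a sketch):

    int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
    {
            spin_lock_init(&fbc->lock);
            fbc->count = amount;
            fbc->counters = alloc_percpu(s32);
            if (!fbc->counters)
                    return -ENOMEM; /* callers must now check this */
            return 0;
    }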

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide an accurate version of percpu_counter_read.
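
    percpu_counter_read() just returns the possibly-stale central count; the
    accurate variant has to take the lock and fold in every cpu's delta,
    roughly:

    s64 percpu_counter_sum(struct percpu_counter *fbc)
    {
            s64 ret;
            int cpu;

            spin_lock(&fbc->lock);
            ret = fbc->count;
            for_each_online_cpu(cpu) {
                    s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                    ret += *pcount;
            }
            spin_unlock(&fbc->lock);
            return ret;
    }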

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • s/percpu_counter_sum/&_positive/

    Because it's consistent with percpu_counter_read*.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide a method to set a percpu counter to a specified value.
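
    Setting a value means zeroing every per-cpu delta and installing the new
    central count, all under the lock; a sketch:

    void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
    {
            int cpu;

            spin_lock(&fbc->lock);
            for_each_possible_cpu(cpu) {
                    s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                    *pcount = 0;
            }
            fbc->count = amount;
            spin_unlock(&fbc->lock);
    }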

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra