24 Jun, 2005

4 commits

  • sys_timer_settime/sys_timer_delete need to delete k_itimer->real.timer
    synchronously while holding ->it_lock, which is also locked in
    posix_timer_fn.

    This patch removes timer_active/set_timer_inactive which plays with
    timer_list's internals in favour of using try_to_del_timer_sync(), which
    was introduced in the previous patch.
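
    For illustration, the deletion path then follows this pattern (a sketch;
    field and variable names follow the description above rather than the
    exact source):

        retry_delete:
                spin_lock_irqsave(&timer->it_lock, flags);

                /* -1 means posix_timer_fn() is running right now */
                if (try_to_del_timer_sync(&timer->real.timer) < 0) {
                        spin_unlock_irqrestore(&timer->it_lock, flags);
                        goto retry_delete;      /* let the handler finish, then retry */
                }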

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch splits del_timer_sync() into two functions. The new one,
    try_to_del_timer_sync(), returns -1 when it hits an executing timer.

    It can be used in interrupt context, or when the caller holds locks which
    can prevent completion of the timer's handler.

    NOTE. Currently it can't be used in interrupt context in the UP case,
    because ->running_timer is used only with CONFIG_SMP.

    Should the need arise, it is possible to kill the #ifdef CONFIG_SMP in
    set_running_timer(); it is cheap.
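
    In rough outline the split looks like this (a simplified sketch, using the
    locking helpers described in the timer-rework patch below, not the literal
    source):

        int try_to_del_timer_sync(struct timer_list *timer)
        {
                struct timer_base_s *base;
                unsigned long flags;
                int ret = -1;

                base = lock_timer_base(timer, &flags);
                if (base->running_timer != timer) {
                        ret = 0;                        /* was not pending */
                        if (timer_pending(timer)) {
                                detach_timer(timer, 1);
                                ret = 1;                /* deleted a pending timer */
                        }
                }
                spin_unlock_irqrestore(&base->lock, flags);

                return ret;
        }

        int del_timer_sync(struct timer_list *timer)
        {
                for (;;) {
                        int ret = try_to_del_timer_sync(timer);
                        if (ret >= 0)
                                return ret;     /* spins only while the handler runs */
                }
        }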

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch tries to solve following problems:

    1. del_timer_sync() is racy. The timer can be fired again after
    del_timer_sync() has checked all CPUs and before it rechecks
    timer_pending().

    2. It has scalability problems. All CPUs are scanned to determine
    whether the timer is running on that CPU.

    With this patch del_timer_sync is O(1) and no slower than plain
    del_timer(pending_timer), unless it has to actually wait for
    completion of the currently running timer.

    The only restriction is that the recurring timer should not use
    add_timer_on().

    3. Timers are not serialized with respect to themselves.

    If CPU_0 does mod_timer(jiffies+1) while the timer is currently
    running on CPU_1, it is quite possible that the local timer interrupt on
    CPU_0 will start that timer before it has finished on CPU_1.

    4. The timer locking is suboptimal. __mod_timer() takes 3 locks
    at once and still requires wmb() in del_timer/run_timers.

    The new implementation takes 2 locks sequentially and does not
    need memory barriers.

    Currently ->base != NULL means that the timer is pending. In that case
    ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
    because ->base can be == NULL.

    This patch uses timer->entry.next != NULL as indication that the timer is
    pending. So it does __list_del(), entry->next = NULL instead of list_del()
    when the timer is deleted.
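
    For illustration, the detach helper then looks roughly like this (a
    sketch; clearing ->next is exactly what makes timer_pending() return
    false):

        static inline void detach_timer(struct timer_list *timer, int clear_pending)
        {
                struct list_head *entry = &timer->entry;

                __list_del(entry->prev, entry->next);
                if (clear_pending)
                        entry->next = NULL;     /* timer_pending() now sees "not pending" */
                entry->prev = LIST_POISON2;
        }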

    The ->base field is used for hashed locking only, it is initialized
    in init_timer() which sets ->base = per_cpu(tvec_bases). When the
    tvec_bases.lock is locked, it means that all timers which are tied
    to this base via timer->base are locked, and the base itself is locked
    too.

    So __run_timers/migrate_timers can safely modify all timers which could
    be found on ->tvX lists (pending timers).

    When the timer's base is locked, and the timer removed from the ->entry list
    (which means that __run_timers/migrate_timers can't see this timer), it is
    possible to set timer->base = NULL and drop the lock: the timer remains
    locked.

    This patch adds lock_timer_base() helper, which waits for ->base != NULL,
    locks the ->base, and checks it is still the same.
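
    In sketch form (simplified):

        static struct timer_base_s *lock_timer_base(struct timer_list *timer,
                                                    unsigned long *flags)
        {
                struct timer_base_s *base;

                for (;;) {
                        base = timer->base;
                        if (likely(base != NULL)) {
                                spin_lock_irqsave(&base->lock, *flags);
                                if (likely(base == timer->base))
                                        return base;    /* still the same base: locked */
                                /* the timer migrated while we waited for the lock */
                                spin_unlock_irqrestore(&base->lock, *flags);
                        }
                        cpu_relax();
                }
        }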

    __mod_timer() schedules the timer on the local CPU and changes its base.
    However, it does not lock both old and new bases at once. It locks the
    timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
    unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
    and adds this timer. This simplifies the code, because AB-BA deadlock is not
    possible. __mod_timer() also ensures that the timer's base is not changed
    while the timer's handler is running on the old base.
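
    The base switch, in sketch form (heavily condensed; this_cpu_timer_base()
    is a hypothetical stand-in for looking up the local CPU's
    tvec_bases.t_base):

        struct timer_base_s *base, *new_base;
        unsigned long flags;

        base = lock_timer_base(timer, &flags);
        if (timer_pending(timer))
                detach_timer(timer, 0);

        new_base = this_cpu_timer_base();       /* hypothetical helper, see above */
        if (base != new_base) {
                if (base->running_timer == timer) {
                        /* handler still running there: keep the old base */
                        new_base = base;
                } else {
                        /* base == NULL keeps the timer "locked", see lock_timer_base() */
                        timer->base = NULL;
                        spin_unlock(&base->lock);
                        spin_lock(&new_base->lock);
                        timer->base = new_base;
                }
        }

        timer->expires = expires;
        /* ... link the timer into new_base's timer wheel ... */
        spin_unlock_irqrestore(&new_base->lock, flags);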

    __run_timers() and del_timer() do not change ->base anymore; they only clear
    the pending flag.

    So del_timer_sync() can test timer->base->running_timer == timer to detect
    whether it is running or not.

    We don't need timer_list->lock anymore, this patch kills it.

    We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
    before clearing the timer's pending flag. It was needed because __mod_timer()
    did not lock old_base if the timer was not pending, so __mod_timer()->list_add()
    could race with del_timer()->list_del(). With this patch these functions are
    serialized through base->lock.

    One problem: TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
    adds a global

        struct timer_base_s {
                spinlock_t lock;
                struct timer_list *running_timer;
        } __init_timer_base;

    which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
    struct are replaced by struct timer_base_s t_base.

    It is indeed ugly. But this can't have scalability problems. The global
    __init_timer_base.lock is used only when __mod_timer() is called for the first
    time AND the timer was compile time initialized. After that the timer migrates
    to the local CPU.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Signed-off-by: Renaud Lienhart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Make the timer frequency selectable. The timer interrupt may cause bus
    and memory contention in large NUMA systems since the interrupt occurs
    on each processor HZ times per second.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Jun, 2005

6 commits

  • With Chris Wedgwood

    As suggested by Chris, we can make the "just added" ->release method
    conditional on UML only (better: on archs requesting it, i.e. currently only
    UML), so that other archs don't get this unneeded crud, and if UML no
    longer needs it we can kill it.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    CC: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • With Chris Wedgwood

    Currently UML must explicitly call the UML-specific
    free_irq_by_irq_and_dev() for each free_irq() call it has made.

    This is needed because ->shutdown and/or ->disable are only called when the
    last "action" for that irq is removed.

    Instead, for UML shared IRQs (UML IRQs are very often, if not always,
    shared), some setup is done for each dev_id, and it must be cleared when
    that fd is released. For instance, for each open console a new instance
    (i.e. a new dev_id) of the same IRQ is requested.

    Specifically, an fd is stored in an array (pollfds), which is later read by
    a host thread and passed to poll(). Each event registered by poll() triggers
    an interrupt. So, for each free_irq() we must remove the corresponding
    host fd from the table, which we do via this ->release() method.

    In this patch we add an appropriate hook for this, point it at the said
    procedure, and drop the explicit calls to it; this is safe because the
    hook now invokes the same procedure whenever an action is freed.
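
    Roughly, the hook amounts to this (a sketch, not the literal diff; the
    config symbol name is my best recollection and may differ):

        /* in the generic hw_interrupt_type, built only where an arch asks for it */
        #ifdef CONFIG_IRQ_RELEASE_METHOD
                void (*release)(unsigned int irq, void *dev_id);
        #endif

        /* in free_irq(), for every action removed (not just the last one) */
        #ifdef CONFIG_IRQ_RELEASE_METHOD
                if (desc->handler->release)
                        desc->handler->release(irq, dev_id);
        #endif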

    Also some cosmetic improvements are included.

    This is heavily based on some work by Chris Wedgwood, which however didn't
    get merged because of what I'd call a "misunderstanding" (the need for this
    patch wasn't clearly explained, so adding the generic hook was felt to be
    undesirable).

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    CC: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
    came in, try_to_unmap_one knows the vma without needing find_vma. But add
    a comment to note that here vma is inserted without mmap_sem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change new areas were always
    searched from the base address on.

    So now new small areas clutter holes of all sizes throughout the
    whole mmap-able region, whereas before small areas tended to close up
    holes near the base, leaving holes far from the base large and
    available for larger requests.

    2) the free_area_cache also is set to the location of the last
    munmap-ed area, so in scenarios where we allocate e.g. five regions of
    1K each, then free regions 4, 2, 3 in this order, the next request for 1K
    will be placed in the position of the old region 3, whereas before we
    appended it to the still active region 1, placing it at the location
    of the old region 2. Before we had one free region of 2K, now we only
    get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache descriptor,
    cached_hole_size, that contains the largest known hole size below the
    current free_area_cache. If a new request comes in, the size is compared
    against the cached_hole_size and, if the request can be filled with a hole
    below free_area_cache, the search is started from the base instead.
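
    The check at the top of the search then becomes, in sketch form
    (simplified; field names as described above):

        if (len > mm->cached_hole_size) {
                /* no known hole below the cache fits: resume where we left off */
                start_addr = addr = mm->free_area_cache;
        } else {
                /* a hole below free_area_cache may fit: restart from the base */
                start_addr = addr = TASK_UNMAPPED_BASE;
                mm->cached_hole_size = 0;
        }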

    The results look promising: whereas 2.6.12-rc4 fragments quickly and my
    (earlier posted) leakme.c test program terminates after 50000+ iterations
    with 96 distinct and fragmented maps in /proc/self/maps, it performs nicely
    (as expected) with thread creation: Ingo's test_str02 with 20000 threads
    requires 0.7s of system time.

    Taking out Ingo's patch (un-patch available per request) by basically
    deleting all mentions of free_area_cache from the kernel and starting the
    search for new memory always at the respective bases, we observe: leakme
    terminates successfully with 11 distinct, hardly fragmented areas in
    /proc/self/maps, but thread creation is grindingly slow: 30+s(!) of system
    time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) the appended patch works fine with leakme: it ends with
    only 7 distinct areas in /proc/self/maps and also thread creation seems
    sufficiently fast with 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
     
  • This is the core of the (much simplified) early reclaim. The goal of this
    patch is to reclaim some easily-freed pages from a zone before falling back
    onto another zone.

    One of the major uses of this is NUMA machines. With the default allocator
    behavior the allocator would look for memory in another zone, which might be
    off-node, before trying to reclaim from the current zone.

    This adds a zone tuneable to enable early zone reclaim. It is selected on a
    per-zone basis and is turned on/off via syscall.
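
    Conceptually the allocator's fallback loop gains one step (a pseudo-sketch
    only; try_alloc_from_zone() and zone_reclaim_enabled() are hypothetical
    names standing in for the real helpers and the per-zone tuneable):

        struct zone **z;

        for (z = zonelist->zones; *z != NULL; z++) {
                struct page *page = try_alloc_from_zone(*z, order);     /* hypothetical */

                if (page == NULL && zone_reclaim_enabled(*z)) {         /* hypothetical */
                        zone_reclaim(*z, gfp_mask, order);      /* new zone-local reclaim */
                        page = try_alloc_from_zone(*z, order);
                }
                if (page != NULL)
                        return page;
                /* else fall back to the next, possibly off-node, zone */
        }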

    Adding some extra throttling on the reclaim was also required (patch
    4/4). Without it, the machine would grind to a crawl when doing a "make -j"
    kernel build. Even with this patch the system time is higher on
    average, but it seems tolerable. Here are some numbers for kernbench
    runs on a 2-node, 4-CPU, 8GB RAM Altix in the "make -j" run:

                                wall  user  sys  %cpu  ctx sw.  sleeps
                                ----  ----  ---  ----  -------  ------
    No patch                    1009  1384  847   258   298170  504402
    w/patch, no reclaim          880  1376  667   288   254064  396745
    w/patch & reclaim           1079  1385  926   252   291625  548873

    These numbers are the average of 2 runs of 3 "make -j" runs done right
    after system boot. Run-to-run variability for "make -j" is huge, so
    these numbers aren't terribly useful except to see that with reclaim
    the benchmark still finishes in a reasonable amount of time.

    I also looked at the NUMA hit/miss stats for the "make -j" runs and the
    reclaim doesn't make any difference when the machine is thrashing away.

    Doing a "make -j8" on a single node that is filled with page cache pages
    takes 700 seconds with reclaim turned on and 735 seconds without reclaim
    (due to remote memory accesses).

    The simple zone_reclaim syscall program is at
    http://www.bork.org/~mort/sgi/zone_reclaim.c

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • This patch implements a number of smp_processor_id() cleanup ideas that
    Arjan van de Ven and I came up with.

    The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
    spaghetti was hard to follow both on the implementational and on the
    usage side.

    Some of the complexity arose from picking wrong names, some of the
    complexity comes from the fact that not all architectures defined
    __smp_processor_id.

    In the new code, there are two externally visible symbols:

    - smp_processor_id(): debug variant.

    - raw_smp_processor_id(): nondebug variant. Replaces all existing
    uses of _smp_processor_id() and __smp_processor_id(). Defined
    by every SMP architecture in include/asm-*/smp.h.

    There is one new internal symbol, dependent on DEBUG_PREEMPT:

    - debug_smp_processor_id(): internal debug variant, mapped to
    smp_processor_id().
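
    The resulting mapping in the generic header is roughly (a sketch):

        #ifdef CONFIG_DEBUG_PREEMPT
          extern unsigned int debug_smp_processor_id(void);
        # define smp_processor_id() debug_smp_processor_id()
        #else
        # define smp_processor_id() raw_smp_processor_id()
        #endif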

    Also, I moved debug_smp_processor_id() from lib/kernel_lock.c into a new
    lib/smp_processor_id.c file. All related comments got updated and/or
    clarified.

    I have build/boot tested the following 8 .config combinations on x86:

    {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

    I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
    architectures are untested, but should work just fine.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

21 Jun, 2005

1 commit


18 Jun, 2005

2 commits


14 Jun, 2005

1 commit

  • On one path, cond_resched_lock() fails to return true if it dropped the lock.
    We think this might be causing the crashes in JBD's log_do_checkpoint().
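
    A sketch of the repaired function (close to, but not necessarily identical
    with, the final fix): both unlock paths now report that the lock was
    dropped.

        int cond_resched_lock(spinlock_t *lock)
        {
                int ret = 0;

                if (need_lockbreak(lock)) {
                        spin_unlock(lock);
                        cpu_relax();
                        ret = 1;                /* previously lost on this path */
                        spin_lock(lock);
                }
                if (need_resched()) {
                        _raw_spin_unlock(lock);
                        preempt_enable_no_resched();
                        __cond_resched();
                        ret = 1;
                        spin_lock(lock);
                }
                return ret;
        }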

    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

02 Jun, 2005

1 commit


01 Jun, 2005

1 commit

  • flush_icache_range() is used in two different situations - in binfmt_elf.c &
    co for user space mappings, and in module.c for kernel modules. On m68k
    flush_icache_range() doesn't know which data to flush, as it has separate
    address spaces and the pointer argument can be valid in either address
    space.

    First I considered splitting flush_icache_range(), but this patch is
    simpler. Setting the correct context gives flush_icache_range() enough
    information to flush the correct data.
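
    On the module.c side the idea is simply to switch to the kernel address
    space around the flush (a sketch):

        mm_segment_t old_fs = get_fs();

        /* flush the icache in the correct (kernel) context */
        set_fs(KERNEL_DS);
        if (mod->module_init)
                flush_icache_range((unsigned long)mod->module_init,
                                   (unsigned long)mod->module_init + mod->init_size);
        flush_icache_range((unsigned long)mod->module_core,
                           (unsigned long)mod->module_core + mod->core_size);
        set_fs(old_fs);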

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

29 May, 2005

1 commit

  • The "unhandled interrupts" catcher, note_interrupt(), increments a global
    desc->irq_count and grossly damages scaling of very large systems, e.g.,
    >192p ia64 Altix, because of this highly contented cacheline, especially
    for timer interrupts. 384p is severely crippled, and 512p is unuseable.

    All calls to note_interrupt() can be disabled by booting with "noirqdebug",
    but this disables the useful interrupt checking for all interrupts.

    I propose eliminating note_interrupt() for all per-CPU interrupts. This
    was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
    restructuring added a call to note_interrupt() for per-CPU interrupts.
    Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
    the desc->irq_count++ increment isn't atomic (which, if done, would make
    scaling even worse).
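
    Conceptually the change boils down to this guard (a sketch, not the
    literal diff):

        if (!(desc->status & IRQ_PER_CPU) && !noirqdebug)
                note_interrupt(irq, desc, action_ret);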

    Signed-off-by: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
     

27 May, 2005

2 commits

  • There is a race in the kernel cpuset code, between the code
    to handle notify_on_release, and the code to remove a cpuset.
    The notify_on_release code can end up trying to access a
    cpuset that has been removed. In the most common case, this
    causes a NULL pointer dereference from the routine cpuset_path.
    However all manner of bad things are possible, in theory at least.

    The existing code decrements the cpuset use count, and if the
    count goes to zero, processes the notify_on_release request,
    if appropriate. However, once the count goes to zero, unless we
    are holding the global cpuset_sem semaphore, there is nothing to
    stop another task from immediately removing the cpuset entirely,
    and recycling its memory.

    The obvious fix would be to always hold the cpuset_sem
    semaphore while decrementing the use count and dealing with
    notify_on_release. However we don't want to force a global
    semaphore into the mainline task exit path, as that might create
    a scaling problem.

    The actual fix is almost as easy - since this is only an issue
    for cpusets using notify_on_release, which the top level big
    cpusets don't normally need to use, only take the cpuset_sem
    for cpusets using notify_on_release.
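
    In sketch form, the release on the exit path then looks something like
    this (simplified, from the description above rather than the exact
    source):

        if (notify_on_release(cs)) {
                /* rare case: hold cpuset_sem so the cpuset can't vanish under us */
                down(&cpuset_sem);
                if (atomic_dec_and_test(&cs->count))
                        check_for_release(cs);
                up(&cpuset_sem);
        } else {
                /* common case: no global semaphore on the task exit path */
                atomic_dec(&cs->count);
        }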

    This code has been run for hours without a hiccup, while running
    a cpuset create/destroy stress test that could crash the existing
    kernel in seconds. This patch applies to the current -linus
    git kernel.

    Signed-off-by: Paul Jackson
    Acked-by: Simon Derr
    Acked-by: Dinakar Guniguntala
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Signed-off-by: David Woodhouse

    David Woodhouse
     

26 May, 2005

1 commit

  • While they were all just simple blobs it made sense to just free them
    as we walked through and logged them. Now that there are pointers to
    other objects which need refcounting, we might as well revert to
    _only_ logging them in audit_log_exit(), and put the code to free them
    properly in only one place -- in audit_free_aux().

    Signed-off-by: David Woodhouse

    David Woodhouse
     

25 May, 2005

1 commit

  • If SIGKILL does not have priority, we cannot instantly kill a task before
    it does some unexpected work. This can be critical, but we were unable to
    reproduce it easily until Heiko Carstens reported this problem on LKML.

    Signed-Off-By: Kirill Korotaev
    Signed-Off-By: Alexey Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

24 May, 2005

2 commits


22 May, 2005

2 commits

  • Move audit_serial() into audit.c and use it to generate serial numbers
    on messages even when there is no audit context from syscall auditing.
    This allows us to disambiguate audit records when more than one is
    generated in the same millisecond.
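
    A generator of the kind described might look like this (a sketch, not
    necessarily the exact function):

        unsigned int audit_serial(void)
        {
                static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED;
                static unsigned int serial = 0;

                unsigned long flags;
                unsigned int ret;

                spin_lock_irqsave(&serial_lock, flags);
                do {
                        ret = ++serial;
                } while (unlikely(!ret));       /* skip 0: it means "no serial number" */
                spin_unlock_irqrestore(&serial_lock, flags);

                return ret;
        }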

    Based on a patch by Steve Grubb after he observed the problem.

    Signed-off-by: David Woodhouse

    David Woodhouse
     
  • In _spin_unlock_bh(lock):

        do { \
                _raw_spin_unlock(lock); \
                preempt_enable(); \
                local_bh_enable(); \
                __release(lock); \
        } while (0)

    there is no reason to use preempt_enable() instead of a simple
    preempt_enable_no_resched().

    Since we know bottom halves are disabled, preempt_schedule() will always
    return at once (preempt_count!=0), and hence preempt_check_resched() is
    useless here...

    This fixes it by using preempt_enable_no_resched() instead of
    preempt_enable(), thus avoiding the useless preempt_check_resched()
    just before re-enabling bottom halves.
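
    The sequence then becomes (same block as above, with the single
    substitution):

        do { \
                _raw_spin_unlock(lock); \
                preempt_enable_no_resched(); \
                local_bh_enable(); \
                __release(lock); \
        } while (0)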

    Signed-off-by: Samuel Thibault
    Signed-off-by: Linus Torvalds

    Samuel Thibault
     

21 May, 2005

4 commits

  • The attached patch changes all occurrences of loginuid to auid. It also
    changes the format to %u for everything that is an unsigned type.

    Signed-off-by: Steve Grubb
    Signed-off-by: David Woodhouse

    Steve Grubb
     
  • The original AVC_USER message wasn't consolidated with the new range of
    user messages. The attached patch fixes the kernel so the old messages
    work again.

    Signed-off-by: Steve Grubb
    Signed-off-by: David Woodhouse

    Steve Grubb
     
  • This patch changes the SELinux AVC to defer logging of paths to the audit
    framework upon syscall exit, by saving a reference to the (dentry,vfsmount)
    pair in an auxiliary audit item on the current audit context for processing
    by audit_log_exit.

    Signed-off-by: Stephen Smalley
    Signed-off-by: David Woodhouse

    Stephen Smalley
     
  • This patch removes the entwining of cpusets and hotplug code in the "No
    more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().

    Since the hotplug code is holding a spinlock at this point, we cannot take
    the cpuset semaphore, cpuset_sem, as would seem to be required either to
    update the task's cpuset, or to scan up the nested cpuset chain, looking for
    the nearest cpuset ancestor that still has some CPUs that are online. So
    we just punt and blast the task's cpus_allowed with all bits allowed.
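
    In code, the punt is essentially (a sketch):

        /* No more Mr. Nice Guy: allow every CPU and pick any online one. */
        cpus_setall(tsk->cpus_allowed);
        dest_cpu = any_online_cpu(tsk->cpus_allowed);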

    This reverts these lines of code to what they were before the cpuset patch.
    And it updates the cpuset Doc file, to match.

    The one known alternative to this that seems to work came from Dinakar
    Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
    much earlier in its processing. So far as we know, the increased locking
    entanglement between cpusets and hotplug of this alternative approach is
    not worth it in this case.

    Signed-off-by: Paul Jackson
    Acked-by: Nathan Lynch
    Acked-by: Dinakar Guniguntala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

19 May, 2005

5 commits


18 May, 2005

2 commits

  • It's silly to have to add explicit entries for new userspace messages
    as we invent them. Just treat all messages in the user range the same.

    Signed-off-by: David Woodhouse

    David Woodhouse
     
  • This patch includes various tweaks in the messaging that appears during
    system pm state transitions:

    * Warn about certain illegal calls in the device tree, like resuming
    child before parent or suspending parent before child. This could
    happen easily enough through sysfs, or in some cases when drivers
    use device_pm_set_parent().

    * Be more consistent about dev_dbg() tracing ... do it for resume() and
    shutdown() too, and never if the driver doesn't have that method.

    * Say which type of system sleep state is being entered.

    Except for the warnings, these only affect debug messaging.

    Signed-off-by: David Brownell
    Acked-by: Pavel Machek
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     

17 May, 2005

4 commits

  • profile=schedule parsing is not quite what it should be. First, str[7] is
    'e', not ',', but then even if it did fall through, prof_on =
    SCHED_PROFILING would be clobbered inside if (get_option(...)). So a small
    amount of rearrangement is done in this patch to correct it.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
     
  • Move add_preferred_console out of CONFIG_PRINTK so serial console does the
    right thing.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • On my IA64 machine, after kernel 2.6.12-rc3 boots, an edge-triggered
    interrupt (IRQ 46) keeps being triggered over and over again. There is no
    IRQ 46 interrupt action handler. This has a big impact on performance.

    Kernel 2.6.10 and its prior versions do not have the problem. Basically,
    kernel 2.6.10 will mask the spurious edge interrupt if the interrupt is
    triggered for the second time and its status includes
    IRQ_DISABLED|IRQ_PENDING.

    Originally, the IA64 kernel had its own specific _irq_desc definitions in
    file arch/ia64/kernel/irq.c. That definition initialized
    _irq_desc[irq].status to IRQ_DISABLED. Since kernel 2.6.11 this has lived
    in architecture-independent code, i.e. kernel/irq/handle.c, but
    kernel/irq/handle.c initializes the descriptor status to 0 instead of
    IRQ_DISABLED.
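
    The fix, then, is essentially to restore that initial status in the
    generic table (a sketch of what the static initializer would look like):

        irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = {
                [0 ... NR_IRQS-1] = {
                        .status  = IRQ_DISABLED,
                        .handler = &no_irq_type,
                        .lock    = SPIN_LOCK_UNLOCKED
                }
        };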

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     
  • Signed-off-by: David Woodhouse

    David Woodhouse