03 Feb, 2009

3 commits

  • Conflicts:
    drivers/net/Kconfig

    David S. Miller
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched_rt: don't use first_cpu on cpumask created with cpumask_and
    sched: fix buddie group latency
    sched: clear buddies more aggressively
    sched: symmetric sync vs avg_overlap
    sched: fix sync wakeups
    cpuset: fix possible deadlock in async_rebuild_sched_domains

    Linus Torvalds
     
  • Current refcounting for modules (done if CONFIG_MODULE_UNLOAD=y) is
    using a lot of memory.

    Each 'struct module' contains an [NR_CPUS] array of full cache lines.

    This patch uses existing infrastructure (percpu_modalloc() &
    percpu_modfree()) to allocate percpu space for the refcount storage.

    Instead of wasting NR_CPUS*128 bytes (on i386), we now use
    nr_cpu_ids*sizeof(local_t) bytes.

    On a typical distro, where NR_CPUS=8, shipping 2000 modules, we reduce
    the size of module files by about 2 MB (1 KB per module).

    Instead of having all refcounters in the same memory node - with TLB
    misses because of vmalloc() - this new implementation also has better
    NUMA properties, since each CPU uses storage on its preferred node,
    thanks to percpu storage.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

01 Feb, 2009

10 commits


31 Jan, 2009

4 commits

  • Impact: prevent false positive WARN_ON() in clockevents_program_event()

    clock_was_set() changes the base->offset of CLOCK_REALTIME and
    enforces the reprogramming of the clockevent device to expire timers
    which are based on CLOCK_REALTIME. If the clock change is large enough
    then the subtraction of the timer expiry value and base->offset can
    become negative which triggers the warning in
    clockevents_program_event().

    Check the subtraction result and set a negative value to 0.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: fix CPU hotplug hang on Power6 testbox

    On architectures that support offlining all cpus (at least
    powerpc/pseries), hot-unplugging the tick_do_timer_cpu can result in a
    system hang.

    This comes from the fact that if the cpu going down happens to be the
    cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
    cpu is dead (via the CPU_DEAD notification), we're left without ticks,
    jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
    That's particularly the case for the cpu looping in __cpu_die() waiting
    for the dying cpu to be dead.

    This patch addresses this by having the tick_do_timer_cpu handover happen
    earlier during the CPU_DYING notification. For this, a new clockevent
    notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
    in hrtimer_cpu_notify().

    Signed-off-by: Sebastien Dugue
    Cc:
    Signed-off-by: Ingo Molnar

    Sebastien Dugue
     
  • Impact: avoid timer IRQ hanging slow systems

    While using the function graph tracer on a virtualized system, the
    hrtimer_interrupt can hang the system in an infinite loop.

    This can be caused in several situations:

    - the hardware is very slow and HZ is set too high

    - something intrusive is slowing the system down (tracing under emulation)

    ... and the next clock events to program are always before the current time.

    This patch implements a reasonable compromise: if such a situation is
    detected, we cap the hrtimer interrupt's share of CPU time at 1/4.
    This is enough to let the system run without serious starvation.

    It has been successfully tested under VirtualBox with 1000 HZ and 100
    HZ with the function graph tracer launched. In both cases, the clock
    event intervals were increased to about 25 ms periodic ticks, which
    means 40 HZ.

    So we turn a hard-to-debug hang into a warning message and a system
    that still manages to limp along.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • smp_call_function() can be passed a wait parameter telling it to
    wait for all the functions running on other CPUs to complete before
    returning, or to return without waiting. Unfortunately, this is
    currently just a suggestion and not mandatory. That is,
    smp_call_function() can decide to wait even when asked not to.

    The reason is that it uses kmalloc to allocate storage to send to the
    called CPU, and that CPU will free it when it is done. But if we fail
    to allocate the storage, the stack is used instead, which means we
    must wait for the called CPU to finish before continuing.

    Unfortunately, some callers do not abide by this hint and act as if
    the no-wait option were mandatory. The MTRR code, for instance, will
    deadlock if smp_call_function is set to wait: smp_call_function will
    wait for the other CPUs to finish their called functions, but those
    functions are waiting on the caller to continue.

    This patch changes the generic smp_call_function code to use per cpu
    variables if the allocation of the data fails for a single CPU call. The
    smp_call_function_many will fall back to the smp_call_function_single
    if it fails its alloc. The smp_call_function_single is modified
    to not force the wait state.

    Since we now use a single data element per cpu, we must synchronize
    the callers to prevent a second caller from modifying the data before
    the first caller's IPI functions complete. To do so, I added a flag to
    the call_single_data called CSD_FLAG_LOCK. When the single-CPU path is
    used (which can also happen when a many-CPU call fails its
    allocation), we set the LOCK bit on this per cpu data. When the caller
    finishes, it clears the LOCK bit.

    The caller must wait until the LOCK bit is cleared before setting it
    again. When it is cleared, no IPI function is using the data.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Acked-by: Jens Axboe
    Acked-by: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

30 Jan, 2009

4 commits

  • root_count was being incremented in cgroup_get_sb() after all error
    checking was complete, but decremented in cgroup_kill_sb(), which can be
    called on a superblock that we gave up on due to an error. This patch
    changes cgroup_kill_sb() to only decrement root_count if the root was
    previously linked into the list of roots.

    Signed-off-by: Paul Menage
    Tested-by: Serge Hallyn
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • css_tryget() and cgroup_clear_css_refs() contain polling loops; these
    loops should have cpu_relax() calls in them to reduce cross-cache
    traffic.

    Signed-off-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • I fixed a bug in cgroup_clone() in Linus' tree in commit 7b574b7
    ("cgroups: fix a race between cgroup_clone and umount") without noticing
    there was a cleanup patch in -mm tree that should be rebased (now commit
    104cbd5, "cgroups: use task_lock() for access tsk->cgroups safe in
    cgroup_clone()"), thus resulted in lock inconsistency.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • cgrp->sibling is now handled under the hierarchy mutex; the error
    path should do so, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jan, 2009

2 commits


27 Jan, 2009

8 commits

  • * 'hibern_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    SATA PIIX: Blacklist system that spins off disks during ACPI power off
    SATA Sil: Blacklist system that spins off disks during ACPI power off
    SATA AHCI: Blacklist system that spins off disks during ACPI power off
    SATA: Blacklisting of systems that spin off disks during ACPI power off
    DMI: Introduce dmi_first_match to make the interface more flexible
    Hibernation: Introduce system_entering_hibernation

    Linus Torvalds
     
  • Introduce boolean function system_entering_hibernation() returning
    'true' during the last phase of hibernation, in which devices are
    being put into low power states and the sleep state (for example,
    ACPI S4) is finally entered.

    Some device drivers need such a function to check if the system is
    in the final phase of hibernation. In particular, some SATA drivers
    are going to use it for blacklisting systems in which the disks
    should not be spun down during the last phase of hibernation (the
    BIOS will do that anyway).

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Jeff Garzik

    Rafael J. Wysocki
     
  • With print-fatal-signals=1 on a kernel with CONFIG_PREEMPT=y, sending an
    unexpected signal to a process causes a BUG: using smp_processor_id() in
    preemptible code.

    get_signal_to_deliver() releases the siglock before calling
    print_fatal_signal(), which calls show_regs(), which calls
    smp_processor_id(), which is not supposed to be called from a
    preemptible thread.

    Make sure show_regs() runs with preemption disabled.

    Signed-off-by: Ed Swierk
    Signed-off-by: Ingo Molnar

    Ed Swierk
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fixes:
    kbuild: fix kbuild.txt typos
    kbuild: print usage with no arguments in scripts/config
    Revert "kbuild: strip generated symbols from *.ko"

    Linus Torvalds
     
  • * 'sh/for-2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (22 commits)
    dma-coherent: Restore dma_alloc_from_coherent() large alloc fall back policy.
    dma-coherent: per-device coherent area is in pages, not bytes.
    sh: fix unaligned and nonexistent address handling
    nommu: Stub in vm_map_ram()/vm_unmap_ram()/vm_unmap_aliases().
    sh: fix sh-sci / early printk build on sh7723
    sh: export the sh7343 JPU to user space
    sh: update defconfigs.
    serial: sh-sci: Fix up SH7720/SH7721 SCI build.
    sh: Kill off obsolete busses from arch/sh/Kconfig.
    sh: sh7785lcr/highlander/hp6xx need linux/irq.h.
    sh: Migo-R MMC support using spi_gpio and mmc_spi.
    sh: ap325rxa MMC support using spi_gpio and mmc_spi
    sh: mach-x3proto: needs linux/irq.h.
    sh: Drop the BKL from sys_execve() on SH-5.
    sh: convert rsk7203 to use smsc911x.
    sh: convert magicpanelr2 platform to use smsc911x.
    sh: convert ap325rxa platform to use smsc911x.
    sh: mach-migor: Add tw9910 support.
    sh: mach-migor: Delete soc_camera_platform setup.
    sh: mach-migor: Add ov772x support.
    ...

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    debugobjects: add and use INIT_WORK_ON_STACK
    rcu: remove duplicate CONFIG_RCU_CPU_STALL_DETECTOR
    relay: fix lock imbalance in relay_late_setup_files
    oprofile: fix uninitialized use of struct op_entry
    rcu: move Kconfig menu
    softlock: fix false panic which can occur if softlockup_thresh is reduced
    rcu: add __cpuinit to rcu_init_percpu_data()

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    hrtimers: fix inconsistent lock state on resume in hres_timers_resume
    time-sched.c: tick_nohz_update_jiffies should be static
    locking, hpet: annotate false positive warning
    kernel/fork.c: unused variable 'ret'
    itimers: remove the per-cpu-ish-ness

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
    xen: unitialised return value in xenbus_write_transaction
    x86: fix section mismatch warning
    x86: unmask CPUID levels on Intel CPUs, fix
    x86: work around PAGE_KERNEL_WC not getting WC in iomap_atomic_prot_pfn.
    x86: use standard PIT frequency
    xen: handle highmem pages correctly when shrinking a domain
    x86, mm: fix pte_free()
    xen: actually release memory when shrinking domain
    x86: unmask CPUID levels on Intel CPUs
    x86: add MSR_IA32_MISC_ENABLE bits to <asm/msr-index.h>
    x86: fix PTE corruption issue while mapping RAM using /dev/mem
    x86: mtrr fix debug boot parameter
    x86: fix page attribute corruption with cpa()
    Revert "x86: signal: change type of paramter for sys_rt_sigreturn()"
    x86: use early clobbers in usercopy*.c
    x86: remove kernel_physical_mapping_init() from init section
    fix: crash: IP: __bitmap_intersects+0x48/0x73
    cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
    work_on_cpu: Use our own workqueue.
    work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
    ...

    Linus Torvalds
     

22 Jan, 2009

2 commits


21 Jan, 2009

7 commits

  • Impact: trace max latencies on start of latency tracing

    This patch sets the max latency to zero whenever one of the
    irq variant tracers or the wakeup tracer is set to current tracer.

    Most developers expect to see output when starting up a latency
    tracer. But since max_latency is already set to max, and it takes a
    latency greater than max_latency to be recorded, there is no trace.
    This is not the expected behavior and has even confused me.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Impact: limit ftrace dump output

    Currently ftrace_dump only calls ftrace_kill, which is a fast way to
    prevent the function tracer functions from being called (it just sets
    a flag and clears the function to call, nothing else). It is better to
    turn off any recording to the ring buffers as well.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Impact: fix to print out ftrace_dump when expected

    I was debugging a hard race condition, only to find out that after I
    hit the race, my log level was not set high enough to show KERN_INFO
    messages. The time it took to trigger the race was wasted because I
    did not capture the trace.

    Since ftrace_dump is only called from kernel oops (and only when
    it is set in the kernel command line to do so), or when a
    developer adds it to their own local tree, the log level of
    the print should be at KERN_EMERG to make sure the print appears.

    ftrace_dump is not called by a normal user setup, and will not
    add extra unwanted print out to the console. There is no reason
    it should be at KERN_INFO.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Impact: reset struct buffer_page.write on interrupt storm

    If struct buffer_page.write is not reset, any subsequent commit will
    corrupt the ring_buffer:

    static inline void
    rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
    {
            ...
            cpu_buffer->commit_page->commit =
                    cpu_buffer->commit_page->write;
            ...
    }

    When "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))" hits, the
    ring_buffer is disabled, but some reserved buffers may not have been
    committed; we need to reset struct buffer_page.write.

    When "if (unlikely(next_page == cpu_buffer->commit_page))" hits, the
    ring_buffer is still available, and we should not corrupt it.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Impact: fix a crash during kernel image restore

    When the function graph tracer is running and we suspend to disk, some
    racy and dangerous things happen with this tracer.

    The current task saves its registers, including the stack pointer,
    which contains the return address hooked by the tracer. But the
    current task then continues into other functions to save the memory
    image, storing further return addresses, and finally loses the old
    depth that matches the return address saved in the old stack (during
    the register save).

    So on image restore, the code will return to wrong addresses. And
    there is more: on restore, the task has its "current" pointer
    overwritten during register restore... switching from one task to
    another... It would be insane to try to trace function graphs at
    these stages.

    This patch makes the function graph tracer listen to power events,
    disabling its tracing for the current task (the one that performs the
    hibernation work) during suspend/resume to disk, making tracing safe
    during hibernation.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • When doing large allocations (larger than the per-device coherent
    area), the code silently falls back on the generic memory allocators,
    without regard for the per-device constraints.

    In the DMA_MEMORY_EXCLUSIVE case falling back on generic memory is not
    an option, as it tends not to be addressable by the DMA hardware in
    question. This issue showed up with the 8139too breakage on the
    Dreamcast, where non-addressable buffers were silently allocated due to
    the size mismatch calculation -- while it should have simply errored out
    upon being unable to satisfy the allocation with the given device
    constraints.

    This restores the fallback behaviour to what it was before the
    oversized request change caused multiple regressions.

    Signed-off-by: Paul Mundt

    Paul Mundt
     
  • Commit 58c6d3dfe436eb8cfb451981d8fdc9044eaf42da ("dma-coherent: catch
    oversized requests to dma_alloc_from_coherent()") attempted to add a
    sanity check to bail out on allocations larger than the coherent area.

    Unfortunately when this was implemented, the fact that the coherent
    area is tracked in pages rather than bytes was overlooked, which
    subsequently broke every single dma_alloc_from_coherent() user,
    silently forcing the allocation through generic memory instead.

    Signed-off-by: Adrian McMenamin
    Signed-off-by: Paul Mundt

    Adrian McMenamin