03 Apr, 2010

3 commits

  • This patch just states the fact that the cpusets/cpuhotplug interaction
    is broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T, and then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever, because the CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This is left over from commit 7c9414385e ("sched: Remove USER_SCHED")

    Signed-off-by: Li Zefan
    Acked-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    Cc: David Howells
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • Merge reason: update to latest upstream

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Mar, 2010

6 commits


27 Mar, 2010

6 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
    x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1
    x86/PCI: for host bridge address space collisions, show conflicting resource
    frv/PCI: remove redundant warnings
    x86/PCI: remove redundant warnings
    PCI: don't say we claimed a resource if we failed
    PCI quirk: Disable MSI on VIA K8T890 systems
    PCI quirk: RS780/RS880: work around missing MSI initialization
    PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present
    PCI: complain about devices that seem to be broken
    PCI: print resources consistently with %pR
    PCI: make disabled window printk style match the enabled ones
    PCI: break out primary/secondary/subordinate for readability
    PCI: for address space collisions, show conflicting resource
    resources: add interfaces that return conflict information
    PCI: cleanup error return for pcix get and set mmrbc functions
    PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
    PCI: kill off pci_register_set_vga_state() symbol export.
    PCI: fix return value from pcix_get_max_mmrbc()

    Linus Torvalds
     
  • Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    time: Fix accumulation bug triggered by long delay.
    posix-cpu-timers: Reset expire cache when no timer is running
    timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
    clockevents: Sanitize min_delta_ns adjustment and prevent overflows

    Linus Torvalds
     
  • Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align

    Linus Torvalds
     
  • Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Use proper type in sched_getaffinity()
    kernel/sched.c: Suppress unused var warning
    sched: sched_getaffinity(): Allow less than NR_CPUS length

    Linus Torvalds
     
  • Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Move two IRQ functions from .init.text to .text
    genirq: Protect access to irq_desc->action in can_request_irq()
    genirq: Prevent oneshot irq thread race

    Linus Torvalds
     
  • Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Remove excessive early_res debug output
    softlockup: Stop spurious softlockup messages due to overflow
    rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives
    rcu: Fix tracepoints & lockdep false positive
    rcu: Make rcu_read_lock_bh_held() allow for disabled BH

    Linus Torvalds
     

25 Mar, 2010

3 commits


24 Mar, 2010

3 commits


23 Mar, 2010

1 commit

  • The logarithmic accumulation done in the timekeeping has some overflow
    protection that limits the max shift value. That means it will take
    more than shift loops to accumulate all of the cycles. This causes
    the shift decrement to underflow, which causes the loop to never exit.

    The simplest fix would be to simply do:
    if (shift)
            shift--;

    However that is not optimal. Since we know the cycle offset is larger
    than interval << shift, the above would make shift drop to zero, and
    then we would be spinning for quite a while, accumulating one interval
    chunk at a time.

    Instead, this patch only decreases shift if the offset is smaller
    than cycle_interval << shift. This makes sure we accumulate using
    the largest chunks possible without overflowing tick_length, and limits
    the number of iterations through the loop.
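
    A minimal, self-contained sketch of that rule (plain C with made-up
    values, not the actual kernel/time/timekeeping.c code): accumulate in
    interval << shift chunks, and only shrink shift once a full chunk no
    longer fits in the remaining offset.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t cycle_interval = 1000;   /* cycles per tick (made up) */
            uint64_t offset = 9000000;        /* backlog after a long delay */
            unsigned int shift = 8;           /* capped by overflow protection */

            while (offset >= cycle_interval) {
                    uint64_t chunk = cycle_interval << shift;

                    if (offset >= chunk)
                            offset -= chunk;  /* accumulate one large chunk */
                    else if (shift > 0)
                            shift--;          /* shrink only when a chunk no longer fits */
            }
            printf("leftover: %llu cycles, final shift %u\n",
                   (unsigned long long)offset, shift);
            return 0;
    }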

    This issue was found and reported by Sonic Zhang, who also tested the fix.
    Many thanks for your explanation and testing!

    Reported-by: Sonic Zhang
    Signed-off-by: John Stultz
    Tested-by: Sonic Zhang
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    John Stultz
     

22 Mar, 2010

1 commit


19 Mar, 2010

2 commits

  • The ring buffer uses 4 byte alignment while recording events into the
    buffer, even on 64bit machines. This saves space when there are lots
    of events being recorded at 4 byte boundaries.

    The ring buffer has a zero copy method to write into the buffer: space
    is reserved and then committed. This may cause problems when an 8 byte
    word is written at a 4 byte (rather than 8 byte) alignment. For x86 and
    PPC this is not an issue, but on some architectures this would cause an
    out-of-alignment exception.

    This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
    if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
    it forces the ring buffer event header to be 8 bytes and not 4,
    and will align the length of the data to be 8 byte aligned.
    This keeps the data payload at 8 byte alignments and will allow these
    machines to run without issue.

    The trick to this is that the header can be either 4 bytes or 8 bytes
    depending on the length of the data payload. The 4 byte header
    has a length field that supports up to 112 bytes. If the length of
    the data is more than 112, the length field is set to zero, and the actual
    length is stored in the next 4 bytes after the header.

    When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
    a zero in the 4 byte header, so the length is always stored in the
    following 4 bytes, even for a small data load. It also forces the length
    of the data load to be 8 byte aligned. The combination of these two
    guarantees that the data is always at an 8 byte alignment.
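
    A hedged sketch of that header-size rule (macro and function names here
    are illustrative, not the actual ring buffer code; the 112 byte limit is
    taken from the text above):

    #include <stdio.h>
    #include <stddef.h>

    #ifdef HAVE_EFFICIENT_UNALIGNED_ACCESS  /* stand-in for the kernel option */
    # define RB_ALIGN     4                 /* events may be packed at 4 bytes */
    # define RB_FORCE_EXT 0                 /* short lengths can stay in the header */
    #else
    # define RB_ALIGN     8                 /* keep 64-bit payloads naturally aligned */
    # define RB_FORCE_EXT 1                 /* always spill the length after the header */
    #endif

    /* Round the payload length up to the required alignment. */
    static inline size_t rb_align_length(size_t len)
    {
            return (len + RB_ALIGN - 1) & ~(size_t)(RB_ALIGN - 1);
    }

    /* 4 byte header if the length fits and short headers are allowed;
     * otherwise the in-header length is zeroed and the real length goes
     * in the following 4 bytes, giving an 8 byte header. */
    static inline size_t rb_header_size(size_t len)
    {
            return (!RB_FORCE_EXT && len <= 112) ? 4 : 8;
    }

    int main(void)
    {
            printf("len 40:  header %zu, padded payload %zu\n",
                   rb_header_size(40), rb_align_length(40));
            printf("len 200: header %zu, padded payload %zu\n",
                   rb_header_size(200), rb_align_length(200));
            return 0;
    }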

    Tested-by: Frederic Weisbecker
    (on sparc64)
    Reported-by: Frederic Weisbecker
    Acked-by: David S. Miller
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    perf: Fix unexported generic perf_arch_fetch_caller_regs
    perf record: Don't try to find buildids in a zero sized file
    perf: export perf_trace_regs and perf_arch_fetch_caller_regs
    perf, x86: Fix hw_perf_enable() event assignment
    perf, ppc: Fix compile error due to new cpu notifiers
    perf: Make the install relative to DESTDIR if specified
    kprobes: Calculate the index correctly when freeing the out-of-line execution slot
    perf tools: Fix sparse CPU numbering related bugs
    perf_event: Fix oops triggered by cpu offline/online
    perf: Drop the obsolete profile naming for trace events
    perf: Take a hot regs snapshot for trace events
    perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
    perf/x86-64: Use frame pointer to walk on irq and process stacks
    lockdep: Move lock events under lockdep recursion protection
    perf report: Print the map table just after samples for which no map was found
    perf report: Add multiple event support
    perf session: Change perf_session post processing functions to take histogram tree
    perf session: Add storage for seperating event types in report
    perf session: Change add_hist_entry to take the tree root instead of session
    perf record: Add ID and to recorded event data when recording multiple events
    ...

    Linus Torvalds
     

17 Mar, 2010

2 commits

    perf_arch_fetch_caller_regs() is exported for the overridden x86
    version, but not for the generic weak version.

    As a general rule, weak functions should not have their symbol
    exported in the same file in which they are defined.

    So let's export it in trace_event_perf.c, as it is used by trace
    events only.

    This fixes:

    ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
    ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

    -v2: And also only build it if trace events are enabled.
    -v3: Fix changelog mistake
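
    For readers unfamiliar with the pattern, a minimal stand-alone
    illustration of a weak default definition (the function name is made up,
    and the in-kernel export itself is done with the EXPORT_SYMBOL family of
    macros from the using file, which a userspace sketch cannot show):

    #include <stdio.h>

    /* Weak default: used only if no strong definition is linked in; an
     * architecture-specific object file could silently override it. */
    __attribute__((weak)) void fetch_caller_regs_demo(void)
    {
            printf("generic weak default\n");
    }

    int main(void)
    {
            fetch_caller_regs_demo();
            return 0;
    }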

    Reported-by: Stephen Rothwell
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Paul Mackerras
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Using the proper type fixes the following compiler warning:

    kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast
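
    For context, a self-contained illustration of one common way this warning
    arises: the type check inside the kernel's min() macro. The macros below
    are simplified clones of the ones in <linux/kernel.h>, and the variables
    are made up, not the actual sched.c code.

    #include <stdio.h>
    #include <stddef.h>

    #define min(x, y) ({                                           \
            typeof(x) _min1 = (x);                                 \
            typeof(y) _min2 = (y);                                 \
            (void) (&_min1 == &_min2); /* warns if types differ */ \
            _min1 < _min2 ? _min1 : _min2; })

    #define min_t(type, x, y) ({                                   \
            type _min1 = (x);                                      \
            type _min2 = (y);                                      \
            _min1 < _min2 ? _min1 : _min2; })

    int main(void)
    {
            size_t len = 16;
            unsigned int limit = 32;

            /* min(len, limit) would trigger "comparison of distinct pointer
             * types lacks a cast"; forcing a common type avoids it. */
            printf("%zu\n", min_t(size_t, len, limit));
            return 0;
    }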

    Signed-off-by: KOSAKI Motohiro
    Cc: torvalds@linux-foundation.org
    Cc: travis@sgi.com
    Cc: peterz@infradead.org
    Cc: drepper@redhat.com
    Cc: rja@sgi.com
    Cc: sharyath@in.ibm.com
    Cc: steiner@sgi.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    KOSAKI Motohiro
     

16 Mar, 2010

3 commits

  • On UP:

    kernel/sched.c: In function 'wake_up_new_task':
    kernel/sched.c:2631: warning: unused variable 'cpu'

    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Andrew Morton
     
  • This was left over from "7c9414385e sched: Remove USER_SCHED"

    Signed-off-by: Dan Carpenter
    Acked-by: Dhaval Giani
    Cc: Kay Sievers
    Cc: Greg Kroah-Hartman
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     
  • Disabling BH can stand in for rcu_read_lock_bh(), and this patch
    updates rcu_read_lock_bh_held() to allow for this. In order to
    avoid include-file hell, this function is moved out of line to
    kernel/rcupdate.c.

    This fixes a false positive RCU warning.

    Reported-by: Arnd Bergmann
    Reported-by: Eric Dumazet
    Signed-off-by: Paul E. McKenney
    Acked-by: Lai Jiangshan
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

15 Mar, 2010

1 commit

  • [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]

    Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
    Unfortunately, the glibc sched interface has the following definition:

    # define __CPU_SETSIZE 1024
    # define __NCPUBITS (8 * sizeof (__cpu_mask))
    typedef unsigned long int __cpu_mask;
    typedef struct
    {
            __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
    } cpu_set_t;

    That means that if NR_CPUS is bigger than 1024, cpu_set_t creates an
    ABI issue ...

    More recently, Sharyathi Nagesh reported that the following test program
    triggers a mysterious syscall failure:

    -----------------------------------------------------------------------
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <errno.h>

    int main()
    {
            cpu_set_t set;
            if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
                    printf("\n Call is failing with:%d", errno);
    }
    -----------------------------------------------------------------------

    This is because the kernel assumes that the len argument of
    sched_getaffinity() is bigger than NR_CPUS, which is no longer
    a valid assumption.

    Now we are faced with the following annoying dilemma, due to
    the limitations of the glibc interface built in years ago:

    (1) if we change glibc's __CPU_SETSIZE definition, we lose
    binary compatibility for _all_ applications.

    (2) if we don't change it, we lose binary compatibility for
    Sharyathi's use case.

    I therefore propose to change the rule for the len argument of
    sched_getaffinity().

    Old:
    len should be bigger than NR_CPUS
    New:
    len should be bigger than the maximum possible cpu id

    This creates the following behavior:

    (A) On a real 4096-CPU machine, the above test program still
    returns -EINVAL.

    (B) With NR_CPUS=4096 but fewer than 1024 CPUs in the machine (almost
    all machines in the world), the above runs successfully.

    Fortunately, big SGI machines are mainly used for HPC, which means
    their users can rebuild their programs.

    IOW, we hope they are not annoyed by this issue ...
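
    As an illustration of living with either rule from userspace (a hedged
    sketch using glibc's CPU_ALLOC helpers, not part of the patch), a
    program can simply grow its cpu mask until the kernel accepts the
    length:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <errno.h>

    int main(void)
    {
            int ncpus = 64;                  /* initial guess, grown on EINVAL */

            for (;;) {
                    cpu_set_t *set = CPU_ALLOC(ncpus);
                    size_t size = CPU_ALLOC_SIZE(ncpus);

                    if (!set)
                            return 1;
                    if (sched_getaffinity(0, size, set) == 0) {
                            printf("affinity mask fits in %zu bytes\n", size);
                            CPU_FREE(set);
                            return 0;
                    }
                    CPU_FREE(set);
                    if (errno != EINVAL)     /* a real failure, not a short mask */
                            return 1;
                    ncpus *= 2;              /* mask too small for this kernel */
            }
    }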

    Reported-by: Sharyathi Nagesh
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Ulrich Drepper
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Jack Steiner
    Cc: Russ Anderson
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    KOSAKI Motohiro
     

14 Mar, 2010

4 commits

  • Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix pick_next_highest_task_rt() for cgroups
    sched: Cleanup: remove unused variable in try_to_wake_up()
    x86: Fix sched_clock_cpu for systems with unsynchronized TSC

    Linus Torvalds
     
  • Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     
  • Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Do not record user stack trace from NMI context
    tracing: Disable buffer switching when starting or stopping trace
    tracing: Use same local variable when resetting the ring buffer
    function-graph: Init curr_ret_stack with ret_stack
    ring-buffer: Move disabled check into preempt disable section
    function-graph: Add tracing_thresh support to function_graph tracer
    tracing: Update the comm field in the right variable in update_max_tr
    function-graph: Use comment notation for func names of dangling '}'
    function-graph: Fix unused reference to ftrace_set_func()
    tracing: Fix warning in s_next of trace file ops
    tracing: Include irqflags headers from trace clock

    Linus Torvalds
     
  • Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Provide generic perf_sample_data initialization
    MAINTAINERS: Add Arnaldo as tools/perf/ co-maintainer
    perf trace: Don't use pager if scripting
    perf trace/scripting: Remove extraneous header read
    perf, ARM: Modify kuser rmb() call to compile for Thumb-2
    x86/stacktrace: Don't dereference bad frame pointers
    perf archive: Don't try to collect files without a build-id
    perf_events, x86: Fixup fixed counter constraints
    perf, x86: Restrict the ANY flag
    perf, x86: rename macro in ARCH_PERFMON_EVENTSEL_ENABLE
    perf, x86: add some IBS macros to perf_event.h
    perf, x86: make IBS macros available in perf_event.h
    hw-breakpoints: Remove stub unthrottle callback
    x86/hw-breakpoints: Remove the name field
    perf: Remove pointless breakpoint union
    perf lock: Drop the buffers multiplexing dependency
    perf lock: Fix and add misc documentally things
    percpu: Add __percpu sparse annotations to hw_breakpoint

    Linus Torvalds
     

13 Mar, 2010

5 commits

  • A bug was found with Li Zefan's ftrace_stress_test that caused applications
    to segfault during the test.

    Placing a tracing_off() in the segfault code, and examining several
    traces, I found that the following was always the case. The lock tracer
    was enabled (lockdep being required) and userstack was enabled. Testing
    this out, I just enabled the two, but that was not good enough. I needed
    to run something else that could trigger it. Running a load like hackbench
    did not work, but executing a new program would. The following would
    trigger the segfault within seconds:

    # echo 1 > /debug/tracing/options/userstacktrace
    # echo 1 > /debug/tracing/events/lock/enable
    # while :; do ls > /dev/null ; done

    Enabling the function graph tracer and looking at what was happening,
    I finally noticed that all crashes happened just after an NMI.

    1) | copy_user_handle_tail() {
    1) | bad_area_nosemaphore() {
    1) | __bad_area_nosemaphore() {
    1) | no_context() {
    1) | fixup_exception() {
    1) 0.319 us | search_exception_tables();
    1) 0.873 us | }
    [...]
    1) 0.314 us | __rcu_read_unlock();
    1) 0.325 us | native_apic_mem_write();
    1) 0.943 us | }
    1) 0.304 us | rcu_nmi_exit();
    [...]
    1) 0.479 us | find_vma();
    1) | bad_area() {
    1) | __bad_area() {

    After capturing several traces of failures, all of them happened
    after an NMI. Curious about this, I added a trace_printk() to the NMI
    handler to read the regs->ip to see where the NMI happened, and I
    found out it was here:

    ffffffff8135b660 :
    ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp
    ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800

    What was happening is that the NMI would happen at the place that a page
    fault occurred. It would call rcu_read_lock() which was traced by
    the lock events, and the user_stack_trace would run. This would trigger
    a page fault inside the NMI. I do not see where the CR2 register is
    saved or restored in NMI handling. This means that it would corrupt
    the page fault handling that the NMI interrupted.

    The reason the while loop of ls helped trigger the bug was that
    each execution of ls would cause lots of pages to be faulted in,
    increasing the chances of the race happening.

    The simple solution is to not allow user stack traces in NMI context.
    After this patch, I ran the above "ls" test for a couple of hours
    without any issues. Without this patch, the bug would trigger in less
    than a minute.

    Cc: stable@kernel.org
    Reported-by: Li Zefan
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    When the trace iterator is read, tracing_start() and tracing_stop()
    are called to stop tracing while the iterator is processing the trace
    output.

    These functions disable both the standard buffer and the max latency
    buffer. But if the wakeup tracer is running, it can switch these
    buffers between the two disables:

    buffer = global_trace.buffer;
    if (buffer)
            ring_buffer_record_disable(buffer);

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    The ftrace code that resets the ring buffer references the buffer
    with a local variable, but then uses tr->buffer as the parameter to
    the reset. If the wakeup tracer is running, which can swap tr->buffer
    with the max saved buffer, this can break the requirement of disabling
    the buffer before the reset.

    buffer = tr->buffer;
    ring_buffer_record_disable(buffer);
    synchronize_sched();
    __tracing_reset(tr->buffer, cpu);

    If the tr->buffer is swapped, then the reset is not happening to the
    buffer that was disabled. This will cause the ring buffer to fail.
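
    The general shape of the fix, as a stand-alone sketch (plain C with
    made-up names, not the ftrace code): load the shared pointer once and
    use that same local for both the disable and the reset, so a concurrent
    swap cannot make the two steps act on different buffers.

    #include <stdatomic.h>
    #include <stdio.h>

    struct buf { int disabled; };

    static struct buf buf_a, buf_b;
    static _Atomic(struct buf *) current_buf = &buf_a;  /* may be swapped to &buf_b */

    void reset_buffer(void)
    {
            struct buf *b = atomic_load(&current_buf);   /* read the pointer once */

            b->disabled = 1;    /* disable ... */
            /* Racy shape (what the commit removes): re-reading current_buf
             * for the reset could hit the other buffer if a swap happened
             * after the disable:
             *     atomic_load(&current_buf)->disabled = 0;
             */
            b->disabled = 0;    /* ... and reset the same buffer we disabled */
    }

    int main(void)
    {
            reset_buffer();
            printf("buf_a.disabled=%d buf_b.disabled=%d\n",
                   buf_a.disabled, buf_b.disabled);
            return 0;
    }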

    Found with Li Zefan's ftrace_stress_test.

    Cc: stable@kernel.org
    Reported-by: Lai Jiangshan
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    If the graph tracer is active and a task is forked but the allocation of
    the process's graph stack fails, it can cause a crash later on.

    This is due to the temporary stack being NULL, but the curr_ret_stack
    variable is copied from the parent. If it is not -1, then in
    ftrace_graph_probe_sched_switch() the following:

    for (index = next->curr_ret_stack; index >= 0; index--)
            next->ret_stack[index].calltime += timestamp;

    will cause a kernel OOPS.

    Found with Li Zefan's ftrace_stress_test.

    Cc: stable@kernel.org
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The ring buffer resizing and resetting relies on a schedule RCU
    action. The buffers are disabled, a synchronize_sched() is called
    and then the resize or reset takes place.

    But this only works if the disabling of the buffers is done within the
    preempt-disabled section; otherwise there is a window in which the
    buffers can be written to while a reset or resize takes place.

    Cc: stable@kernel.org
    Reported-by: Li Zefan
    Signed-off-by: Lai Jiangshan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Lai Jiangshan