15 Apr, 2010

1 commit


11 Apr, 2010

1 commit

  • When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device
    improperly by old_decode_dev and it results in an error while
    hibernating with s2disk.

    All users already pass the new device number, so switch to
    new_decode_dev().

    Signed-off-by: Jiri Slaby
    Reported-and-tested-by: Jiri Kosina
    Signed-off-by: "Rafael J. Wysocki"

    Jiri Slaby
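
The difference between the two decoders can be sketched in userspace C. The bit layouts below follow the helpers in include/linux/kdev_t.h; the struct and the sketch_ names are illustrative stand-ins, not the kernel's own types, and this is not the hibernation code path itself.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a (major, minor) pair; not a kernel type. */
struct devnum {
    unsigned int major;
    unsigned int minor;
};

/* Old 16-bit layout: 8-bit major in the high byte, 8-bit minor in the low byte. */
static struct devnum sketch_old_decode_dev(uint16_t val)
{
    struct devnum d = { (val >> 8) & 255u, val & 255u };
    return d;
}

/* New 32-bit layout: 12-bit major, 20-bit minor split across two fields. */
static uint32_t sketch_new_encode_dev(struct devnum d)
{
    assert(d.major < (1u << 12) && d.minor < (1u << 20));
    return (d.minor & 0xffu) | (d.major << 8) | ((d.minor & ~0xffu) << 12);
}

static struct devnum sketch_new_decode_dev(uint32_t dev)
{
    struct devnum d;

    d.major = (dev & 0xfff00u) >> 8;
    d.minor = (dev & 0xffu) | ((dev >> 12) & 0xfff00u);
    return d;
}
```

A device number whose major or minor does not fit in 8 bits survives an encode/decode round trip through the new helpers, while the old decoder only sees the low 16 bits and returns a different device.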
     

08 Apr, 2010

1 commit


07 Apr, 2010

2 commits


06 Apr, 2010

5 commits

  • taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
    the following error:

    sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)

    This box has 128 threads and 16 bytes is enough to cover it.

    Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched:
    sched_getaffinity(): Allow less than NR_CPUS length) is
    comparing this 16-byte length against nr_cpu_ids.

    Fix it by comparing nr_cpu_ids to the number of bits in the
    cpumask we pass in.

    Signed-off-by: Anton Blanchard
    Reviewed-by: KOSAKI Motohiro
    Cc: Sharyathi Nagesh
    Cc: Ulrich Drepper
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jack Steiner
    Cc: Russ Anderson
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
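
The unit mismatch can be sketched as two small checks. The function names are hypothetical and the bodies only mirror the comparison described in the message (a byte count compared against a CPU count, versus a bit count compared against a CPU count); they are not the actual syscall implementation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Before the fix (as described above): bytes are compared against CPUs. */
static int sketch_check_len_before(size_t len_bytes, unsigned int nr_cpu_ids)
{
    assert(nr_cpu_ids > 0);
    return len_bytes < nr_cpu_ids ? -EINVAL : 0;
}

/* After the fix: the number of bits in the mask is compared against CPUs. */
static int sketch_check_len_after(size_t len_bytes, unsigned int nr_cpu_ids)
{
    assert(nr_cpu_ids > 0);
    return len_bytes * CHAR_BIT < nr_cpu_ids ? -EINVAL : 0;
}
```

With 128 CPU ids and a 16-byte mask, the first check rejects the call (16 < 128) while the second accepts it (16 * 8 = 128 bits).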
     
  • Module refcounting is implemented with a per-cpu counter for speed.
    However there is a race when tallying the counter where a reference may
    be taken by one CPU and released by another. Reference count summation
    may then see the decrement without having seen the previous increment,
    leading to lower than expected count. A module which never has its
    actual reference drop below 1 may return a reference count of 0 due to
    this race.

    Module removal generally runs under stop_machine, which prevents this
    race causing bugs due to removal of in-use modules. However there are
    other real bugs in module.c code and driver code (module_refcount is
    exported) where the callers do not run under stop_machine.

    Fix this by maintaining running per-cpu counters for the number of
    module refcount increments and the number of refcount decrements. The
    increments are tallied after the decrements, so any decrement seen will
    always have its corresponding increment counted. The final refcount is
    the difference of the total increments and decrements, preventing a
    low-refcount from being returned.

    Signed-off-by: Nick Piggin
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Nick Piggin
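
The summation order described above can be sketched as follows. The array of per-CPU structs, the SKETCH_NR_CPUS constant and the function name are assumptions made for illustration; the sketch is single-threaded, so the read barrier the kernel would need between the two loops is only noted in a comment.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4 /* hypothetical CPU count for this sketch */

/* Per-CPU running totals; each CPU only ever adds to its own slot. */
struct sketch_module_ref {
    unsigned int incs;
    unsigned int decs;
};

static unsigned int sketch_module_refcount(const struct sketch_module_ref *refs)
{
    unsigned int incs = 0, decs = 0;
    int cpu;

    assert(refs != NULL);

    /* Tally decrements first ... */
    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        decs += refs[cpu].decs;

    /* (real concurrent code needs a read barrier here) */

    /* ... then increments, so every decrement already counted has its
     * matching increment included in the second sum. */
    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        incs += refs[cpu].incs;

    return incs - decs;
}
```

Because both totals only grow, reading the increments last can overestimate the count during a race but cannot report a value lower than the real one.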
     
  • There have been a number of reports of people seeing the message:
    "name_count maxed, losing inode data: dev=00:05, inode=3185"
    in dmesg. These usually lead to people reporting problems to the filesystem
    group who are in turn clueless what they mean.

    Eventually someone finds me and I explain what is going on and that
    these come from the audit system. The basic problem is that the
    audit subsystem never expects a single syscall to 'interact' (for some
    wishy-washy meaning of interact) with more than 20 inodes. But in fact
    some operations like loading kernel modules can cause changes to lots of
    inodes in debugfs.

    There are a couple real fixes being bandied about including removing the
    fixed compile time limit of 20 or not auditing changes in debugfs (or
    both) but neither are small and obvious so I am not sending them for
    immediate inclusion (I hope Al forwards a real solution next devel
    window).

    In the meantime this patch simply adds 'audit' to the beginning of the
    crap message so if a user sees it, they come blame me first and we can
    talk about what it means and make sure we understand all of the reasons
    it can happen and make sure this gets solved correctly in the long run.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
    eeepc-wmi: include slab.h
    staging/otus: include slab.h from usbdrv.h
    percpu: don't implicitly include slab.h from percpu.h
    kmemcheck: Fix build errors due to missing slab.h
    include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
    iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
    x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

    Fix up trivial conflicts in include/linux/percpu.h due to
    is_kernel_percpu_address() having been introduced since the slab.h
    cleanup with the percpu_up.c splitup.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    module: add stub for is_module_percpu_address
    percpu, module: implement and use is_kernel/module_percpu_address()
    module: encapsulate percpu handling better and record percpu_size

    Linus Torvalds
     

05 Apr, 2010

4 commits


03 Apr, 2010

23 commits

  • Now that software events use perf_arch_fetch_caller_regs() too, we
    need the stub version to be always built in for archs that don't
    implement it.

    Fixes the following build error on PARISC:

    kernel/built-in.o: In function `perf_event_task_sched_out':
    (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs'

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras

    Frederic Weisbecker
     
  • * 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
    kgdb: Turn off tracing while in the debugger
    kgdb: use atomic_inc and atomic_dec instead of atomic_set
    kgdb: eliminate kgdb_wait(), all cpus enter the same way
    kgdbts,sh: Add in breakpoint pc offset for superh
    kgdb: have ebin2mem call probe_kernel_write once

    Linus Torvalds
     
  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
    Freezer: Only show the state of tasks refusing to freeze

    Linus Torvalds
     
  • The kernel debugger should turn off kernel tracing any time the
    debugger is active and restore it on resume.

    Signed-off-by: Jason Wessel
    Reviewed-by: Steven Rostedt

    Jason Wessel
     
  • Memory barriers should be used for the kgdb cpu synchronization. The
    atomic_set() does not imply a memory barrier.

    Reported-by: Will Deacon
    Signed-off-by: Jason Wessel

    Jason Wessel
     
  • This is a kgdb architectural change to have all the cpus (master or
    slave) enter the same function.

    A cpu that hits an exception (wants to be the master cpu) will call
    kgdb_handle_exception() from the trap handler and then invoke a
    kgdb_roundup_cpu() to synchronize the other cpus and bring them into
    the kgdb_handle_exception() as well.

    A slave cpu will enter kgdb_handle_exception() from the
    kgdb_nmicallback() and set the exception state to note that the
    processor is a slave.

    Previously the slave cpu would have called kgdb_wait(). This change
    allows the debug core to change cpus without resuming the system in
    order to inspect arch specific cpu information.

    Signed-off-by: Jason Wessel

    Jason Wessel
     
  • Rather than call probe_kernel_write() one byte at a time, process the
    whole buffer locally and pass the entire result in one go. This way,
    architectures that need to do special handling based on the length can
    do so, or we only end up calling memcpy() once.

    [sonic.zhang@analog.com: Reported original problem and preliminary patch]

    Signed-off-by: Jason Wessel
    Signed-off-by: Sonic Zhang
    Signed-off-by: Mike Frysinger

    Jason Wessel
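
The "decode locally, write once" approach can be sketched in userspace C. The escaping rule used here is the GDB remote protocol's binary escape (0x7d followed by the original byte XOR 0x20); memcpy() stands in for probe_kernel_write(), and the sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Decode 'count' escaped bytes from buf in place, then copy the whole
 * decoded result to mem with a single call. Returns the bytes written.
 */
static size_t sketch_ebin2mem(unsigned char *buf, unsigned char *mem, size_t count)
{
    unsigned char *out = buf; /* decoded bytes overwrite the input buffer */
    size_t size = 0;

    assert(buf != NULL && mem != NULL);

    while (count-- > 0) {
        out[size] = *buf++;
        if (out[size] == 0x7d) /* escape marker: next byte is XORed with 0x20 */
            out[size] = (unsigned char)(*buf++ ^ 0x20);
        size++;
    }

    memcpy(mem, out, size); /* one write for the whole buffer */
    return size;
}
```

Decoding in place is safe because the write position never overtakes the read position, and the single final copy lets the write routine see the full length at once.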
     
  • In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

    ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
    ENQUEUE_WAKING - the enqueue has relative vruntime due to
                     having sched_class::task_waking() called,
    ENQUEUE_HEAD   - the waking task should be placed on the head
                     of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
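
A flags-based interface of this kind can be sketched as a small bitmask. The flag names come from the message above; the numeric values and the two helper functions are assumptions for illustration only and are not taken from the scheduler source.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names from the commit message; the bit values are assumed. */
enum {
    ENQUEUE_WAKEUP = 1 << 0,
    ENQUEUE_WAKING = 1 << 1,
    ENQUEUE_HEAD   = 1 << 2,
};

/* Hypothetical helper: map the old (int wakeup, bool head) pair to flags. */
static int sketch_enqueue_flags(bool wakeup, bool head)
{
    int flags = 0;

    if (wakeup)
        flags |= ENQUEUE_WAKEUP;
    if (head)
        flags |= ENQUEUE_HEAD;
    return flags;
}

/* Hypothetical helper: a scheduling class tests a single bit of interest. */
static bool sketch_places_at_head(int flags)
{
    assert((flags & ~(ENQUEUE_WAKEUP | ENQUEUE_WAKING | ENQUEUE_HEAD)) == 0);
    return (flags & ENQUEUE_HEAD) != 0;
}
```

A single int argument leaves room for further flags such as ENQUEUE_WAKING without changing every caller's signature again.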
     
  • The cpuload calculation in calc_load_account_active() assumes
    rq->nr_uninterruptible will not change on an offline cpu after
    migrate_nr_uninterruptible(). However the recent migrate on wakeup
    changes broke that and would result in decrementing the offline cpu's
    rq->nr_uninterruptible.

    Fix this by accounting the nr_uninterruptible on the waking cpu.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that we hold the rq->lock over set_task_cpu() again, we can do
    away with most of the TASK_WAKING checks and reduce them again to
    set_cpus_allowed_ptr().

    Removes some conditionals from scheduling hot-paths.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noticed a few races with the TASK_WAKING usage on fork.

    - since TASK_WAKING is basically a spinlock, it should be IRQ safe
    - since we set TASK_WAKING (*) without holding rq->lock it could
      be there still is a rq->lock holder, thereby not actually
      providing full serialization.

    (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

    Cure the second issue by not setting TASK_WAKING in sched_fork(), but
    only temporarily in wake_up_new_task() while calling select_task_rq().

    Cure the first by holding rq->lock around the select_task_rq() call,
    this will disable IRQs, this however requires that we push down the
    rq->lock release into select_task_rq_fair()'s cgroup stuff.

    Because select_task_rq_fair() still needs to drop the rq->lock we
    cannot fully get rid of TASK_WAKING.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change makes the code much more correct compared to deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • _cpu_down() changes the current task's affinity and then recovers it at
    the end. The problems are well known: we can't restore old_allowed if it
    was bound to the now-dead-cpu, and we can race with the userspace which
    can change cpu-affinity during unplug.

    _cpu_down() should not play with current->cpus_allowed at all. Instead,
    take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
    removes the dying cpu from cpu_online_mask.

    Signed-off-by: Oleg Nesterov
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
    This can race with other CPUs updating our ->cpus_allowed, and this
    looks meaningless to me.

    The task is current and running, it must have online cpus in ->cpus_allowed,
    the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
    this likely means we raced with set_cpus_allowed() which was called
    for a reason, so why should sched_exec() retry and call
    ->select_task_rq() again?

    Change the code to call sched_class->select_task_rq() directly and do
    nothing if the returned cpu is wrong after re-checking under rq->lock.

    From now on task_struct->cpus_allowed is always stable under
    TASK_WAKING, and select_fallback_rq() is always called under rq->lock
    or the caller owns TASK_WAKING (select_task_rq).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous patch preserved the retry logic, but it looks unneeded.

    __migrate_task() can only fail if we raced with migration after we dropped
    the lock, but in this case the caller of set_cpus_allowed/etc must initiate
    migration itself if ->on_rq == T.

    We already fixed p->cpus_allowed, the changes in active/online masks must
    be visible to racer, it should migrate the task to online cpu correctly.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
    lockless. We can race with set_cpus_allowed() running in parallel.

    Change it to take rq->lock around select_fallback_rq(). Note that it is not
    trivial to move this spin_lock() into select_fallback_rq(), we must recheck
    the task was not migrated after we take the lock and other callers do not
    need this lock.

    To avoid the races with other callers of select_fallback_rq() which rely on
    TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
    The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
    CPU anyway, and the subsequent __migrate_task() is just meaningless because
    p->se.on_rq must be false.

    Alternatively, we could change select_task_rq() to take rq->lock right
    after it calls sched_class->select_task_rq(), but this looks a bit ugly.

    Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
      cpuset_cpus_allowed_locked() but we can't take this lock in
      try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
      callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
      stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
      cpuset_lock() and hangs forever because CPU is already dead and thus
      T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
      which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This is left over from commit 7c9414385e ("sched: Remove USER_SCHED")

    Signed-off-by: Li Zefan
    Acked-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    Cc: David Howells
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • Trivial typo fix. rq->migration_thread can be NULL after
    task_rq_unlock(), this is why we have "mt" which should be
    used instead.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks
    task_times(). Other places in kernel use nvcsw and nivcsw, which are
    being cleared as well. Clear task statistics only.

    Reported-by: Török Edwin
    Signed-off-by: Mike Galbraith
    Cc: Hidetoshi Seto
    Cc: Arjan van de Ven
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Merge reason: update to latest upstream

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • perf sched record can deadlock a box should the holder of
    handle->data->lock take an interrupt, and then attempt to
    acquire an rq lock held by a CPU trying to acquire the
    same lock. Disable interrupts.

    CPU0                            CPU1
                                    sched event with rq->lock held
    grab handle->data->lock
                                    spin on handle->data->lock
    interrupt
    try to grab rq->lock

    Reported-by: Li Zefan
    Signed-off-by: Mike Galbraith
    Tested-by: Li Zefan
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • …eric/random-tracing into perf/urgent

    Ingo Molnar
     

01 Apr, 2010

2 commits

  • Scheduler's task migration events don't work because they always
    pass NULL regs to perf_sw_event(). The event hence gets filtered
    in perf_swevent_add().

    Scheduler's context switch events use task_pt_regs() to get
    the context when the event occurred, which is the wrong thing to
    do as this won't give us the place in the kernel where we went
    to sleep but the place where we left userspace. The result is
    even more wrong if we switch from a kernel thread.

    Use the hot regs snapshot for both events as they belong to the
    non-interrupt/exception based events family. Unlike page faults
    or so that provide the regs matching the exact origin of the event,
    we need to save the current context.

    This makes the task migration event work and fixes the context
    switch callchains and origin ip.

    Example: perf record -a -e cs

    Before:

    10.91% ksoftirqd/0 0 [k] 0000000000000000
    |
    --- (nil)
    perf_callchain
    perf_prepare_sample
    __perf_event_overflow
    perf_swevent_overflow
    perf_swevent_add
    perf_swevent_ctx_event
    do_perf_sw_event
    __perf_sw_event
    perf_event_task_sched_out
    schedule
    run_ksoftirqd
    kthread
    kernel_thread_helper

    After:

    23.77% hald-addon-stor [kernel.kallsyms] [k] schedule
    |
    --- schedule
    |
    |--60.00%-- schedule_timeout
    | wait_for_common
    | wait_for_completion
    | blk_execute_rq
    | scsi_execute
    | scsi_execute_req
    | sr_test_unit_ready
    | |
    | |--66.67%-- sr_media_change
    | | media_changed
    | | cdrom_media_changed
    | | sr_block_media_changed
    | | check_disk_change
    | | cdrom_open

    v2: Always build perf_arch_fetch_caller_regs() now that software
    events need that too. They don't need it from modules, unlike trace
    events, so we keep the EXPORT_SYMBOL in trace_event_perf.c

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: David Miller

    Frederic Weisbecker
     
  • The trace event buffer used by perf to record raw sample events
    is typed as an array of char and may then not be aligned to 8
    by alloc_percpu().

    But we need it to be aligned to 8 in sparc64 because we cast
    this buffer into a random structure type built by the TRACE_EVENT()
    macro to store the traces. So if a 64-bit field inside that
    structure is accessed, it may not be properly aligned.

    Use an array of long instead to force the appropriate alignment, and
    perform a compile time check to ensure the size in bytes of the buffer
    is a multiple of sizeof(long) so that its actual size doesn't get
    shrunk under us.

    This fixes unaligned accesses reported while using perf lock
    on sparc64.

    Suggested-by: David Miller
    Suggested-by: Tejun Heo
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Steven Rostedt

    Frederic Weisbecker
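
The "array of long plus compile-time size check" idea can be sketched in standard C11. The buffer size, the type name and the accessor functions are hypothetical; the kernel uses its own build-time assertion macro rather than static_assert.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define SKETCH_TRACE_BUF_SIZE 2048 /* hypothetical buffer size in bytes */

/* Refuse to build if dividing by sizeof(long) would silently drop bytes. */
static_assert(SKETCH_TRACE_BUF_SIZE % sizeof(long) == 0,
              "trace buffer size must be a multiple of sizeof(long)");

/* Typing the buffer as long[] instead of char[] gives it long's alignment. */
typedef long sketch_trace_buf_t[SKETCH_TRACE_BUF_SIZE / sizeof(long)];

static size_t sketch_trace_buf_bytes(void)
{
    return sizeof(sketch_trace_buf_t);
}

static size_t sketch_trace_buf_align(void)
{
    return alignof(sketch_trace_buf_t);
}
```

On a platform where long is 8 bytes, such as sparc64, the buffer is then 8-byte aligned, and the static check guarantees the array still covers the full requested size.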
     

31 Mar, 2010

1 commit

  • Network folks reported that directing all MSI-X vectors of their multi
    queue NICs to a single core can cause interrupt stack overflows when
    enough interrupts fire at the same time.

    This is caused by the fact that we run interrupt handlers by default
    with interrupts enabled unless the driver requests the interrupt with
    the IRQF_DISABLED flag set. The NIC handlers do not set this flag, so
    simultaneous interrupts can nest without limit and cause the stack
    overflow.

    The only safe countermeasure is to run the interrupt handlers with
    interrupts disabled. We can't switch to this mode in general right
    now, but it is safe to do so for MSI interrupts.

    Force IRQF_DISABLED for MSI interrupt handlers.

    Signed-off-by: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alan Cox
    Cc: David Miller
    Cc: Greg Kroah-Hartman
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@kernel.org

    Thomas Gleixner