15 Apr, 2010
1 commit
-
Merge reason: merge the latest fixes, update to -rc4.
Signed-off-by: Ingo Molnar
11 Apr, 2010
1 commit
-
When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device
number improperly with old_decode_dev(), and it results in an error
while hibernating with s2disk.
All users already pass the new device number, so switch to
new_decode_dev().
Signed-off-by: Jiri Slaby
Reported-and-tested-by: Jiri Kosina
Signed-off-by: "Rafael J. Wysocki"
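For illustration, a standalone sketch (a userland re-implementation of the two decoders from include/linux/kdev_t.h; the kernel-internal dev_t layout is 12-bit major, 20-bit minor): a new-format device number run through old_decode_dev() silently truncates to the wrong device, which is exactly what the extended dev_t numbers from CONFIG_DEBUG_BLOCK_EXT_DEVT provoke.
#include <stdio.h>

#define MINORBITS 20
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* old 16-bit encoding: 8-bit major, 8-bit minor */
static unsigned int old_decode_dev(unsigned short val)
{
        return MKDEV((val >> 8) & 255, val & 255);
}

/* new 32-bit encoding: split major/minor fields */
static unsigned int new_decode_dev(unsigned int dev)
{
        unsigned int major = (dev & 0xfff00) >> 8;
        unsigned int minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
}

int main(void)
{
        unsigned int val = (259 << 8) | 5;  /* new-format: major 259, minor 5 */
        printf("old_decode_dev: %#x\n", old_decode_dev((unsigned short)val)); /* 3:5 - wrong */
        printf("new_decode_dev: %#x\n", new_decode_dev(val));                 /* 259:5 */
        return 0;
}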
08 Apr, 2010
1 commit
-
…l/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix sched_getaffinity()
07 Apr, 2010
2 commits
-
- We weren't zeroing p->rss_stat[] at fork()
- Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
threads and was oopsing.
- Make __sync_task_rss_stat() static, too.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648
[akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
Reported-by: Troels Liebe Bentsen
Signed-off-by: KAMEZAWA Hiroyuki
Cc: "Michael S. Tsirkin"
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
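A minimal sketch of the first fix (its placement in kernel/fork.c:copy_process() is an assumption): the child must start with a zeroed per-task RSS cache rather than inherit the parent's pending deltas.
#if defined(SPLIT_RSS_COUNTING)
        /* fork: don't carry the parent's cached rss deltas into the child */
        memset(&tsk->rss_stat, 0, sizeof(tsk->rss_stat));
#endif
-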
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Force MSI irq handlers to run with interrupts disabled
06 Apr, 2010
5 commits
-
taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
the following error:
sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)
This box has 128 threads and 16 bytes is enough to cover it.
Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched:
sched_getaffinity(): Allow less than NR_CPUS length) is
comparing these 16 bytes against nr_cpu_ids.
Fix it by comparing nr_cpu_ids to the number of bits in the
cpumask we pass in.
Signed-off-by: Anton Blanchard
Reviewed-by: KOSAKI Motohiro
Cc: Sharyathi Nagesh
Cc: Ulrich Drepper
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: Jack Steiner
Cc: Russ Anderson
Cc: Mike Travis
LKML-Reference:
Signed-off-by: Ingo Molnar
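The shape of the corrected check, as a sketch (the surrounding sys_sched_getaffinity() body is assumed): a len-byte user buffer holds len * BITS_PER_BYTE CPU bits, and that bit count, not len itself, is what must cover nr_cpu_ids.
/* in sys_sched_getaffinity(), len is the user-supplied byte count */
if ((len * BITS_PER_BYTE) < nr_cpu_ids)   /* was effectively: len < nr_cpu_ids */
        return -EINVAL;
if (len & (sizeof(unsigned long) - 1))    /* still a whole number of longs */
        return -EINVAL;
-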
Module refcounting is implemented with a per-cpu counter for speed.
However there is a race when tallying the counter where a reference may
be taken by one CPU and released by another. Reference count summation
may then see the decrement without having seen the previous increment,
leading to a lower than expected count. A module which never has its
actual reference count drop below 1 may report a refcount of 0 due to
this race.
Module removal generally runs under stop_machine, which prevents this
race causing bugs due to removal of in-use modules. However there are
other real bugs in module.c code and driver code (module_refcount is
exported) where the callers do not run under stop_machine.
Fix this by maintaining running per-cpu counters for the number of
module refcount increments and the number of refcount decrements. The
increments are tallied after the decrements, so any decrement seen will
always have its corresponding increment counted. The final refcount is
the difference of the total increments and decrements, preventing a
low refcount from being returned.
Signed-off-by: Nick Piggin
Acked-by: Rusty Russell
Signed-off-by: Linus Torvalds
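A sketch of the scheme (modeled on the kernel/module.c of that era; treat the details as approximate): sum the decrements first, with a read barrier before summing the increments, so any decrement included has its matching increment included too.
struct module_ref {
        unsigned int incs;
        unsigned int decs;
} __cacheline_aligned;

unsigned int module_refcount(struct module *mod)
{
        unsigned int incs = 0, decs = 0;
        int cpu;

        /* decrements first ... */
        for_each_possible_cpu(cpu)
                decs += per_cpu_ptr(mod->refptr, cpu)->decs;
        smp_rmb();
        /* ... increments second: every decrement counted above has its
         * increment counted here, so incs - decs is never spuriously low */
        for_each_possible_cpu(cpu)
                incs += per_cpu_ptr(mod->refptr, cpu)->incs;
        return incs - decs;
}
-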
There have been a number of reports of people seeing the message:
"name_count maxed, losing inode data: dev=00:05, inode=3185"
in dmesg. These usually lead to people reporting problems to the filesystem
group, who are in turn clueless about what they mean.
Eventually someone finds me and I explain what is going on and that
these come from the audit system. The basic problem is that the
audit subsystem never expects a single syscall to 'interact' (for some
wishy-washy meaning of interact) with more than 20 inodes. But in fact
some operations like loading kernel modules can cause changes to lots of
inodes in debugfs.
There are a couple of real fixes being bandied about, including removing the
fixed compile time limit of 20 or not auditing changes in debugfs (or
both) but neither is small and obvious, so I am not sending them for
immediate inclusion (I hope Al forwards a real solution next devel
window).
In the meantime this patch simply adds 'audit' to the beginning of the
crap message so if a user sees it, they come blame me first and we can
talk about what it means and make sure we understand all of the reasons
it can happen and make sure this gets solved correctly in the long run.
Signed-off-by: Eric Paris
Signed-off-by: Linus Torvalds
-
* 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
eeepc-wmi: include slab.h
staging/otus: include slab.h from usbdrv.h
percpu: don't implicitly include slab.h from percpu.h
kmemcheck: Fix build errors due to missing slab.h
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h
Fix up trivial conflicts in include/linux/percpu.h due to
is_kernel_percpu_address() having been introduced since the slab.h
cleanup with the percpu_up.c splitup.
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
module: add stub for is_module_percpu_address
percpu, module: implement and use is_kernel/module_percpu_address()
module: encapsulate percpu handling better and record percpu_size
05 Apr, 2010
4 commits
-
…/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Always build the powerpc perf_arch_fetch_caller_regs version
perf: Always build the stub perf_arch_fetch_caller_regs version
perf, probe-finder: Build fix on Debian
perf/scripts: Tuple was set from long in both branches in python_process_event()
perf: Fix 'perf sched record' deadlock
perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels
perf, x86: Fix AMD hotplug & constraint initialization
x86: Move notify_cpu_starting() callback to a later stage
x86,kgdb: Always initialize the hw breakpoint attribute
perf: Use hot regs with software sched switch/migrate events
perf: Correctly align perf event tracing buffer
-
…l/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock
sched: Fix proc_sched_set_task()
-
…nel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ring-buffer: Add missing unlock
tracing: Fix lockdep warning in global_clock()
03 Apr, 2010
23 commits
-
Now that software events use perf_arch_fetch_caller_regs() too, we
need the stub version to be always built in for archs that don't
implement it.
Fixes the following build error on PARISC:
kernel/built-in.o: In function `perf_event_task_sched_out':
(.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs'
Reported-by: Ingo Molnar
Signed-off-by: Frederic Weisbecker
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
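The stub itself is tiny; a sketch (signature as described in the changelog, declared __weak so an architecture's real implementation overrides it at link time):
__weak
void perf_arch_fetch_caller_regs(struct pt_regs *regs,
                                 unsigned long ip, int skip)
{
        /* default: do nothing; archs that can snapshot their caller's
         * registers provide a strong version of this symbol */
}
-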
* 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: Turn off tracing while in the debugger
kgdb: use atomic_inc and atomic_dec instead of atomic_set
kgdb: eliminate kgdb_wait(), all cpus enter the same way
kgdbts,sh: Add in breakpoint pc offset for superh
kgdb: have ebin2mem call probe_kernel_write once
-
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
Freezer: Only show the state of tasks refusing to freeze
-
The kernel debugger should turn off kernel tracing any time the
debugger is active and restore it on resume.
Signed-off-by: Jason Wessel
Reviewed-by: Steven Rostedt
-
Memory barriers should be used for the kgdb cpu synchronization. The
atomic_set() does not imply a memory barrier.
Reported-by: Will Deacon
Signed-off-by: Jason Wessel
-
This is a kgdb architectural change to have all the cpus (master or
slave) enter the same function.
A cpu that hits an exception (wants to be the master cpu) will call
kgdb_handle_exception() from the trap handler and then invoke a
kgdb_roundup_cpu() to synchronize the other cpus and bring them into
kgdb_handle_exception() as well.
A slave cpu will enter kgdb_handle_exception() from the
kgdb_nmicallback() and set the exception state to note that the
processor is a slave.
Previously the slave cpu would have called kgdb_wait(). This change
allows the debug core to change cpus without resuming the system in
order to inspect arch specific cpu information.
Signed-off-by: Jason Wessel
-
Rather than call probe_kernel_write() one byte at a time, process the
whole buffer locally and pass the entire result in one go. This way,
architectures that need to do special handling based on the length can
do so, or we only end up calling memcpy() once.
[sonic.zhang@analog.com: Reported original problem and preliminary patch]
Signed-off-by: Jason Wessel
Signed-off-by: Sonic Zhang
Signed-off-by: Mike Frysinger
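A sketch of the reworked helper (close in spirit to the kgdb ebin2mem code, with details assumed): un-escape the gdb binary packet into a scratch buffer first, then issue a single probe_kernel_write() for the whole thing.
static int kgdb_ebin2mem(char *buf, char *mem, int count)
{
        int size = 0;
        char *c = buf;

        /* 0x7d is the gdb remote protocol escape byte; the byte
         * following it is XORed with 0x20 */
        while (count-- > 0) {
                c[size] = *buf++;
                if (c[size] == 0x7d)
                        c[size] = *buf++ ^ 0x20;
                size++;
        }

        /* one write for the decoded buffer instead of one
         * probe_kernel_write() call per byte */
        return probe_kernel_write(mem, c, size);
}
-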
In order to reduce the dependency on TASK_WAKING, rework the enqueue
interface to support a proper flags field.
Replace the int wakeup, bool head arguments with an int flags argument
and create the following flags:
ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
ENQUEUE_WAKING - the enqueue has relative vruntime due to
having sched_class::task_waking() called,
ENQUEUE_HEAD - the waking task should be placed on the head
of the priority queue (where appropriate).
For symmetry also convert sched_class::dequeue() to a flags scheme.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
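Roughly, the interface change looks like this (the flag values and the callback fragment are illustrative assumptions, not a quote of the patch):
#define ENQUEUE_WAKEUP  1   /* wakeup of a sleeping task */
#define ENQUEUE_WAKING  2   /* vruntime is relative: task_waking() ran */
#define ENQUEUE_HEAD    4   /* queue at the head of the priority queue */

/* before: void (*enqueue_task)(struct rq *rq, struct task_struct *p,
 *                              int wakeup, bool head);
 * after, symmetrically for both queue operations: */
void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
-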
The cpuload calculation in calc_load_account_active() assumes
rq->nr_uninterruptible will not change on an offline cpu after
migrate_nr_uninterruptible(). However the recent migrate on wakeup
changes broke that and would result in decrementing the offline cpu's
rq->nr_uninterruptible.
Fix this by accounting the nr_uninterruptible on the waking cpu.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Now that we hold the rq->lock over set_task_cpu() again, we can do
away with most of the TASK_WAKING checks and reduce them again to
set_cpus_allowed_ptr().
Removes some conditionals from scheduling hot-paths.
Signed-off-by: Peter Zijlstra
Cc: Oleg Nesterov
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Oleg noticed a few races with the TASK_WAKING usage on fork.
- since TASK_WAKING is basically a spinlock, it should be IRQ safe
- since we set TASK_WAKING (*) without holding rq->lock it could
be that there still is a rq->lock holder, thereby not actually
providing full serialization.
(*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
Cure the second issue by not setting TASK_WAKING in sched_fork(), but
only temporarily in wake_up_new_task() while calling select_task_rq().
Cure the first by holding rq->lock around the select_task_rq() call;
this will disable IRQs. This however requires that we push down the
rq->lock release into select_task_rq_fair()'s cgroup stuff.
Because select_task_rq_fair() still needs to drop the rq->lock we
cannot fully get rid of TASK_WAKING.
Reported-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Introduce a cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
with select_fallback_rq(). It can be called from any context and can't use
any cpuset locks including task_lock(). It is called when the task doesn't
have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
suitable cpu.
I am not proud of this patch. Everything which needs such a fat comment
can't be good even if correct. But I'd prefer to not change the locking
rules in code I hardly understand, and in any case I believe this
simple change makes the code much more correct compared to the deadlocks we
currently have.
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
_cpu_down() changes the current task's affinity and then recovers it at
the end. The problems are well known: we can't restore old_allowed if it
was bound to the now-dead-cpu, and we can race with the userspace which
can change cpu-affinity during unplug.
_cpu_down() should not play with current->cpus_allowed at all. Instead,
take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
removes the dying cpu from cpu_online_mask.
Signed-off-by: Oleg Nesterov
Acked-by: Rafael J. Wysocki
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
sched_exec()->select_task_rq() reads/updates ->cpus_allowed locklessly.
This can race with other CPUs updating our ->cpus_allowed, and this
looks meaningless to me.
The task is current and running, it must have online cpus in ->cpus_allowed,
so the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
this likely means we raced with set_cpus_allowed() which was called
for a reason, so why should sched_exec() retry and call ->select_task_rq()
again?
Change the code to call sched_class->select_task_rq() directly and do
nothing if the returned cpu is wrong after re-checking under rq->lock.
From now on, task_struct->cpus_allowed is always stable under TASK_WAKING:
select_fallback_rq() is always called under rq->lock, or the caller
owns TASK_WAKING (select_task_rq).
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
The previous patch preserved the retry logic, but it looks unneeded.
__migrate_task() can only fail if we raced with migration after we dropped
the lock, but in this case the caller of set_cpus_allowed/etc must initiate
migration itself if ->on_rq == T.
We already fixed p->cpus_allowed; the changes in the active/online masks must
be visible to the racer, which should migrate the task to an online cpu correctly.
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
locklessly. We can race with set_cpus_allowed() running in parallel.
Change it to take rq->lock around select_fallback_rq(). Note that it is not
trivial to move this spin_lock() into select_fallback_rq(): we must recheck
that the task was not migrated after we take the lock, and other callers do
not need this lock.
To avoid races with other callers of select_fallback_rq() which rely on
TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
CPU anyway, and the subsequent __migrate_task() is just meaningless because
p->se.on_rq must be false.
Alternatively, we could change select_task_rq() to take rq->lock right
after it calls sched_class->select_task_rq(), but this looks a bit ugly.
Also, change it to not assume irqs are disabled, and absorb __migrate_task_irq().
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
This patch just states the fact that the cpusets/cpuhotplug interaction is
broken and removes the deadlockable code which only pretends to work.
- cpuset_lock() doesn't really work. It is needed for
cpuset_cpus_allowed_locked() but we can't take this lock in the
try_to_wake_up()->select_fallback_rq() path.
- cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU takes
callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
cpuset_lock() and hangs forever because the CPU is already dead and thus
T can't be scheduled.
- cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock(),
which is not irq-safe, but try_to_wake_up() can be called from irq.
Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
we currently do without CONFIG_CPUSETS.
Also, with or without this patch, with or without CONFIG_CPUSETS, the
callers of select_fallback_rq() can race with each other or with
set_cpus_allowed() paths.
The subsequent patches try to fix these problems.
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
This is left over from commit 7c9414385e ("sched: Remove USER_SCHED").
Signed-off-by: Li Zefan
Acked-by: Dhaval Giani
Signed-off-by: Peter Zijlstra
Cc: David Howells
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Trivial typo fix. rq->migration_thread can be NULL after
task_rq_unlock(), this is why we have "mt" which should be
used instead.
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks
task_times(). Other places in the kernel use nvcsw and nivcsw, which are
being cleared as well. Clear the task's statistics only.
Reported-by: Török Edwin
Signed-off-by: Mike Galbraith
Cc: Hidetoshi Seto
Cc: Arjan van de Ven
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Merge reason: update to latest upstream
Signed-off-by: Ingo Molnar
-
perf sched record can deadlock a box should the holder of
handle->data->lock take an interrupt, and then attempt to
acquire an rq lock held by a CPU trying to acquire the
same lock. Disable interrupts.
CPU0                                  CPU1
sched event with rq->lock held
                                      grab handle->data->lock
spin on handle->data->lock
                                      interrupt
                                      try to grab rq->lock
Reported-by: Li Zefan
Signed-off-by: Mike Galbraith
Tested-by: Li Zefan
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
LKML-Reference:
Signed-off-by: Ingo Molnar
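In sketch form, the pattern of the fix (illustrative only, not the actual perf_event.c code; handle->data->lock is the lock named above): never take a lock that interrupt context may also want without masking interrupts first.
unsigned long flags;

local_irq_save(flags);          /* no interrupt can nest on this CPU now */
spin_lock(&handle->data->lock);
/* ... emit the sample into the buffer ... */
spin_unlock(&handle->data->lock);
local_irq_restore(flags);
-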
…eric/random-tracing into perf/urgent
01 Apr, 2010
2 commits
-
Scheduler's task migration events don't work because they always
pass NULL regs to perf_sw_event(). The event hence gets filtered
in perf_swevent_add().
Scheduler's context switch events use task_pt_regs() to get
the context when the event occurred, which is a wrong thing to
do as this won't give us the place in the kernel where we went
to sleep but the place where we left userspace. The result is
even more wrong if we switch from a kernel thread.
Use the hot regs snapshot for both events as they belong to the
non-interrupt/exception based events family. Unlike page faults
and the like, which provide the regs matching the exact origin of the event,
we need to save the current context.
This makes the task migration event work and fixes the context
switch callchains and origin ip.
Example: perf record -a -e cs
Before:
10.91% ksoftirqd/0 0 [k] 0000000000000000
|
--- (nil)
perf_callchain
perf_prepare_sample
__perf_event_overflow
perf_swevent_overflow
perf_swevent_add
perf_swevent_ctx_event
do_perf_sw_event
__perf_sw_event
perf_event_task_sched_out
schedule
run_ksoftirqd
kthread
kernel_thread_helper
After:
23.77% hald-addon-stor [kernel.kallsyms] [k] schedule
|
--- schedule
|
|--60.00%-- schedule_timeout
| wait_for_common
| wait_for_completion
| blk_execute_rq
| scsi_execute
| scsi_execute_req
| sr_test_unit_ready
| |
| |--66.67%-- sr_media_change
| | media_changed
| | cdrom_media_changed
| | sr_block_media_changed
| | check_disk_change
| | cdrom_open
v2: Always build perf_arch_fetch_caller_regs() now that software
events need that too. They don't need it from modules, unlike trace
events, so we keep the EXPORT_SYMBOL in trace_event_perf.c
Signed-off-by: Frederic Weisbecker
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: David Miller
-
The trace event buffer used by perf to record raw sample events
is typed as an array of char and may then not be aligned to 8
by alloc_percpu().
But we need it to be aligned to 8 on sparc64 because we cast
this buffer into a random structure type built by the TRACE_EVENT()
macro to store the traces. So if a random 64 bit field is accessed
inside, it may not be under an expected good alignment.
Use an array of long instead to force the appropriate alignment, and
perform a compile time check to ensure the size in bytes of the buffer
is a multiple of sizeof(long) so that its actual size doesn't get
shrunk under us.
This fixes unaligned accesses reported while using perf lock
on sparc64.
Suggested-by: David Miller
Suggested-by: Tejun Heo
Signed-off-by: Frederic Weisbecker
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: David Miller
Cc: Steven Rostedt
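The technique, sketched (the typedef mirrors the kernel's perf trace buffer of that era; treat the names and the size constant as assumptions):
/* 'long' rather than 'char' so alloc_percpu() hands back storage with
 * natural (8-byte on 64-bit) alignment for the structures cast into it */
typedef unsigned long perf_trace_t[PERF_MAX_TRACE_SIZE / sizeof(unsigned long)];

static void check_buffer_size(void)
{
        /* compile-time check: the integer division above must not
         * have silently shrunk the buffer */
        BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(unsigned long));
}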
31 Mar, 2010
1 commit
-
Network folks reported that directing all MSI-X vectors of their multi
queue NICs to a single core can cause interrupt stack overflows when
enough interrupts fire at the same time.
This is caused by the fact that we run interrupt handlers by default
with interrupts enabled unless the driver requests the interrupt with
the IRQF_DISABLED flag set. The NIC handlers do not set this flag, so
simultaneous interrupts can nest without limit and cause the stack to
overflow.
The only safe countermeasure is to run the interrupt handlers with
interrupts disabled. We can't switch to this mode in general right
now, but it is safe to do so for MSI interrupts.
Force IRQF_DISABLED for MSI interrupt handlers.
Signed-off-by: Thomas Gleixner
Cc: Andi Kleen
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alan Cox
Cc: David Miller
Cc: Greg Kroah-Hartman
Cc: Arnaldo Carvalho de Melo
Cc: stable@kernel.org
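A sketch of the enforcement point (assumed to sit in the irq setup path, e.g. __setup_irq(); the field names are approximate): an irq backed by an MSI descriptor gets IRQF_DISABLED forced on, so its handler always runs with interrupts off and can never nest.
/* MSI(-X) interrupts: force the handler to run with irqs disabled */
if (desc->msi_desc)
        new->flags |= IRQF_DISABLED;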