25 Apr, 2010

1 commit

  • On ppc64 you get this error:

    $ setarch ppc -R true
    setarch: ppc: Unrecognized architecture

    because uname still reports ppc64 as the machine.

    So mask off the personality flags when checking for PER_LINUX32.

    Signed-off-by: Andreas Schwab
    Reviewed-by: Christoph Hellwig
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Schwab
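
    A minimal userspace sketch of the same masking idea described above (not
    the kernel patch itself), using the personality(2) interface and the
    PER_MASK/PER_LINUX32 constants from <sys/personality.h>:

    #include <stdio.h>
    #include <sys/personality.h>

    int main(void)
    {
        /* Passing 0xffffffff queries the current personality without
         * changing it. */
        unsigned int per = (unsigned int)personality(0xffffffff);

        /* PER_MASK keeps only the base personality type; modifier flags
         * such as ADDR_NO_RANDOMIZE live in the upper bits and must be
         * masked off before comparing against PER_LINUX32. */
        if ((per & PER_MASK) == PER_LINUX32)
            printf("base personality is PER_LINUX32\n");
        else
            printf("base personality 0x%x, flags 0x%x\n",
                   per & PER_MASK, per & ~PER_MASK);
        return 0;
    }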
     

23 Apr, 2010

3 commits

  • There are several issues in the current select_idle_sibling() logic in
    select_task_rq_fair() in the context of a task wake-up:

    a) Once we select the idle sibling, we use that domain (spanning the cpu
    the task is being woken up on and the idle sibling we found) in our
    wake_affine() decisions. This domain is completely different from the
    domain we are supposed to use: the one spanning the cpu the task is being
    woken up on and the cpu where the task previously ran.

    b) We do the select_idle_sibling() check only for the cpu the task is
    being woken up on. If select_task_rq_fair() selects the previously-run
    cpu for waking the task, doing a select_idle_sibling() check for that cpu
    would also help, but we don't do this currently.

    c) In scenarios where the cpu the task is being woken up on is busy but
    its HT siblings are idle, we select the idle HT sibling instead of the
    core where the task previously ran, even if that core is now completely
    idle. That is, we are not taking decisions based on wake_affine() but
    directly selecting an idle sibling, which can cause an imbalance at the
    SMT/MC level that is only later corrected by the periodic load balancer.

    Fix this by first going through the load imbalance calculations using
    wake_affine(); once we have decided between the wakeup cpu and the
    previously-run cpu, then choose a possible idle sibling of that cpu for
    waking up the task on (a toy sketch of this ordering follows below).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
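
    The sketch mentioned above is a toy model of the reordered decision, not
    the scheduler code; cpu_busy[], idle_sibling_of() and
    wake_affine_prefers_wake_cpu() are invented stand-ins, and the "cpu n and
    cpu n^1 are HT siblings" topology is assumed purely for illustration:

    #include <stdio.h>
    #include <stdbool.h>

    #define NR_CPUS 8
    static bool cpu_busy[NR_CPUS] = { true, false, true, true,
                                      false, false, true, false };

    /* Assume cpus n and n^1 are HT siblings sharing a cache. */
    static int idle_sibling_of(int cpu)
    {
        int sib = cpu ^ 1;
        if (!cpu_busy[cpu])
            return cpu;          /* the target itself is idle */
        if (!cpu_busy[sib])
            return sib;          /* fall back to its idle sibling */
        return -1;               /* nothing idle in this cache domain */
    }

    /* Hypothetical stand-in for the wake_affine() imbalance decision. */
    static bool wake_affine_prefers_wake_cpu(int wake_cpu, int prev_cpu)
    {
        return cpu_busy[prev_cpu] && !cpu_busy[wake_cpu];
    }

    static int pick_wakeup_cpu(int wake_cpu, int prev_cpu)
    {
        /* Step 1: wake_affine()-style choice between the two cpus. */
        int target = wake_affine_prefers_wake_cpu(wake_cpu, prev_cpu)
                     ? wake_cpu : prev_cpu;
        /* Step 2: only now look for an idle sibling of that target. */
        int idle = idle_sibling_of(target);
        return idle >= 0 ? idle : target;
    }

    int main(void)
    {
        printf("wake on cpu%d\n", pick_wakeup_cpu(2, 6));
        return 0;
    }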
     
  • Dave reported that his large SPARC machines spend lots of time in
    hweight64(), so try to optimize away some of those needless
    cpumask_weight() invocations (especially with large off-stack cpumasks
    these are very expensive indeed).

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Chase reported that due to us decrementing calc_load_task prematurely
    (before the next LOAD_FREQ sample), the load average could be skewed
    by as much as the number of CPUs in the machine.

    This patch, based on Chase's patch, cures the problem by keeping the
    delta of the CPU going into NO_HZ idle separately and folding that in
    on the next LOAD_FREQ update.

    This restores the balance and we get strict LOAD_FREQ period samples.

    Signed-off-by: Peter Zijlstra
    Acked-by: Chase Douglas
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
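
    A toy, single-threaded illustration of the folding idea: a cpu entering
    NO_HZ idle parks its contribution in a separate delta instead of
    decrementing the global count early, and the delta is only folded in at
    the next LOAD_FREQ sample. Variable names loosely mirror the kernel's,
    but the code is purely illustrative:

    #include <stdio.h>

    static long calc_load_tasks;      /* global active-task count */
    static long calc_load_tasks_idle; /* deltas parked by NO_HZ idle cpus */

    static void cpu_goes_nohz_idle(long delta)
    {
        /* Do not decrement calc_load_tasks prematurely; park the delta. */
        calc_load_tasks_idle += delta;
    }

    static long fold_idle_delta(void)
    {
        long delta = calc_load_tasks_idle;
        calc_load_tasks_idle = 0;
        return delta;
    }

    static void load_freq_sample(void)
    {
        calc_load_tasks += fold_idle_delta();
        printf("sampled active tasks: %ld\n", calc_load_tasks);
    }

    int main(void)
    {
        calc_load_tasks = 8;       /* 8 runnable tasks at the last sample */
        cpu_goes_nohz_idle(-3);    /* a cpu idles, parks a delta of -3    */
        load_freq_sample();        /* next LOAD_FREQ tick folds it in: 5  */
        return 0;
    }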
     

22 Apr, 2010

1 commit

  • creds_are_invalid() reads both cred->usage and cred->subscribers and then
    compares them to make sure the number of processes subscribed to a cred struct
    never exceeds the refcount of that cred struct.

    The problem is that this can cause a race with both copy_creds() and
    exit_creds() as the two counters, whilst they are of atomic_t type, are only
    atomic with respect to themselves, and not atomic with respect to each other.

    This means that creds_are_invalid() can read the values on one CPU whilst
    they're being modified on another CPU, and so can observe an evolving state
    in which the subscribers count is now greater than the usage count was a
    moment before.

    Switching the order in which the counts are read cannot help, so the thing to
    do is to remove that particular check.

    I had considered rechecking the values to see if they're in flux if the test
    fails, but I can't guarantee they won't appear the same, even if they've
    changed several times in the meantime.

    Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled.

    The problem is only likely to occur with multithreaded programs, and can be
    tested by the tst-eintr1 program from glibc's "make check". The symptoms look
    like:

    CRED: Invalid credentials
    CRED: At include/linux/cred.h:240
    CRED: Specified credentials: ffff88003dda5878 [real][eff]
    CRED: ->magic=43736564, put_addr=(null)
    CRED: ->usage=766, subscr=766
    CRED: ->*uid = { 0,0,0,0 }
    CRED: ->*gid = { 0,0,0,0 }
    CRED: ->security is ffff88003d72f538
    CRED: ->security {359, 359}
    ------------[ cut here ]------------
    kernel BUG at kernel/cred.c:850!
    ...
    RIP: 0010:[] [] __invalid_creds+0x4e/0x52
    ...
    Call Trace:
    [] copy_creds+0x6b/0x23f

    Note the ->usage=766 and subscr=766. The values appear the same because
    they've been re-read since the check was made.

    Reported-by: Roland McGrath
    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
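
    The underlying issue, two counters that are each atomic but do not form
    an atomic pair, can be reproduced in userspace with C11 atomics and
    pthreads; the names below are illustrative, not the kernel's:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int usage;
    static atomic_int subscribers;

    static void *writer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            atomic_fetch_add(&usage, 1);        /* writers keep the invariant: */
            atomic_fetch_add(&subscribers, 1);  /* usage is bumped first       */
        }
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        long violations = 0;
        for (int i = 0; i < 1000000; i++) {
            int u = atomic_load(&usage);        /* "usage a moment before..." */
            int s = atomic_load(&subscribers);  /* "...subscribers now"       */
            if (s > u)          /* both values evolved between the two reads */
                violations++;
        }
        /* Typically non-zero whenever the two threads overlap. */
        printf("observed subscribers > usage %ld times\n", violations);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }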
     

21 Apr, 2010

1 commit

  • Patch 570b8fb505896e007fd3bb07573ba6640e51851d:

    Author: Mathieu Desnoyers
    Date: Tue Mar 30 00:04:00 2010 +0100
    Subject: CRED: Fix memory leak in error handling

    attempts to fix a memory leak in the error handling by making the offending
    return statement into a jump down to the bottom of the function where a
    kfree(tgcred) is inserted.

    This is, however, incorrect, as it does a kfree() after doing put_cred() if
    security_prepare_creds() fails. That will result in a double free if 'error'
    is jumped to, as put_cred() will also attempt to free the new tgcred record by
    virtue of it being pointed to by the new cred record.

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
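
    The general pitfall, freeing an object on an error path after its
    ownership has already been handed to something that will free it, can be
    sketched in plain C; the toy cred/tgcred types below are stand-ins, not
    the kernel structures:

    #include <stdlib.h>

    struct tgcred { int refs; };
    struct cred   { struct tgcred *tgcred; };

    static void put_cred(struct cred *cred)
    {
        free(cred->tgcred);   /* cred owns tgcred: releasing cred frees it */
        free(cred);
    }

    static struct cred *prepare_cred(int fail_late)
    {
        struct cred *cred = calloc(1, sizeof(*cred));
        struct tgcred *tgcred = calloc(1, sizeof(*tgcred));
        if (!cred || !tgcred)
            goto error_free_both;   /* nothing owns anything yet */

        cred->tgcred = tgcred;      /* ownership transferred to cred here */

        if (fail_late) {            /* a later preparation step fails...   */
            put_cred(cred);         /* ...this already frees tgcred, so    */
            return NULL;            /* jumping to a kfree(tgcred)-style    */
        }                           /* label here would double-free.       */
        return cred;

    error_free_both:
        free(tgcred);
        free(cred);
        return NULL;
    }

    int main(void)
    {
        struct cred *c = prepare_cred(0);
        if (c)
            put_cred(c);
        prepare_cred(1);
        return 0;
    }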
     

19 Apr, 2010

1 commit

  • The lockdep facility temporarily disables lockdep checking by
    incrementing the current->lockdep_recursion variable. Such
    disabling happens in NMIs and in other situations where lockdep
    might expect to recurse on itself.

    This patch therefore checks current->lockdep_recursion, disabling RCU
    lockdep splats when this variable is non-zero. In addition, this patch
    removes the "likely()", as suggested by Lai Jiangshan.

    Reported-by: Frederic Weisbecker
    Reported-by: David Miller
    Tested-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: eric.dumazet@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
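
    A hedged userspace analogue of that guard, a per-thread recursion counter
    that suppresses a debug check while the checking facility is busy with
    itself; all names are invented for the sketch:

    #include <stdio.h>

    static __thread int debug_recursion;   /* per-thread, like a task field */

    static int checker_enabled(void)
    {
        /* Splats are suppressed whenever the counter is non-zero. */
        return debug_recursion == 0;
    }

    static void debug_check(const char *what)
    {
        if (!checker_enabled())
            return;                 /* the checker is recursing on itself */
        printf("checking %s\n", what);
    }

    static void checker_internal_work(void)
    {
        debug_recursion++;          /* disable checking while recursing */
        debug_check("internal state (suppressed)");
        debug_recursion--;
    }

    int main(void)
    {
        debug_check("normal path");     /* printed */
        checker_internal_work();        /* suppressed */
        return 0;
    }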
     

11 Apr, 2010

1 commit

  • When CONFIG_DEBUG_BLOCK_EXT_DEVT is set, we decode the device improperly
    with old_decode_dev(), and this results in an error while hibernating
    with s2disk.

    All users already pass the new device number, so switch to
    new_decode_dev().

    Signed-off-by: Jiri Slaby
    Reported-and-tested-by: Jiri Kosina
    Signed-off-by: "Rafael J. Wysocki"

    Jiri Slaby
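
    For context, the two decoders differ in how many major/minor bits they
    can carry; the userspace restatement below follows the kernel's
    include/linux/kdev_t.h helpers as I understand them, so verify against
    that header before relying on the details:

    #include <stdio.h>
    #include <stdint.h>

    #define MINORBITS 20
    #define MKDEV(ma, mi) (((uint32_t)(ma) << MINORBITS) | (mi))

    /* Old 16-bit format: only 8-bit major and 8-bit minor numbers. */
    static uint32_t old_decode_dev(uint16_t val)
    {
        return MKDEV((val >> 8) & 255, val & 255);
    }

    /* New 32-bit format: 12-bit major, 20-bit minor split across the word. */
    static uint32_t new_decode_dev(uint32_t dev)
    {
        unsigned major = (dev & 0xfff00) >> 8;
        unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
    }

    int main(void)
    {
        uint32_t dev = 0x00010341;   /* arbitrary encoded device number */

        /* The old decoder only sees the low 16 bits and loses the rest. */
        printf("old: %#x  new: %#x\n",
               (unsigned)old_decode_dev((uint16_t)dev),
               (unsigned)new_decode_dev(dev));
        return 0;
    }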
     

06 Apr, 2010

5 commits

  • taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
    the following error:

    sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)

    This box has 128 threads and 16 bytes is enough to cover it.

    Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched:
    sched_getaffinity(): Allow less than NR_CPUS length) is
    comparing these 16 bytes against nr_cpu_ids.

    Fix it by comparing nr_cpu_ids to the number of bits in the
    cpumask we pass in.

    Signed-off-by: Anton Blanchard
    Reviewed-by: KOSAKI Motohiro
    Cc: Sharyathi Nagesh
    Cc: Ulrich Drepper
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jack Steiner
    Cc: Russ Anderson
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
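
    From userspace, the robust way to cope with machines like this is to
    size the cpu set dynamically instead of assuming sizeof(cpu_set_t) is
    enough; a sketch using glibc's CPU_ALLOC interface, growing the buffer
    until the kernel accepts it:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        for (int ncpus = 128; ncpus <= 65536; ncpus *= 2) {
            cpu_set_t *set = CPU_ALLOC(ncpus);
            size_t size = CPU_ALLOC_SIZE(ncpus);

            if (!set)
                break;
            if (sched_getaffinity(0, size, set) == 0) {
                printf("affinity fits in %zu bytes, %d cpus allowed\n",
                       size, CPU_COUNT_S(size, set));
                CPU_FREE(set);
                return 0;
            }
            CPU_FREE(set);      /* EINVAL: buffer too small, try larger */
        }
        perror("sched_getaffinity");
        return 1;
    }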
     
  • Module refcounting is implemented with a per-cpu counter for speed.
    However there is a race when tallying the counter where a reference may
    be taken by one CPU and released by another. Reference count summation
    may then see the decrement without having seen the previous increment,
    leading to a lower than expected count. A module which never has its
    actual reference count drop below 1 may return a reference count of 0 due
    to this race.

    Module removal generally runs under stop_machine, which prevents this
    race causing bugs due to removal of in-use modules. However there are
    other real bugs in module.c code and driver code (module_refcount is
    exported) where the callers do not run under stop_machine.

    Fix this by maintaining running per-cpu counters for the number of
    module refcount increments and the number of refcount decrements. The
    increments are tallied after the decrements, so any decrement seen will
    always have its corresponding increment counted. The final refcount is
    the difference of the total increments and decrements, preventing a
    spuriously low refcount from being returned.

    Signed-off-by: Nick Piggin
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Nick Piggin
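
    A toy, single-threaded model of the counting scheme, with per-cpu arrays
    standing in for the real per-cpu counters: decrements are summed before
    increments, so a reference taken on one cpu and dropped on another can
    never be seen as decrement-only. The names are invented for the sketch:

    #include <stdio.h>

    #define NR_CPUS 4
    static unsigned long incs[NR_CPUS];
    static unsigned long decs[NR_CPUS];

    static void ref_get(int cpu) { incs[cpu]++; }
    static void ref_put(int cpu) { decs[cpu]++; }

    static long ref_count(void)
    {
        unsigned long inc_sum = 0, dec_sum = 0;
        int cpu;

        /* Read the decrements first... */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            dec_sum += decs[cpu];
        /* ...then the increments: any decrement we saw has its matching
         * increment already recorded, so it will be counted here too. */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            inc_sum += incs[cpu];

        return (long)(inc_sum - dec_sum);
    }

    int main(void)
    {
        ref_get(0);         /* reference taken on cpu 0... */
        ref_put(2);         /* ...and dropped on cpu 2     */
        ref_get(1);
        printf("refcount = %ld\n", ref_count());   /* prints 1 */
        return 0;
    }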
     
  • There have been a number of reports of people seeing the message:
    "name_count maxed, losing inode data: dev=00:05, inode=3185"
    in dmesg. These usually lead to people reporting problems to the filesystem
    group, who in turn have no idea what the message means.

    Eventually someone finds me and I explain what is going on and that
    these come from the audit system. The basic problem is that the
    audit subsystem never expects a single syscall to 'interact' (for some
    wishy-washy meaning of interact) with more than 20 inodes. But in fact
    some operations, like loading kernel modules, can cause changes to lots of
    inodes in debugfs.

    There are a couple of real fixes being bandied about, including removing the
    fixed compile-time limit of 20 or not auditing changes in debugfs (or
    both), but neither is small and obvious so I am not sending them for
    immediate inclusion (I hope Al forwards a real solution next devel
    window).

    In the meantime this patch simply adds 'audit' to the beginning of the
    crap message so if a user sees it, they can come blame me first and we can
    talk about what it means and make sure we understand all of the reasons
    it can happen and make sure this gets solved correctly in the long run.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
    eeepc-wmi: include slab.h
    staging/otus: include slab.h from usbdrv.h
    percpu: don't implicitly include slab.h from percpu.h
    kmemcheck: Fix build errors due to missing slab.h
    include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
    iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
    x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

    Fix up trivial conflicts in include/linux/percpu.h due to
    is_kernel_percpu_address() having been introduced since the slab.h
    cleanup with the percpu_up.c splitup.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    module: add stub for is_module_percpu_address
    percpu, module: implement and use is_kernel/module_percpu_address()
    module: encapsulate percpu handling better and record percpu_size

    Linus Torvalds
     

03 Apr, 2010

17 commits

  • Now that software events use perf_arch_fetch_caller_regs() too, we
    need the stub version to be always built in for archs that don't
    implement it.

    Fixes the following build error in PARISC:

    kernel/built-in.o: In function `perf_event_task_sched_out':
    (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs'

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras

    Frederic Weisbecker
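
    As a generic illustration of the linking problem (an undefined reference
    when an architecture provides no implementation), one common pattern is a
    weak default symbol that a strong per-arch definition overrides. This is
    only an analogy for the always-built-in stub described above, not the
    perf code itself, and the symbol name here is invented:

    #include <stdio.h>

    /* Weak default: does nothing useful, but always satisfies the linker. */
    __attribute__((weak)) void arch_fetch_caller_regs(void)
    {
        printf("generic stub: no arch support\n");
    }

    /* An arch that implements it would simply provide a strong definition
     * of the same symbol in its own object file (not shown here). */

    int main(void)
    {
        arch_fetch_caller_regs();   /* resolves to whichever definition won */
        return 0;
    }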
     
  • * 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
    kgdb: Turn off tracing while in the debugger
    kgdb: use atomic_inc and atomic_dec instead of atomic_set
    kgdb: eliminate kgdb_wait(), all cpus enter the same way
    kgdbts,sh: Add in breakpoint pc offset for superh
    kgdb: have ebin2mem call probe_kernel_write once

    Linus Torvalds
     
  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
    Freezer: Only show the state of tasks refusing to freeze

    Linus Torvalds
     
  • The kernel debugger should turn off kernel tracing any time the
    debugger is active and restore it on resume.

    Signed-off-by: Jason Wessel
    Reviewed-by: Steven Rostedt

    Jason Wessel
     
  • Memory barriers should be used for the kgdb cpu synchronization. The
    atomic_set() does not imply a memory barrier.

    Reported-by: Will Deacon
    Signed-off-by: Jason Wessel

    Jason Wessel
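
    The userspace analogue of the distinction, in C11 atomics: a plain or
    relaxed atomic store is indivisible but implies no ordering, so a flag
    used as a synchronization point needs release/acquire semantics (or an
    explicit fence). The names below are illustrative:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;                 /* ordinary data */
    static atomic_int ready;            /* synchronization flag */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;
        /* A relaxed store would not guarantee the payload write is visible
         * before the flag; use release ordering (or a fence) instead. */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                           /* spin until the flag is published */
        printf("payload = %d\n", payload);   /* guaranteed to print 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }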
     
  • This is a kgdb architectural change to have all the cpus (master or
    slave) enter the same function.

    A cpu that hits an exception (wants to be the master cpu) will call
    kgdb_handle_exception() from the trap handler and then invoke a
    kgdb_roundup_cpu() to synchronize the other cpus and bring them into
    the kgdb_handle_exception() as well.

    A slave cpu will enter kgdb_handle_exception() from the
    kgdb_nmicallback() and set the exception state to note that the
    processor is a slave.

    Previously the slave cpu would have called kgdb_wait(). This change
    allows the debug core to change cpus without resuming the system in
    order to inspect arch specific cpu information.

    Signed-off-by: Jason Wessel

    Jason Wessel
     
  • Rather than call probe_kernel_write() one byte at a time, process the
    whole buffer locally and pass the entire result in one go. This way,
    architectures that need to do special handling based on the length can
    do so, and otherwise we only end up calling memcpy() once.

    [sonic.zhang@analog.com: Reported original problem and preliminary patch]

    Signed-off-by: Jason Wessel
    Signed-off-by: Sonic Zhang
    Signed-off-by: Mike Frysinger

    Jason Wessel
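
    The shape of the change, sketched in userspace terms: decode the escaped
    ('ebin') buffer into a local scratch area first, then hand the whole
    result to a single bulk copy. The '}' / XOR-0x20 escaping follows the GDB
    remote serial protocol as I understand it; treat the details as an
    illustration rather than the kgdb source:

    #include <stdio.h>
    #include <string.h>

    /* '}' (0x7d) escapes the next byte, which is stored XORed with 0x20. */
    static size_t ebin_decode(const unsigned char *in, size_t len,
                              unsigned char *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++) {
            if (in[i] == 0x7d && i + 1 < len)
                out[n++] = in[++i] ^ 0x20;
            else
                out[n++] = in[i];
        }
        return n;
    }

    int main(void)
    {
        /* The pair 0x7d 0x5d decodes to 0x7d; the rest is literal. */
        const unsigned char in[] = { 'a', 0x7d, 0x5d, 'b' };
        unsigned char scratch[sizeof(in)], target[sizeof(in)];

        size_t n = ebin_decode(in, sizeof(in), scratch);
        memcpy(target, scratch, n);     /* one copy for the whole buffer */

        printf("decoded %zu bytes: %02x %02x %02x\n",
               n, target[0], target[1], target[2]);
        return 0;
    }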
     
  • In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

    ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
    ENQUEUE_WAKING - the enqueue has relative vruntime due to
    having sched_class::task_waking() called,
    ENQUEUE_HEAD - the waking task should be placed on the head
    of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
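
    A small sketch of the interface shape, replacing separate wakeup/head
    arguments with one flags word; the flag values and the helper below are
    illustrative, not the kernel's definitions:

    #include <stdio.h>

    #define ENQUEUE_WAKEUP  0x01   /* waking a sleeping task               */
    #define ENQUEUE_WAKING  0x02   /* vruntime is relative (task_waking()) */
    #define ENQUEUE_HEAD    0x04   /* place at the head of the queue       */

    static void enqueue_task(const char *name, int flags)
    {
        printf("%s: wakeup=%d waking=%d head=%d\n", name,
               !!(flags & ENQUEUE_WAKEUP),
               !!(flags & ENQUEUE_WAKING),
               !!(flags & ENQUEUE_HEAD));
    }

    int main(void)
    {
        /* Before: enqueue_task(p, wakeup, head) took two fixed arguments
         * that could not grow; after: one flags argument that can. */
        enqueue_task("woken task", ENQUEUE_WAKEUP | ENQUEUE_WAKING);
        enqueue_task("boosted task", ENQUEUE_HEAD);
        return 0;
    }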
     
  • The cpuload calculation in calc_load_account_active() assumes
    rq->nr_uninterruptible will not change on an offline cpu after
    migrate_nr_uninterruptible(). However the recent migrate on wakeup
    changes broke that and would result in decrementing the offline cpu's
    rq->nr_uninterruptible.

    Fix this by accounting the nr_uninterruptible on the waking cpu.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that we hold the rq->lock over set_task_cpu() again, we can do
    away with most of the TASK_WAKING checks and reduce them again to
    set_cpus_allowed_ptr().

    Removes some conditionals from scheduling hot-paths.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noticed a few races with the TASK_WAKING usage on fork.

    - since TASK_WAKING is basically a spinlock, it should be IRQ safe
    - since we set TASK_WAKING (*) without holding rq->lock, it could
    be that there still is an rq->lock holder, thereby not actually
    providing full serialization.

    (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

    Cure the second issue by not setting TASK_WAKING in sched_fork(), but
    only temporarily in wake_up_new_task() while calling select_task_rq().

    Cure the first by holding rq->lock around the select_task_rq() call,
    this will disable IRQs, this however requires that we push down the
    rq->lock release into select_task_rq_fair()'s cgroup stuff.

    Because select_task_rq_fair() still needs to drop the rq->lock we
    cannot fully get rid of TASK_WAKING.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change makes the code much more correct compared to the deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • _cpu_down() changes the current task's affinity and then recovers it at
    the end. The problems are well known: we can't restore old_allowed if it
    was bound to the now-dead-cpu, and we can race with the userspace which
    can change cpu-affinity during unplug.

    _cpu_down() should not play with current->cpus_allowed at all. Instead,
    take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
    removes the dying cpu from cpu_online_mask.

    Signed-off-by: Oleg Nesterov
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
    This can race with other CPUs updating our ->cpus_allowed, and this
    looks meaningless to me.

    The task is current and running, it must have online cpus in ->cpus_allowed,
    the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
    this likely means we raced with set_cpus_allowed(), which was called
    for a reason; why should sched_exec() retry and call ->select_task_rq()
    again?

    Change the code to call sched_class->select_task_rq() directly and do
    nothing if the returned cpu is wrong after re-checking under rq->lock.

    From now on, task_struct->cpus_allowed is always stable under TASK_WAKING,
    and select_fallback_rq() is always called either under rq->lock or by the
    caller that owns TASK_WAKING (select_task_rq).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous patch preserved the retry logic, but it looks unneeded.

    __migrate_task() can only fail if we raced with migration after we dropped
    the lock, but in this case the caller of set_cpus_allowed/etc must initiate
    migration itself if ->on_rq == T.

    We already fixed p->cpus_allowed; the changes in the active/online masks must
    be visible to the racer, which should then migrate the task to an online cpu
    correctly.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
    lockless. We can race with set_cpus_allowed() running in parallel.

    Change it to take rq->lock around select_fallback_rq(). Note that it is not
    trivial to move this spin_lock() into select_fallback_rq(), we must recheck
    the task was not migrated after we take the lock and other callers do not
    need this lock.

    To avoid the races with other callers of select_fallback_rq() which rely on
    TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
    The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
    CPU anyway, and the subsequent __migrate_task() is just meaningless because
    p->se.on_rq must be false.

    Alternatively, we could change select_task_rq() to take rq->lock right
    after it calls sched_class->select_task_rq(), but this looks a bit ugly.

    Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    the try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T, and then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever because CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov