07 Nov, 2011

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
    powerpc/p3060qds: Add support for P3060QDS board
    powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
    powerpc/85xx: Make kexec to interate over online cpus
    powerpc/fsl_booke: Fix comment in head_fsl_booke.S
    powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
    powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
    powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
    powerpc/86xx: Correct Gianfar support for GE boards
    powerpc/cpm: Clear muram before it is in use.
    drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
    powerpc/fsl_msi: add support for "msi-address-64" property
    powerpc/85xx: Setup secondary cores PIR with hard SMP id
    powerpc/fsl-booke: Fix settlbcam for 64-bit
    powerpc/85xx: Adding DCSR node to dtsi device trees
    powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
    powerpc/85xx: fix PHYS_64BIT selection for P1022DS
    powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
    powerpc: respect mem= setting for early memory limit setup
    powerpc: Update corenet64_smp_defconfig
    powerpc: Update mpc85xx/corenet 32-bit defconfigs
    ...

    Fix up trivial conflicts in:
    - arch/powerpc/configs/40x/hcu4_defconfig
    removed stale file, edited elsewhere
    - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
    added opal and gelic drivers vs added ePAPR driver
    - drivers/tty/serial/8250.c
    moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support

    Linus Torvalds
     

26 Oct, 2011

2 commits

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    llist: Add back llist_add_batch() and llist_del_first() prototypes
    sched: Don't use tasklist_lock for debug prints
    sched: Warn on rt throttling
    sched: Unify the ->cpus_allowed mask copy
    sched: Wrap scheduler p->cpus_allowed access
    sched: Request for idle balance during nohz idle load balance
    sched: Use resched IPI to kick off the nohz idle balance
    sched: Fix idle_cpu()
    llist: Remove cpu_relax() usage in cmpxchg loops
    sched: Convert to struct llist
    llist: Add llist_next()
    irq_work: Use llist in the struct irq_work logic
    llist: Return whether list is empty before adding in llist_add()
    llist: Move cpu_relax() to after the cmpxchg()
    llist: Remove the platform-dependent NMI checks
    llist: Make some llist functions inline
    sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
    sched: Remove redundant test in check_preempt_tick()
    sched: Add documentation for bandwidth control
    sched: Return unused runtime on group dequeue
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
    rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
    rcu: Wire up RCU_BOOST_PRIO for rcutree
    rcu: Make rcu_torture_boost() exit loops at end of test
    rcu: Make rcu_torture_fqs() exit loops at end of test
    rcu: Permit rt_mutex_unlock() with irqs disabled
    rcu: Avoid having just-onlined CPU resched itself when RCU is idle
    rcu: Suppress NMI backtraces when stall ends before dump
    rcu: Prohibit grace periods during early boot
    rcu: Simplify unboosting checks
    rcu: Prevent early boot set_need_resched() from __rcu_pending()
    rcu: Dump local stack if cannot dump all CPUs' stacks
    rcu: Move __rcu_read_unlock()'s barrier() within if-statement
    rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
    rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
    rcu: Make rcu_implicit_dynticks_qs() locals be correct size
    rcu: Eliminate in_irq() checks in rcu_enter_nohz()
    nohz: Remove nohz_cpu_mask
    rcu: Document interpretation of RCU-lockdep splats
    rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
    ...

    Linus Torvalds
     

25 Oct, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     

06 Oct, 2011

5 commits

  • Avoid taking locks from debug prints; this avoids latencies on -rt
    and improves the reliability of the debug code.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently every sched_class::set_cpus_allowed() implementation has to
    copy the cpumask into task_struct::cpus_allowed; this is pointless, so
    do that copy once in the generic code (a sketch follows this entry).

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
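
    A minimal sketch of pulling that copy into one generic helper; the
    helper name do_set_cpus_allowed() and the exact hook usage here are
    assumptions, not a quote of the patch:

        void do_set_cpus_allowed(struct task_struct *p,
                                 const struct cpumask *new_mask)
        {
                /* class-specific bookkeeping, if the class has any */
                if (p->sched_class->set_cpus_allowed)
                        p->sched_class->set_cpus_allowed(p, new_mask);

                /* the copy every implementation used to do, done once here */
                cpumask_copy(&p->cpus_allowed, new_mask);
        }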
     
  • This patch is preparatory for the migrate_disable() implementation, but
    stands on its own and provides a cleanup.

    It currently only converts the sites required for task placement (an
    accessor sketch follows this entry). Kosaki-san once mentioned replacing
    cpus_allowed with a proper cpumask_t instead of the NR_CPUS-sized array
    it currently is; that would also require something like this.

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
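
    A minimal sketch of the wrapper this introduces; the accessor name
    tsk_cpus_allowed() is an assumption based on the description:

        /* single point of access for a task's allowed-cpus mask */
        static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p)
        {
                return &p->cpus_allowed;
        }

        /* callers then write cpumask_test_cpu(cpu, tsk_cpus_allowed(p))
         * rather than touching p->cpus_allowed directly */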
     
  • rq's idle_at_tick is set to idle/busy during the timer tick depending on
    whether the cpu was idle or not. This is used later by the load balancing
    that is done in softirq context (which is a process context in -RT
    kernels).

    For nohz kernels, the cpu doing nohz idle load balancing on behalf of all
    the idle cpus may have a stale value in its rq->idle_at_tick (recorded at
    its last timer tick, presumably while it was busy).

    As the nohz idle load balancing is done in the same place as the regular
    load balancing, it was bailing out whenever it saw the rq's idle_at_tick
    not set, leading to poor system utilization.

    Rename rq's idle_at_tick to idle_balance and set it when someone requests
    nohz idle balancing on an idle cpu.

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • The current use of the smp call function machinery to kick the nohz idle
    balance can deadlock in the following scenario:

    1. cpu-A did a generic_exec_single() to cpu-B and, after queuing its call
    single data (csd) to the call single queue, cpu-A took a timer interrupt.
    The actual IPI to cpu-B to process the call single queue has not yet been
    sent.

    2. As part of the timer interrupt handler, cpu-A decides to kick cpu-B
    for idle load balancing (it sets cpu-B's rq->nohz_balance_kick to 1)
    and __smp_call_function_single() with nowait queues the csd to cpu-B's
    queue. But generic_exec_single() won't send an IPI to cpu-B, as the call
    single queue was not empty.

    3. cpu-A is busy with a lot of interrupts.

    4. Meanwhile cpu-B is entering and exiting idle and notices that its
    rq->nohz_balance_kick is set to '1'. So it goes ahead, does the idle
    load balancing and clears its rq->nohz_balance_kick.

    5. At this point, the csd queued in step 2 above is still locked and
    waiting to be serviced on cpu-B.

    6. cpu-A is still busy with interrupt load, gets another timer interrupt,
    and as part of it decides to kick cpu-B for another round of idle load
    balancing (since it finds cpu-B's rq->nohz_balance_kick cleared in step 4
    above) and does __smp_call_function_single() with the same csd, which is
    still locked.

    7. And we get a deadlock waiting for the csd_lock() in the
    __smp_call_function_single().

    The main issue here is that cpu-B can service the idle load balancing
    kick request from cpu-A even without receiving the IPI, and this leads
    to multiple __smp_call_function_single() calls on the same csd,
    resulting in the deadlock.

    To kick a cpu, the scheduler already has the reschedule vector reserved.
    Use that mechanism (kick_process()) instead of the generic smp call
    function mechanism to kick off the nohz idle load balancing and avoid
    the deadlock (a sketch follows this entry).

    [ This issue is present in 2.6.35+ kernels, but it is marked for -stable
    only from v3.0+ as the proposed fix depends on the scheduler_ipi() that
    was introduced recently. ]

    Reported-by: Prarit Bhargava
    Signed-off-by: Suresh Siddha
    Cc: stable@kernel.org # v3.0+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
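
    A minimal sketch of kicking the target cpu through the reserved
    reschedule vector instead of a csd; the helper name is hypothetical and
    only rq->nohz_balance_kick is taken from the description above:

        /* hypothetical kick helper */
        static void nohz_kick_ilb_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                if (rq->nohz_balance_kick)
                        return;                 /* a kick is already pending */

                rq->nohz_balance_kick = 1;
                smp_send_reschedule(cpu);       /* reserved resched vector */
        }

        /* on the target, scheduler_ipi() sees the pending kick while idle
         * and raises SCHED_SOFTIRQ to run the nohz idle load balance */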
     

04 Oct, 2011

3 commits

  • On -rt we observed hackbench waking all 400 tasks to a single cpu.
    This is because of select_idle_sibling()'s interaction with the new
    IPI-based wakeup scheme.

    The existing idle_cpu() test only checks whether the current task on
    that cpu is the idle task; it does not take already queued tasks into
    account, nor tasks that are queued to be woken (a stricter check is
    sketched after this entry).

    If the remote wakeup IPIs come hard enough, there won't be time to
    schedule away from the idle task, and we would thus keep thinking the
    cpu was in fact idle, even though several hundred tasks were already
    runnable.

    We couldn't reproduce on mainline, but there's no reason it couldn't
    happen.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
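
    A minimal sketch of the stricter test described above; the rq->wake_list
    llist field is assumed from the related wake-list entries in this series:

        int idle_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                if (rq->curr != rq->idle)
                        return 0;               /* something else is running */

                if (rq->nr_running)
                        return 0;               /* tasks already queued */

        #ifdef CONFIG_SMP
                if (!llist_empty(&rq->wake_list))
                        return 0;               /* tasks queued to be woken */
        #endif

                return 1;
        }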
     
  • Use the generic llist primitives.

    We had a private lockless list implementation in the scheduler's
    wake-list code; now that we have a generic llist implementation that
    provides all the required operations, switch to it (a usage sketch
    follows this entry).

    This patch is not expected to change any behavior.

    Signed-off-by: Peter Zijlstra
    Cc: Huang Ying
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
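
    A minimal, self-contained sketch of the generic llist API being adopted;
    the surrounding struct and helper names are illustrative, not the
    scheduler's actual ones:

        #include <linux/kernel.h>
        #include <linux/llist.h>

        struct wake_entry {
                struct llist_node node;
                int cpu;
        };

        static LLIST_HEAD(pending_wakeups);

        static void queue_wakeup(struct wake_entry *e)
        {
                /* llist_add() returns true if the list was empty beforehand,
                 * telling the producer whether an IPI is needed */
                if (llist_add(&e->node, &pending_wakeups)) {
                        /* list was empty: e.g. smp_send_reschedule(e->cpu); */
                }
        }

        static void process_wakeups(void)
        {
                struct llist_node *head = llist_del_all(&pending_wakeups);
                struct wake_entry *e;

                llist_for_each_entry(e, head, node)
                        pr_info("waking entry for cpu %d\n", e->cpu);
        }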
     
  • Merge reason: pick up the latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Sep, 2011

1 commit

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>

    static pthread_barrier_t barrier;

    /* burner thread: spins so that its per-thread clock keeps advancing */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1)
                    __asm__ __volatile__("" : : : "memory");
            return NULL;
    }

    int main(void)
    {
            clockid_t process_clock, my_thread_clock, th_clock;
            struct timespec process_before, process_after;
            struct timespec me_before, me_after;
            struct timespec th_before, th_after;
            struct timespec sleeptime;
            unsigned long diff;
            pthread_t th;
            int err;

            err = clock_getcpuclockid(0, &process_clock);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
            if (err)
                    return 1;

            pthread_barrier_init(&barrier, NULL, 2);
            err = pthread_create(&th, NULL, chew_cpu, NULL);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(th, &th_clock);
            if (err)
                    return 1;

            pthread_barrier_wait(&barrier);

            err = clock_gettime(process_clock, &process_before);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_before);
            if (err)
                    return 1;

            err = clock_gettime(th_clock, &th_before);
            if (err)
                    return 1;

            /* main thread sleeps 0.5s while the burner thread runs */
            sleeptime.tv_sec = 0;
            sleeptime.tv_nsec = 500000000;
            nanosleep(&sleeptime, NULL);

            err = clock_gettime(th_clock, &th_after);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_after);
            if (err)
                    return 1;

            err = clock_gettime(process_clock, &process_after);
            if (err)
                    return 1;

            diff = process_after.tv_nsec - process_before.tv_nsec;
            printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   process_before.tv_sec, process_before.tv_nsec,
                   process_after.tv_sec, process_after.tv_nsec, diff);
            diff = th_after.tv_nsec - th_before.tv_nsec;
            printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   th_before.tv_sec, th_before.tv_nsec,
                   th_after.tv_sec, th_after.tv_nsec, diff);
            diff = me_after.tv_nsec - me_before.tv_nsec;
            printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   me_before.tv_sec, me_before.tv_nsec,
                   me_after.tv_sec, me_after.tv_nsec, diff);

            return 0;
    }

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected; sum_exec_runtime is a
    64-bit value and could be changed on another cpu at the same time.

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

29 Sep, 2011

2 commits

  • RCU no longer uses this global variable, nor does anyone else. This
    commit therefore removes this variable. This reduces memory footprint
    and also removes some atomic instructions and memory barriers from
    the dyntick-idle path.

    Signed-off-by: Alex Shi
    Signed-off-by: Paul E. McKenney

    Shi, Alex
     
  • Long ago, using TREE_RCU with PREEMPT would result in "scheduling
    while atomic" diagnostics if you blocked in an RCU read-side critical
    section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
    this diagnostic. This commit therefore adds a replacement diagnostic
    based on PROVE_RCU.

    Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
    used for things that have nothing to do with rcu_dereference(), rename
    lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
    argument that is a string indicating what is suspicious. This third
    argument is passed in from a new third argument to rcu_lockdep_assert().
    Update all calls to rcu_lockdep_assert() to add an informative third
    argument (a sketch of the resulting assertion follows this entry).

    Also, add a pair of rcu_lockdep_assert() calls from within
    rcu_note_context_switch(), one complaining if a context switch occurs
    in an RCU-bh read-side critical section and another complaining if a
    context switch occurs in an RCU-sched read-side critical section.
    These are present only if the PROVE_RCU kernel parameter is enabled.

    Finally, fix some checkpatch whitespace complaints in lockdep.c.

    Again, you must enable PROVE_RCU to see these new diagnostics. But you
    are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
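
    A minimal sketch of what the three-argument assertion described above
    ends up looking like; the macro body is inferred from this description,
    so treat the details as assumptions:

        #define rcu_lockdep_assert(c, s)                                        \
                do {                                                            \
                        static bool __warned;                                   \
                        if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
                                __warned = true;                                \
                                lockdep_rcu_suspicious(__FILE__, __LINE__, s);  \
                        }                                                       \
                } while (0)

        /* a caller of the kind added to rcu_note_context_switch()
         * (the message string is illustrative): */
        rcu_lockdep_assert(!rcu_read_lock_bh_held(),
                           "context switch inside an RCU-bh read-side critical section");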
     

26 Sep, 2011

1 commit

  • Commit c259e01a1ec ("sched: Separate the scheduler entry for
    preemption") contained a boo-boo wrecking wchan output. It forgot to
    put the new schedule() function in the __sched section, so it no longer
    gets properly ignored for things like wchan (see the illustration after
    this entry).

    Tested-by: Simon Kirby
    Cc: stable@kernel.org # 2.6.39+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
    Signed-off-by: Ingo Molnar

    Simon Kirby
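
    For context, a small illustration of the annotation in question; the
    function here is hypothetical, the point is only where __sched puts it:

        #include <linux/sched.h>

        /* __sched places the code in the .sched.text section, which the
         * get_wchan()/wchan stack walk skips over */
        static int __sched wait_for_my_event(void)      /* hypothetical */
        {
                schedule();
                return 0;
        }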
     

29 Aug, 2011

4 commits

  • The current cgroup context switch code was incorrect leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 <not counted> cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any cgroup
    switches in case the outgoing and incoming tasks are in the same cgroup
    (a sketch of that check follows this entry).

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ /pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 <not counted> cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
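
    A minimal sketch of the bypass described above: skip the perf cgroup
    switch when the outgoing and incoming tasks share a cgroup (the helper
    shown is a simplification, not the full patch):

        static void perf_cgroup_task_switch(struct task_struct *prev,
                                            struct task_struct *next)
        {
                struct perf_cgroup *cgrp_prev = perf_cgroup_from_task(prev);
                struct perf_cgroup *cgrp_next = perf_cgroup_from_task(next);

                /* same cgroup: nothing to switch, skip the expensive path */
                if (cgrp_prev == cgrp_next)
                        return;

                perf_cgroup_switch(prev, PERF_CGROUP_SWOUT);
                perf_cgroup_switch(next, PERF_CGROUP_SWIN);
        }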
     
  • This patch fixes the following memory leak:

    unreferenced object 0xffff880107266800 (size 512):
    comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x187/0x28b
    [] kmemleak_alloc+0x73/0x98
    [] __kmalloc_node+0x104/0x159
    [] kzalloc_node.clone.97+0x15/0x17
    [] build_sched_domains+0xb7/0x7f3
    [] partition_sched_domains+0x1db/0x24a
    [] do_rebuild_sched_domains+0x3b/0x47
    [] rebuild_sched_domains+0x10/0x12
    [] sched_power_savings_store+0x6c/0x7b
    [] sched_mc_power_savings_store+0x16/0x18
    [] sysdev_class_store+0x20/0x22
    [] sysfs_write_file+0x108/0x144
    [] vfs_write+0xaf/0x102
    [] sys_write+0x4d/0x74
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    Signed-off-by: WANG Cong
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org # 3.0
    Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
    Signed-off-by: Ingo Molnar

    WANG Cong
     
  • There is no real reason to run blk_schedule_flush_plug() with
    interrupts and preemption disabled.

    Move it into schedule() and call it when the task is going voluntarily
    to sleep. There might be false positives when the task is woken
    between that call and actually scheduling, but that's not really
    different from being woken immediately after switching away.

    This fixes a deadlock in the scheduler where the
    blk_schedule_flush_plug() callchain enables interrupts and thereby
    allows a wakeup to happen of the task that's going to sleep.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Block-IO and workqueues call into notifier functions from the
    scheduler core code with interrupts and preemption disabled. These
    calls should be made before entering the scheduler core.

    To simplify this, separate the scheduler core code into
    __schedule(). __schedule() is directly called from the places which
    set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
    checks into schedule(), so they are only called when a task voluntarily
    goes to sleep (a sketch of the split follows this entry).

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
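
    A minimal sketch of the resulting split, covering this entry and the
    blk_schedule_flush_plug() entry above; sched_submit_work() is an assumed
    helper name:

        static inline void sched_submit_work(struct task_struct *tsk)
        {
                if (!tsk->state)
                        return;         /* still runnable: not going to sleep */

                /* flush the block plug only on a voluntary sleep, while
                 * interrupts and preemption are still enabled */
                if (blk_needs_flush_plug(tsk))
                        blk_schedule_flush_plug(tsk);
        }

        asmlinkage void __sched schedule(void)
        {
                struct task_struct *tsk = current;

                sched_submit_work(tsk);
                __schedule();   /* core entry, also called directly from the
                                   PREEMPT_ACTIVE preemption paths */
        }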
     

14 Aug, 2011

15 commits

  • When a local cfs_rq blocks we return the majority of its remaining quota to the
    global bandwidth pool for use by other runqueues.

    We do this only when the quota is current and there is more than
    min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

    In the case where there are throttled runqueues and we have sufficient
    bandwidth to meter out a slice, a second timer is kicked off to handle this
    delivery, unthrottling where appropriate.

    Using a 'worst case' antagonist which executes on each cpu
    for 1ms before moving onto the next on a fairly large machine:

    no quota generations:

    197.47 ms /cgroup/a/cpuacct.usage
    199.46 ms /cgroup/a/cpuacct.usage
    205.46 ms /cgroup/a/cpuacct.usage
    198.46 ms /cgroup/a/cpuacct.usage
    208.39 ms /cgroup/a/cpuacct.usage

    Since we are allowed to use "stale" quota our usage is effectively bounded by
    the rate of input into the global pool and performance is relatively stable.

    with quota generations [1s increments]:

    119.58 ms /cgroup/a/cpuacct.usage
    119.65 ms /cgroup/a/cpuacct.usage
    119.64 ms /cgroup/a/cpuacct.usage
    119.63 ms /cgroup/a/cpuacct.usage
    119.60 ms /cgroup/a/cpuacct.usage

    The large deficit here is due to quota generations (/intentionally/) preventing
    us from now using previously stranded slack quota. The cost is that this quota
    becomes unavailable.

    with quota generations and quota return:

    200.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    198.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    200.06 ms /cgroup/a/cpuacct.usage

    By returning unused quota we're able to both stably consume our desired
    quota and prevent unintentional overages due to the abuse of slack quota
    from previous quota periods (especially on a large machine); the
    slack-return path is sketched after this entry.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
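
    A minimal sketch of the slack-return path on dequeue; field names follow
    this and the related entries, and the exact conditions are assumptions:

        static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                s64 slack = cfs_rq->runtime_remaining - min_cfs_rq_quota;

                /* keep min_cfs_rq_quota locally, and only return quota that
                 * is still current (same expiration as the global pool) */
                if (slack <= 0 ||
                    cfs_rq->runtime_expires != cfs_b->runtime_expires)
                        return;

                raw_spin_lock(&cfs_b->lock);
                if (cfs_b->quota != RUNTIME_INF)
                        cfs_b->runtime += slack;
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining -= slack;
        }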
     
  • This change introduces statistics exports for the cpu sub-system; these
    are added through the use of a stat file similar to that exported by
    other subsystems (the read side is sketched after this entry).

    The following exports are included:

    nr_periods: number of periods in which execution occurred
    nr_throttled: the number of the above periods in which execution was
    throttled
    throttled_time: cumulative wall-time that any cpus have been throttled
    for this group

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
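
    A hedged sketch of the cgroupfs read side for these statistics; the
    cftype callback shown follows the cgroup map-file interface of that era
    and should be read as illustrative:

        static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
                                  struct cgroup_map_cb *cb)
        {
                struct task_group *tg = cgroup_tg(cgrp);
                struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

                cb->fill(cb, "nr_periods", cfs_b->nr_periods);
                cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
                cb->fill(cb, "throttled_time", cfs_b->throttled_time);

                return 0;
        }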
     
  • Throttled tasks are invisible to cpu-offline since they are not eligible
    for selection by pick_next_task(). The regular 'escape' path for a thread
    blocked at offline is via ttwu->select_task_rq; however, this will not
    handle a throttled group since there are no individual thread wakeups on
    an unthrottle.

    Resolve this by unthrottling offline cpus so that threads can be migrated.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • From the perspective of load-balance and shares distribution, throttled
    entities should be invisible.

    However, both of these operations work on 'active' lists and are not
    inherently aware of what group hierarchies may be present. In some cases this
    may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
    while in others (e.g. update_shares()) it is more difficult to compute without
    incurring some O(n^2) costs.

    Instead, track hierarchical throttled state at the time of transition.
    This allows us to easily identify whether an entity belongs to a throttled
    hierarchy and avoid incorrect interactions with it (a small check is
    sketched after this entry).

    Also, when an entity leaves a throttled hierarchy we need to advance its
    time averaging for shares averaging so that the elapsed throttled time is not
    considered as part of the cfs_rq's operation.

    We also use this information to prevent buddy interactions in the wakeup and
    yield_to() paths.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
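
    A minimal sketch of the hierarchical state check; the per-cfs_rq counter
    name is an assumption consistent with the description (it counts how
    many levels, including itself, are currently throttled):

        static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
        {
                return cfs_rq->throttled;          /* throttled directly */
        }

        static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
        {
                return cfs_rq->throttle_count > 0; /* self or any ancestor */
        }

        /* load balance and update_shares() then skip entities for which
         * throttled_hierarchy() is true */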
     
  • Extend walk_tg_tree to accept a positional argument

    static int walk_tg_tree_from(struct task_group *from,
    tg_visitor down, tg_visitor up, void *data)

    Existing semantics are preserved; the caller must hold rcu_read_lock()
    or a sufficient analogue (a sketch of the walk follows this entry).

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
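
    A sketch of the positional walk, mirroring the existing walk_tg_tree()
    pattern (down() applied on the way down, up() on the way back, starting
    at 'from'); treat it as illustrative rather than the exact patch:

        static int walk_tg_tree_from(struct task_group *from,
                                     tg_visitor down, tg_visitor up, void *data)
        {
                struct task_group *parent, *child;
                int ret;

                parent = from;
        down:
                ret = (*down)(parent, data);
                if (ret)
                        goto out;
                list_for_each_entry_rcu(child, &parent->children, siblings) {
                        parent = child;
                        goto down;
        up:
                        continue;
                }
                ret = (*up)(parent, data);
                if (ret || parent == from)
                        goto out;

                child = parent;
                parent = parent->parent;
                if (parent)
                        goto up;
        out:
                return ret;
        }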
     
  • At the start of each period we refresh the global bandwidth pool. At this
    time we must also unthrottle any cfs_rq entities that are now within
    bandwidth once more (as quota permits).

    Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
    and their entities re-enqueued.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Now that consumption is tracked (via update_curr()) we add support to
    throttle group entities (and their corresponding cfs_rqs) in the case
    where there is no run-time remaining.

    Throttled entities are dequeued to prevent scheduling; additionally we
    mark them as throttled (using cfs_rq->throttled) to prevent them from
    becoming re-enqueued until they are unthrottled. A list of a task_group's
    throttled entities is maintained on the cfs_bandwidth structure (a
    condensed sketch follows this entry).

    Note: While the machinery for throttling is added in this patch the act of
    throttling an entity exceeding its bandwidth is deferred until later within
    the series.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
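
    A condensed sketch of the mechanics: dequeue the group's entity so it
    cannot be picked, flag the cfs_rq, and park it on the bandwidth
    structure's list (the hierarchy walk and statistics updates are omitted;
    list and field names are assumptions):

        static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

                if (se->on_rq)
                        dequeue_entity(cfs_rq_of(se), se, DEQUEUE_SLEEP);

                cfs_rq->throttled = 1;

                raw_spin_lock(&cfs_b->lock);
                list_add_tail_rcu(&cfs_rq->throttled_list,
                                  &cfs_b->throttled_cfs_rq);
                raw_spin_unlock(&cfs_b->lock);
        }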
     
  • Since quota is managed using a global state but consumed on a per-cpu basis
    we need to ensure that our per-cpu state is appropriately synchronized.
    Most importantly, runtime that is stale (from a previous period) should
    not be locally consumable.

    We take advantage of existing sched_clock synchronization about the jiffy to
    efficiently detect whether we have (globally) crossed a quota boundary above.

    One catch is that the direction of spread on sched_clock is undefined;
    specifically, we don't know whether our local clock is behind or ahead
    of the one responsible for the current expiration time.

    Fortunately we can differentiate these cases by considering whether the
    global deadline has advanced. If it has not, then we assume our clock to
    be "fast" and advance our local expiration; otherwise, we know the
    deadline has truly passed and we expire our local runtime (this check is
    sketched after this entry).

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
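
    A minimal sketch of the expiry decision described above; field names
    follow the surrounding entries and the exact form is an assumption:

        static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                struct rq *rq = rq_of(cfs_rq);

                /* nothing to do while the local deadline is still ahead */
                if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
                        return;

                if (cfs_rq->runtime_expires == cfs_b->runtime_expires) {
                        /* the global deadline has not advanced: our clock is
                         * merely "fast", so push the local expiry out a bit */
                        cfs_rq->runtime_expires += TICK_NSEC;
                } else {
                        /* the deadline truly passed: expire local runtime */
                        cfs_rq->runtime_remaining = 0;
                }
        }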
     
  • This patch adds a per-task_group timer which handles the refresh of the global
    CFS bandwidth pool.

    Since the RT pool is using a similar timer there's some small refactoring to
    share this support.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota; no action is taken in the case
    that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned (the charging
    path is sketched after this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
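
    A minimal sketch of the charging path described above: update_curr()
    charges the local cfs_rq and refills from the global pool under
    cfs_bandwidth->lock when the local allotment runs out (the slice helper
    name is an assumption):

        static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                u64 amount = 0;

                if (!cfs_rq->runtime_enabled)
                        return;

                cfs_rq->runtime_remaining -= delta_exec;
                if (cfs_rq->runtime_remaining > 0)
                        return;

                /* out of local runtime: try to draw a slice from the pool */
                raw_spin_lock(&cfs_b->lock);
                if (cfs_b->runtime > 0) {
                        amount = min(cfs_b->runtime, sched_cfs_bandwidth_slice());
                        cfs_b->runtime -= amount;
                }
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining += amount;
        }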
     
  • Add constraints validation for CFS bandwidth hierarchies.

    Validate that:
    max(child bandwidth) <= parent bandwidth
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • In this patch we introduce the notion of CFS bandwidth, partitioned into
    globally unassigned bandwidth, and locally claimed bandwidth.

    - The global bandwidth is per task_group; it represents a pool of
    unclaimed bandwidth that cfs_rqs can allocate from.
    - The local bandwidth is tracked per cfs_rq; this represents allotments
    from the global pool, i.e. bandwidth assigned to a specific cpu.

    Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
    - cpu.cfs_period_us : the bandwidth period in usecs
    - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be
    allowed to consume over the period above.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Introduce hierarchical task accounting for the group scheduling case in
    CFS, and promote the responsibility for maintaining rq->nr_running to
    the scheduling classes.

    The primary motivation for this is that with scheduling classes supporting
    bandwidth throttling it is possible for entities participating in throttled
    sub-trees to not have root visible changes in rq->nr_running across activate
    and de-activate operations. This in turn leads to incorrect idle and
    weight-per-task load balance decisions.

    This also allows us to make a small fixlet to the fastpath in pick_next_task()
    under group scheduling.

    Note: this issue also exists with the existing sched_rt throttling mechanism.
    This patch does not address that.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
    been handled for an RT parent gives birth to a deranged mutant child with
    non-RT policy, but RT prio and sched_class.

    Move PI leakage protection up, always set priorities and weight, and if the
    child is leaving RT class, reset rt_priority to the proper value.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Since commit a2d47777 ("sched: fix stale value in average load per task")
    the variable rq->avg_load_per_task is no longer required. Remove it.

    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan H. Schönherr
     

26 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds