Eric Lee / linux-smarc-t335x-v3.2

30 Sep, 2011

1 commit

d670ec131 posix-cpu-timers: Cure SMP wobbles ... Browse Code »

David reported:

Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.

Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.

I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).

For example:

[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread: before(0.000081692) after(0.498316431) diff(498234739)
self: before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$

The diff of 'process' should always be >= the diff of 'thread'.

I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.

---
#include
#include
#include
#include
#include
#include
#include
#include

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1)
__asm__ __volatile__("" : : : "memory");
return NULL;
}

int main(void)
{
clockid_t process_clock, my_thread_clock, th_clock;
struct timespec process_before, process_after;
struct timespec me_before, me_after;
struct timespec th_before, th_after;
struct timespec sleeptime;
unsigned long diff;
pthread_t th;
int err;

err = clock_getcpuclockid(0, &process_clock);
if (err)
return 1;

err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
if (err)
return 1;

pthread_barrier_init(&barrier, NULL, 2);
err = pthread_create(&th, NULL, chew_cpu, NULL);
if (err)
return 1;

err = pthread_getcpuclockid(th, &th_clock);
if (err)
return 1;

pthread_barrier_wait(&barrier);

err = clock_gettime(process_clock, &process_before);
if (err)
return 1;

err = clock_gettime(my_thread_clock, &me_before);
if (err)
return 1;

err = clock_gettime(th_clock, &th_before);
if (err)
return 1;

sleeptime.tv_sec = 0;
sleeptime.tv_nsec = 500000000;
nanosleep(&sleeptime, NULL);

err = clock_gettime(th_clock, &th_after);
if (err)
return 1;

err = clock_gettime(my_thread_clock, &me_after);
if (err)
return 1;

err = clock_gettime(process_clock, &process_after);
if (err)
return 1;

diff = process_after.tv_nsec - process_before.tv_nsec;
printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
process_before.tv_sec, process_before.tv_nsec,
process_after.tv_sec, process_after.tv_nsec, diff);
diff = th_after.tv_nsec - th_before.tv_nsec;
printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
th_before.tv_sec, th_before.tv_nsec,
th_after.tv_sec, th_after.tv_nsec, diff);
diff = me_after.tv_nsec - me_before.tv_nsec;
printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
me_before.tv_sec, me_before.tv_nsec,
me_after.tv_sec, me_after.tv_nsec, diff);

return 0;
}

This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.

Reported-by: David Miller
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller
Signed-off-by: Thomas Gleixner

Peter Zijlstra
14 years ago

26 Sep, 2011

1 commit

6ebbe7a07 sched: Fix up wchan borkage ... Browse Code »

Commit c259e01a1ec ("sched: Separate the scheduler entry for
preemption") contained a boo-boo wrecking wchan output. It forgot to
put the new schedule() function in the __sched section and thereby
doesn't get properly ignored for things like wchan.

Tested-by: Simon Kirby
Cc: stable@kernel.org # 2.6.39+
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
Signed-off-by: Ingo Molnar

Simon Kirby
14 years ago

08 Sep, 2011

1 commit

e81b693c0 Merge branch 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip ... Browse Code »

* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
sched: Fix a memory leak in __sdt_free()
sched: Move blk_schedule_flush_plug() out of __schedule()
sched: Separate the scheduler entry for preemption

Linus Torvalds
14 years ago

29 Aug, 2011

4 commits

a8d757ef0 perf events: Fix slow and broken cgroup context switch code ... Browse Code »

The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:

$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s

Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s

That's a 37% penalty.

Note that pong is not even in the monitored cgroup.

The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

Performance counter stats for 'sleep 100':

CPU1 cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz

The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.

The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.

With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s

Start perf stat with cgroup:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Run pong outside the cgroup:
$ /pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s

The penalty is now less than 2%.

And the results for perf stat are correct:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Performance counter stats for 'sleep 10':

CPU1 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz

Now perf stat reports the correct counts for
for the non cgroup event.

If we run pong inside the cgroup, then we also get the
correct counts:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Performance counter stats for 'sleep 10':

CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz

10.001457237 seconds time elapsed

Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar

Stephane Eranian
14 years ago
feff8fa00 sched: Fix a memory leak in __sdt_free() ... Browse Code »

This patch fixes the following memory leak:

unreferenced object 0xffff880107266800 (size 512):
comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] create_object+0x187/0x28b
[] kmemleak_alloc+0x73/0x98
[] __kmalloc_node+0x104/0x159
[] kzalloc_node.clone.97+0x15/0x17
[] build_sched_domains+0xb7/0x7f3
[] partition_sched_domains+0x1db/0x24a
[] do_rebuild_sched_domains+0x3b/0x47
[] rebuild_sched_domains+0x10/0x12
[] sched_power_savings_store+0x6c/0x7b
[] sched_mc_power_savings_store+0x16/0x18
[] sysdev_class_store+0x20/0x22
[] sysfs_write_file+0x108/0x144
[] vfs_write+0xaf/0x102
[] sys_write+0x4d/0x74
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

Signed-off-by: WANG Cong
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org # 3.0
Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
Signed-off-by: Ingo Molnar

WANG Cong
14 years ago
9c40cef2b sched: Move blk_schedule_flush_plug() out of __schedule() ... Browse Code »

There is no real reason to run blk_schedule_flush_plug() with
interrupts and preemption disabled.

Move it into schedule() and call it when the task is going voluntarily
to sleep. There might be false positives when the task is woken
between that call and actually scheduling, but that's not really
different from being woken immediately after switching away.

This fixes a deadlock in the scheduler where the
blk_schedule_flush_plug() callchain enables interrupts and thereby
allows a wakeup to happen of the task that's going to sleep.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
Signed-off-by: Ingo Molnar

Thomas Gleixner
14 years ago
c259e01a1 sched: Separate the scheduler entry for preemption ... Browse Code »

Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.

To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntary
goes to sleep.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar

Thomas Gleixner
14 years ago

26 Jul, 2011

1 commit

d3ec4844d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...

Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)

Linus Torvalds
14 years ago

25 Jul, 2011

1 commit

5fabc487c Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
KVM: IOMMU: Disable device assignment without interrupt remapping
KVM: MMU: trace mmio page fault
KVM: MMU: mmio page fault support
KVM: MMU: reorganize struct kvm_shadow_walk_iterator
KVM: MMU: lockless walking shadow page table
KVM: MMU: do not need atomicly to set/clear spte
KVM: MMU: introduce the rules to modify shadow page table
KVM: MMU: abstract some functions to handle fault pfn
KVM: MMU: filter out the mmio pfn from the fault pfn
KVM: MMU: remove bypass_guest_pf
KVM: MMU: split kvm_mmu_free_page
KVM: MMU: count used shadow pages on prepareing path
KVM: MMU: rename 'pt_write' to 'emulate'
KVM: MMU: cleanup for FNAME(fetch)
KVM: MMU: optimize to handle dirty bit
KVM: MMU: cache mmio info on page fault path
KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
KVM: MMU: do not update slot bitmap if spte is nonpresent
KVM: MMU: fix walking shadow page table
KVM guest: KVM Steal time registration
...

Linus Torvalds
14 years ago

23 Jul, 2011

3 commits

bdc7ccfc0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel… ... Browse Code »

…/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
sched: Replace use of entity_key()
sched: Separate group-scheduling code more clearly
sched: Reorder root_domain to remove 64 bit alignment padding
sched: Do not attempt to destroy uninitialized rt_bandwidth
sched: Remove unused function cpu_cfs_rq()
sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
sched, cgroup: Optimize load_balance_fair()
sched: Don't update shares twice on on_rq parent
sched: update correct entity's runtime in check_preempt_wakeup()
xtensa: Use generic config PREEMPT definition
h8300: Use generic config PREEMPT definition
m32r: Use generic PREEMPT config
sched: Skip autogroup when looking for all rt sched groups
sched: Simplify mutex_spin_on_owner()
sched: Remove rcu_read_lock() from wake_affine()
sched: Generalize sleep inside spinlock detection
sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
sched: Isolate preempt counting in its own config option
sched: Remove pointless in_atomic() definition check
...

Linus Torvalds
14 years ago
4d4abdcb1 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/… ... Browse Code »

…git/tip/linux-2.6-tip

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
perf: Remove the nmi parameter from the oprofile_perf backend
x86, perf: Make copy_from_user_nmi() a library function
perf: Remove perf_event_attr::type check
x86, perf: P4 PMU - Fix typos in comments and style cleanup
perf tools: Make test use the preset debugfs path
perf tools: Add automated tests for events parsing
perf tools: De-opt the parse_events function
perf script: Fix display of IP address for non-callchain path
perf tools: Fix endian conversion reading event attr from file header
perf tools: Add missing 'node' alias to the hw_cache[] array
perf probe: Support adding probes on offline kernel modules
perf probe: Add probed module in front of function
perf probe: Introduce debuginfo to encapsulate dwarf information
perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
perf probe: Remove redundant dwarf functions
perf probe: Move strtailcmp to string.c
perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
tracing/kprobe: Update symbol reference when loading module
tracing/kprobes: Support module init function probing
kprobes: Return -ENOENT if probe point doesn't exist
...

Linus Torvalds
14 years ago
75b56ec29 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kern… ... Browse Code »

…el/git/tip/linux-2.6-tip

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix lockdep_no_validate against IRQ states
mutex: Make mutex_destroy() an inline function
plist: Remove the need to supply locks to plist heads
lockup detector: Fix reference to the non-existent CONFIG_DETECT_SOFTLOCKUP option

Linus Torvalds
14 years ago

22 Jul, 2011

6 commits

acb5a9ba3 sched: Separate group-scheduling code more clearly ... Browse Code »

Clean up cfs/rt runqueue initialization by moving group scheduling
related code into the corresponding functions.

Also, keep group scheduling as an add-on, so that things are only done
additionally, i. e. remove the init_*_rq() calls from init_tg_*_entry().
(This removes a redundant initalization during sched_init()).

In case of group scheduling rt_rq->highest_prio.curr is now initialized
twice, but adding another #ifdef seems not worth it.

Signed-off-by: Jan H. Schönherr
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1310661163-16606-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar

Jan H. Schönherr
14 years ago
26a148eb9 sched: Reorder root_domain to remove 64 bit alignment padding ... Browse Code »

Reorder root_domain to remove 8 bytes of alignment padding on 64 bit
builds, this shrinks the size from 1736 to 1728 bytes, therefore using
one fewer cachelines.

Signed-off-by: Richard Kennedy
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1310726492.1977.5.camel@castor.rsk
Signed-off-by: Ingo Molnar

Richard Kennedy
14 years ago
99bc52429 sched: Do not attempt to destroy uninitialized rt_bandwidth ... Browse Code »

If a task group is to be created and alloc_fair_sched_group() fails,
then the rt_bandwidth of the corresponding task group is not yet
initialized. The caller, sched_create_group(), starts a clean up
procedure which calls free_rt_sched_group() which unconditionally
destroys the not yet initialized rt_bandwidth.

This crashes or hangs the system in lock_hrtimer_base(): UP systems
dereference a NULL pointer, while SMP systems loop endlessly on a
condition that cannot become true.

This patch simply avoids the destruction of rt_bandwidth when the
initialization code path was not reached.

(This was discovered by accident with a custom kernel modification.)

Signed-off-by: Bianca Lutz
Signed-off-by: Jan Schoenherr
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1310580816-10861-7-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar

Bianca Lutz
14 years ago
5f817d676 sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED' ... Browse Code »

This patch fixes a typo located in a comment.

Signed-off-by: Jan Schoenherr
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1310580816-10861-2-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar

Jan Schoenherr
14 years ago
9763b67fb sched, cgroup: Optimize load_balance_fair() ... Browse Code »

Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this
achieves that load_balance_fair() only iterates those task_groups that
actually have tasks on busiest, and that we iterate bottom-up, trying to
move light groups before the heavier ones.

No idea if it will actually work out to be beneficial in practice, does
anybody have a cgroup workload that might show a difference one way or
the other?

[ Also move update_h_load to sched_fair.c, loosing #ifdef-ery ]

Signed-off-by: Peter Zijlstra
Reviewed-by: Paul Turner
Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago
994bf1c92 Merge branch 'linus' into sched/core ... Browse Code »

Merge reason: pick up the latest scheduler fixes.

Signed-off-by: Ingo Molnar

Ingo Molnar
14 years ago

21 Jul, 2011

6 commits

cf6ace16a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kerne… ... Browse Code »

…l/git/tip/linux-2.6-tip

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
signal: align __lock_task_sighand() irq disabling and RCU
softirq,rcu: Inform RCU of irq_exit() activity
sched: Add irq_{enter,exit}() to scheduler_ipi()
rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
rcu: Streamline code produced by __rcu_read_unlock()
rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special
rcu: decrease rcu_report_exp_rnp coupling with scheduler

Linus Torvalds
14 years ago
d1e9ae47a Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulm… ... Browse Code »

…ck/linux-2.6-rcu into core/urgent

Ingo Molnar
14 years ago
c5d753a55 sched: Add irq_{enter,exit}() to scheduler_ipi() ... Browse Code »

Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.

Now that we do do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things.

This affects things like timekeeping, NO_HZ and RCU, basically
everything with a hook in irq_enter/exit.

Explicit examples of things going wrong are:

sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
a new reading from GTOD and TSC. Without this
callback, time is stuck in the past.

RCU -- needs in_irq() to work in order to avoid some nasty deadlocks

Signed-off-by: Peter Zijlstra
Signed-off-by: Paul E. McKenney

Peter Zijlstra
14 years ago
d110235d2 sched: Avoid creating superfluous NUMA domains on non-NUMA systems ... Browse Code »

When creating sched_domains, stop when we've covered the entire
target span instead of continuing to create domains, only to
later find they're redundant and throw them away again.

This avoids single node systems from touching funny NUMA
sched_domain creation code and reduces the risks of the new
SD_OVERLAP code.

Requested-by: Linus Torvalds
Signed-off-by: Peter Zijlstra
Cc: Anton Blanchard
Cc: mahesh@linux.vnet.ibm.com
Cc: benh@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago
e3589f6c8 sched: Allow for overlapping sched_domain spans ... Browse Code »

Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.

This is needed for machines with more than 16 nodes, because
sched_domain_node_span() will generate a node mask from the
16 nearest nodes without regard if these masks have any overlap.

Currently sched_domains have a sched_group that maps to their child
sched_domain span, and since there is no overlap we share the
sched_group between the sched_domains of the various CPUs. If however
there is overlap, we would need to link the sched_group list in
different ways for each cpu, and hence sharing isn't possible.

In order to solve this, allocate private sched_groups for each CPU's
sched_domain but have the sched_groups share a sched_group_power
structure such that we can uniquely track the power.

Reported-and-tested-by: Anton Blanchard
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago
9c3f75cbd sched: Break out cpu_power from the sched_group structure ... Browse Code »

In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.

Reported-and-tested-by: Anton Blanchard
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago

16 Jul, 2011

1 commit

c64be78ff sched: Fix 32bit race ... Browse Code »

Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads
on 32bit") forgot to initialize min_vruntime_copy which could lead to
an infinite while loop in task_waking_fair() under some circumstances
(early boot, lucky timing).

[ This bug was also reported by others that blamed it on the RCU
initialization problems ]

Reported-and-tested-by: Bruno Wolff III
Signed-off-by: Peter Zijlstra
Reviewed-by: Paul E. McKenney
Signed-off-by: Linus Torvalds

Peter Zijlstra
14 years ago

14 Jul, 2011

2 commits

095c0aa83 sched: adjust scheduler cpu power for stolen time ... Browse Code »

This patch makes update_rq_clock() aware of steal time.
The mechanism of operation is not different from irq_time,
and follows the same principles. This lives in a CONFIG
option itself, and can be compiled out independently of
the rest of steal time reporting. The effect of disabling it
is that the scheduler will still report steal time (that cannot be
disabled), but won't use this information for cpu power adjustments.

Everytime update_rq_clock_task() is invoked, we query information
about how much time was stolen since last call, and feed it into
sched_rt_avg_update().

Although steal time reporting in account_process_tick() keeps
track of the last time we read the steal clock, in prev_steal_time,
this patch do it independently using another field,
prev_steal_time_rq. This is because otherwise, information about time
accounted in update_process_tick() would never reach us in update_rq_clock().

Signed-off-by: Glauber Costa
Acked-by: Rik van Riel
Acked-by: Peter Zijlstra
Tested-by: Eric B Munson
CC: Jeremy Fitzhardinge
CC: Anthony Liguori
Signed-off-by: Avi Kivity

Glauber Costa
14 years ago
e6e6685ac KVM guest: Steal time accounting ... Browse Code »

This patch accounts steal time time in account_process_tick.
If one or more tick is considered stolen in the current
accounting cycle, user/system accounting is skipped. Idle is fine,
since the hypervisor does not report steal time if the guest
is halted.

Accounting steal time from the core scheduler give us the
advantage of direct acess to the runqueue data. In a later
opportunity, it can be used to tweak cpu power and make
the scheduler aware of the time it lost.

[avi: doesn't exist on many archs]

Signed-off-by: Glauber Costa
Acked-by: Rik van Riel
Acked-by: Peter Zijlstra
Tested-by: Eric B Munson
CC: Jeremy Fitzhardinge
CC: Anthony Liguori
Signed-off-by: Avi Kivity

Glauber Costa
14 years ago

11 Jul, 2011

1 commit

b7e9c223b Merge branch 'master' into for-next ... Browse Code »

Sync with Linus' tree to be able to apply pending patches that
are based on newer code already present upstream.

Jiri Kosina
14 years ago

09 Jul, 2011

1 commit

d8bf4ca9c rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check ... Browse Code »

Since ca5ecddf (rcu: define __rcu address space modifier for sparse)
rcu_dereference_check use rcu_read_lock_held as a part of condition
automatically so callers do not have to do that as well.

Signed-off-by: Michal Hocko
Acked-by: Paul E. McKenney
Signed-off-by: Jiri Kosina

Michal Hocko
14 years ago

08 Jul, 2011

1 commit

732375c6a plist: Remove the need to supply locks to plist heads ... Browse Code »

This was legacy code brought over from the RT tree and
is no longer necessary.

Signed-off-by: Dima Zavin
Acked-by: Thomas Gleixner
Cc: Daniel Walker
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Andi Kleen
Cc: Lai Jiangshan
Link: http://lkml.kernel.org/r/1310084879-10351-2-git-send-email-dima@android.com
Signed-off-by: Ingo Molnar

Dima Zavin
14 years ago

01 Jul, 2011

5 commits

1ecc818c5 Merge branch 'sched/core-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/fr… ... Browse Code »

…ederic/random-tracing into sched/core

Ingo Molnar
14 years ago
a8b0ca17b perf: Remove the nmi parameter from the swevent and overflow interface ... Browse Code »

The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.

For the various event classes:

- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.

As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.

Signed-off-by: Peter Zijlstra
Cc: Michael Cree
Cc: Will Deacon
Cc: Deng-Cheng Zhu
Cc: Anton Blanchard
Cc: Eric B Munson
Cc: Heiko Carstens
Cc: Paul Mundt
Cc: David S. Miller
Cc: Frederic Weisbecker
Cc: Jason Wessel
Cc: Don Zickus
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago
307bf9803 sched: Simplify mutex_spin_on_owner() ... Browse Code »

It does not make sense to rcu_read_lock/unlock() in every loop
iteration while spinning on the mutex.

Move the rcu protection outside the loop. Also simplify the
return path to always check for lock->owner == NULL which
meets the requirements of both owner changed and need_resched()
caused loop exits.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1106101458350.11814@ionos
Signed-off-by: Ingo Molnar

Thomas Gleixner
14 years ago
36b2e922b Merge commit 'v3.0-rc5' into sched/core ... Browse Code »

Merge reason: Move to a (much) newer base.

Signed-off-by: Ingo Molnar

Ingo Molnar
14 years ago
cd62287e3 sched, cgroups: Fix MIN_SHARES on 64-bit boxen ... Browse Code »

Commit c8b28116 ("sched: Increase SCHED_LOAD_SCALE resolution")
intended to have no user-visible effect, but allows setting
cpu.shares to < MIN_SHARES, which the user then sees.

Signed-off-by: Mike Galbraith
Signed-off-by: Peter Zijlstra
Cc: Nikhil Rao
Link: http://lkml.kernel.org/r/1307192600.8618.3.camel@marge.simson.net
Signed-off-by: Ingo Molnar

Mike Galbraith
14 years ago

23 Jun, 2011

1 commit

d902db1eb sched: Generalize sleep inside spinlock detection ... Browse Code »

The sleeping inside spinlock detection is actually used
for more general sleeping inside atomic sections
debugging: preemption disabled, rcu read side critical
sections, interrupts, interrupt disabled, etc...

Change the name of the config and its help section to
reflect its more general role.

Signed-off-by: Frederic Weisbecker
Acked-by: Paul E. McKenney
Acked-by: Randy Dunlap
Cc: Peter Zijlstra
Cc: Ingo Molnar

Frederic Weisbecker
14 years ago

10 Jun, 2011

1 commit

bdd4e85dc sched: Isolate preempt counting in its own config option ... Browse Code »

Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec
of preempt count offset independently. So that the offset
can be updated by preempt_disable() and preempt_enable()
even without the need for CONFIG_PREEMPT beeing set.

This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP working
with !CONFIG_PREEMPT where it currently doesn't detect
code that sleeps inside explicit preemption disabled
sections.

Signed-off-by: Frederic Weisbecker
Acked-by: Paul E. McKenney
Cc: Ingo Molnar
Cc: Peter Zijlstra

Frederic Weisbecker
14 years ago

08 Jun, 2011

1 commit

2da8c8bc4 sched: Remove pointless in_atomic() definition check ... Browse Code »

It's really supposed to be defined here. If it's not then
we actually want the build to crash so that we know it,
and not keep it silent.

Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra

Frederic Weisbecker
14 years ago

07 Jun, 2011

1 commit

6c6c54e18 sched: Fix/clarify set_task_cpu() locking rules ... Browse Code »

Sergey reported a CONFIG_PROVE_RCU warning in push_rt_task where
set_task_cpu() was called with both relevant rq->locks held, which
should be sufficient for running tasks since holding its rq->lock
will serialize against sched_move_task().

Update the comments and fix the task_group() lockdep test.

Reported-and-tested-by: Sergey Senozhatsky
Cc: Oleg Nesterov
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1307115427.2353.3456.camel@twins
Signed-off-by: Ingo Molnar

Peter Zijlstra
14 years ago

03 Jun, 2011

1 commit

e197f094b Merge branch 'unlikely/sched' of git://git.kernel.org/pub/scm/linux/kernel/git/r… ... Browse Code »

…ostedt/linux-2.6-trace into sched/urgent

Ingo Molnar
14 years ago