16 Jan, 2015

3 commits

  • commit fd7de1e8d5b2b2b35e71332fafb899f584597150 upstream.

    Locklessly doing is_idle_task(rq->curr) is only okay because of
    RCU protection. The older variant of the broken code checked
    rq->curr == rq->idle instead and therefore didn't need RCU.

    Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
    Signed-off-by: Andy Lutomirski
    Reviewed-by: Chuansheng Liu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 269ad8015a6b2bb1cf9e684da4921eb6fa0a0c88 upstream.

    The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is <= 0.
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     
  • commit 6a503c3be937d275113b702e0421e5b0720abe8a upstream.

    According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration is performed by doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80 + 40/80 + 120/170 = 1.5809 < 2.

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.
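
    As a quick sanity check on the utilization figure quoted above, the
    following standalone snippet just sums the (runtime, period) pairs of the
    example taskset; it is plain arithmetic, not kernel code.

    #include <stdio.h>

    int main(void)
    {
            /* (runtime, period) pairs of the example taskset above */
            const double tasks[][2] = { { 30, 80 }, { 40, 80 }, { 120, 170 } };
            const int ncpus = 2;
            double u = 0.0;

            for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
                    u += tasks[i][0] / tasks[i][1];

            /* Prints ~1.5809: below the number of CPUs, so global EDF should
             * be able to keep response times bounded for this taskset. */
            printf("total utilization = %.4f (ncpus = %d)\n", u, ncpus);
            return 0;
    }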

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     

04 Dec, 2014

1 commit

  • It appears that some SCHEDULE_USER (asm for schedule_user) callers
    in arch/x86/kernel/entry_64.S are called from RCU kernel context,
    and schedule_user will return in RCU user context. This causes RCU
    warnings and possible failures.

    This is intended to be a minimal fix suitable for 3.18.

    Reported-and-tested-by: Dave Jones
    Cc: Oleg Nesterov
    Cc: Frédéric Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Nov, 2014

1 commit

    Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime():

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                      update_rq_clock(rq);
                      p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N' and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris' observation that this happens faster when he runs the fork bomb
    makes sense, as the fork bomb will kick migration threads more often, so
    the probability of hitting the issue increases.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc, we have
    other means to hit the idle case.
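
    For illustration only, the failure mode boils down to an ops table with
    an optional callback left NULL that a caller invokes unconditionally;
    the no-op entry is the shape of the fix. The structures and names below
    are made up, this is not the kernel's sched_class code.

    #include <stdio.h>

    struct class_ops {
            void (*update_curr)(void);      /* optional callback */
    };

    static void update_curr_fair(void) { puts("fair: stats updated"); }
    static void update_curr_stop(void) { /* nothing to do, but must exist */ }

    static void read_stats(const struct class_ops *ops)
    {
            ops->update_curr();             /* crashes if the callback is NULL */
    }

    int main(void)
    {
            struct class_ops fair = { .update_curr = update_curr_fair };
            struct class_ops stop = { .update_curr = update_curr_stop };  /* no-op instead of NULL */

            read_stats(&fair);
            read_stats(&stop);
            return 0;
    }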

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

16 Nov, 2014

3 commits

    Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in the Y time being smaller than the X time.

    A reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because at start, in thread_group_cputimer(), we
    initialize the cputimer's sum_exec_runtime with the threads' runtime not
    yet accounted, and then add the threads' runtime to the running cputimer
    again on the scheduler tick, making its sum_exec_runtime bigger than the
    actual threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which assures we will have
    updated stats of the running threads, and on the next scheduler tick we
    will account only the runtime that elapsed since the cputimer start.
    That also assures we have a consistent state between the cpu times of
    individual threads and the cpu time of the process consisting of those
    threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <time.h>
    #include <pthread.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
            ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1) ;

            return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
            clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

            return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
            int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
            int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

            return after_i - before_i;
    }

    int main(void)
    {
            int result = 0;
            pthread_t th;

            pthread_barrier_init(&barrier, NULL, 2);

            if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
                    perror("pthread_create");
                    return 1;
            }

            pthread_barrier_wait(&barrier);

            /* The test. */
            struct timespec before, after, sleeptimeabs;
            int64_t sleepdiff, diffabs;
            const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

            /* The relative nanosleep. Not sure why this is needed, but its presence
               seems to make it easier to reproduce the problem. */
            if (do_nanosleep(0, &sleeptime) != 0) {
                    perror("clock_nanosleep");
                    return 1;
            }

            /* Get the current time. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
                    perror("clock_gettime[2]");
                    return 1;
            }

            /* Compute the absolute sleep time based on the current time. */
            uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
            sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
            sleeptimeabs.tv_nsec = nsec % 1000000000;

            /* Sleep for the computed time. */
            if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
                    perror("absolute clock_nanosleep");
                    return 1;
            }

            /* Get the time after the sleep. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
                    perror("clock_gettime[3]");
                    return 1;
            }

            /* The time after sleep should always be equal to or after the absolute sleep
               time passed to clock_nanosleep. */
            sleepdiff = tsdiff(&sleeptimeabs, &after);
            if (sleepdiff < 0) {
                    printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
                    result = 1;

                    printf("Before %llu.%09llu\n",
                           (unsigned long long)before.tv_sec,
                           (unsigned long long)before.tv_nsec);
                    printf("After %llu.%09llu\n",
                           (unsigned long long)after.tv_sec,
                           (unsigned long long)after.tv_nsec);
                    printf("Sleep %llu.%09llu\n",
                           (unsigned long long)sleeptimeabs.tv_sec,
                           (unsigned long long)sleeptimeabs.tv_nsec);
            }

            /* The difference between the timestamps taken before and after the
               clock_nanosleep call should be equal to or more than the duration of the
               sleep. */
            diffabs = tsdiff(&before, &after);
            if (diffabs < sleeptime.tv_nsec) {
                    printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
                    result = 1;
            }

            pthread_cancel(th);

            return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
    While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

      cpu_timer_sample_group()
        thread_group_cputimer()
          thread_group_cputime()
            times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    which would make the sample run ahead, making the sleep short.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive), we can end up migrating and selecting
    ourselves as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.
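
    The 'same lock twice' failure can be seen in miniature with an
    error-checking pthread mutex (a plain non-recursive lock would simply
    deadlock on the second acquisition). This is only an illustration of the
    failure mode, not the scheduler code; build with gcc -pthread.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
            pthread_mutex_t lock;
            pthread_mutexattr_t attr;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
            pthread_mutex_init(&lock, &attr);

            pthread_mutex_lock(&lock);
            /* "Selecting oneself as a swap target": take the same lock again. */
            int ret = pthread_mutex_lock(&lock);
            if (ret == EDEADLK)
                    puts("second lock of the same mutex rejected (EDEADLK)");

            pthread_mutex_unlock(&lock);
            pthread_mutex_destroy(&lock);
            pthread_mutexattr_destroy(&attr);
            return 0;
    }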

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access in this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
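
    A minimal sketch of the guard pattern being described, with made-up
    names: an index computed as 'level - 1' is only valid once at least one
    distance has been recorded, so bail out early while level is still zero.

    #include <stdio.h>

    #define MAX_LEVELS 8

    static int distances[MAX_LEVELS];

    static void init_numa_demo(int level)
    {
            if (!level)             /* non-NUMA system: nothing to index */
                    return;

            /* Safe now: level >= 1, so level - 1 is a valid index. */
            printf("max distance = %d\n", distances[level - 1]);
    }

    int main(void)
    {
            distances[0] = 10;
            init_numa_demo(0);      /* would have read distances[-1] without the guard */
            init_numa_demo(1);
            return 0;
    }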

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

04 Nov, 2014

1 commit

  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do not
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

28 Oct, 2014

8 commits

    1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower priority task. But we must always
    preempt a task of other classes.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl) several times and even fixed
    this function before, but only found this now... I suppose some performance
    tests may work better after this.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • preempt_schedule_context() does preempt_enable_notrace() at the end
    and this can call the same function again; exception_exit() is heavy
    and it is quite possible that need-resched is true again.

    1. Change this code to dec preempt_count() and check need_resched()
    by hand.

    2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
    the enable/disable dance around __schedule(). But in this case
    we need to move this code into sched/core.c.

    3. Cosmetic, but x86 forgets to declare this function. This doesn't
    really matter because it is only called by asm helpers; still, it
    makes sense to add the declaration into asm/preempt.h to match
    preempt_schedule().

    Reported-by: Sasha Levin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces the problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix race in task_scan_min (it depends on compiler behaviour).
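
    A toy model of the shape of the fix, with illustrative names and an
    illustrative formula: read the tunable once into a local (so the
    compiler cannot re-read it between the check and the division) and clamp
    it, so a concurrent write of zero can no longer cause a divide error.

    #include <stdio.h>

    static volatile unsigned int scan_size_mb = 256;   /* stands in for the sysctl */

    static unsigned int scan_period_demo(unsigned int windows)
    {
            unsigned int scan_size = scan_size_mb;     /* single read */

            if (scan_size == 0)
                    scan_size = 1;                     /* clamp: no division by zero */

            return windows * 1000 / scan_size;
    }

    int main(void)
    {
            scan_size_mb = 0;                          /* what "echo 0 > ..." did */
            printf("period = %u\n", scan_period_demo(3));
            return 0;
    }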

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    While offlining a node by hot-removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When a page fault occurs and the page is allocated from node 1,
    task_struct->numa_faults_buffer_memory[] of node 1 is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

       o numa_faults_buffer_memory[]          o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
              |     0     |     1     |
       ----------------------------------     ----------------------
       node 0 |     0     |     0     |       remote |     0     |
       node 1 |     0     |     1     |       local  |     1     |
       ----------------------------------     ----------------------

    2. node 1 is offlined by hot removing memory.

    3. When page fault occurs, fault_types[] is calculated by using
    p->numa_faults_buffer_memory[] of all online nodes in
    task_numa_placement(). But node 1 was offline by step 2. So
    the fault_types[] is calculated by using only
    p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
    are set to 0.

    4. The values (both 0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1. So the following division is
    calculated.

    static void update_task_scan_period(struct task_struct *p,
                                        unsigned long shared, unsigned long private)
    {
            ...
            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are set to 0, so a divide error
    occurs here.

    The divide error is a rare case because the trigger is node offlining.
    This patch always increments the denominator to avoid the divide error.
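
    A toy version of what 'always increments the denominator' means for the
    DIV_ROUND_UP() expression shown above; NUMA_PERIOD_SLOTS is given an
    illustrative value here and the inputs model the post-offline case from
    step 6.

    #include <stdio.h>

    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    #define NUMA_PERIOD_SLOTS   10          /* illustrative value */

    int main(void)
    {
            unsigned long shared = 0, private = 0;     /* both zero after the node went offline */

            /* The original expression divided by (private + shared) == 0.
             * Always adding 1 to the denominator keeps the division defined. */
            unsigned long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
                                               private + shared + 1);

            printf("ratio = %lu\n", ratio);
            return 0;
    }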

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
    Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If the curr task is exiting, this may be a cause of use-after-free:

    task_numa_compare()                     do_exit()
        ...                                     current->flags |= PF_EXITING;
        ...                                     release_task()
        ...                                         ~~delayed_put_task_struct()~~
        ...                                     schedule()
        rcu_read_lock()                         ...
        cur = ACCESS_ONCE(dst_rq->curr)         ...
        ...                                     rq->curr = next;
        ...                                         context_switch()
        ...                                             finish_task_switch()
        ...                                                 put_task_struct()
        ...                                                     __put_task_struct()
        ...                                                         free_task_struct()
        task_numa_assign()                      ...
            get_task_struct()                   ...

    As noted by Oleg, rcu_read_lock() does not help here: the final
    put_task_struct() in finish_task_switch() is not RCU-deferred, so the
    last reference can go away while task_numa_compare() is still looking
    at the task.
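
    The usual defence against this class of race is the 'take a reference
    only if the refcount is still non-zero' pattern. The sketch below shows
    that pattern with C11 atomics and made-up names; it is an illustration
    of the idea, not the kernel's actual fix.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct obj {
            atomic_int refcount;
    };

    static bool obj_tryget(struct obj *o)
    {
            int old = atomic_load(&o->refcount);

            do {
                    if (old == 0)
                            return false;   /* last reference already dropped */
            } while (!atomic_compare_exchange_weak(&o->refcount, &old, old + 1));

            return true;
    }

    int main(void)
    {
            struct obj live, dead;

            atomic_init(&live.refcount, 1);
            atomic_init(&dead.refcount, 0);

            printf("live: %s, dead: %s\n",
                   obj_tryget(&live) ? "got ref" : "skipped",
                   obj_tryget(&dead) ? "got ref" : "skipped");
            return 0;
    }
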
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may experience a race condition against an
    enqueue_dl_entity() called from rt_mutex_setprio(). With his own
    words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task()
    throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
    The race may happen when somebody is changing the task_group of a forking
    task. The child's cgroup is the same as the parent's after dup_task_struct()
    (it's just a memory copy). Also, cfs_rq and rt_rq are the same as the
    parent's.

    But if the parent changes its task_group before cgroup_post_fork() is
    called, we do not reflect this situation on the child. The child's cfs_rq
    and rt_rq remain the same, while the child's task_group changes in
    cgroup_post_fork().

    To fix this we introduce a fork() method, which calls sched_move_task()
    directly. This function changes sched_task_group appropriately (its logic
    also has no problem with freshly created tasks, so we don't need to
    introduce anything special; we can just use it).

    Possibly, this also solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

13 Oct, 2014

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     
  • Pull core locking updates from Ingo Molnar:
    "The main updates in this cycle were:

    - mutex MCS refactoring finishing touches: improve comments, refactor
    and clean up code, reduce debug data structure footprint, etc.

    - qrwlock finishing touches: remove old code, self-test updates.

    - small rwsem optimization

    - various smaller fixes/cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/lockdep: Revert qrwlock recusive stuff
    locking/rwsem: Avoid double checking before try acquiring write lock
    locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
    locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
    locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
    locking/semaphore: Resolve some shadow warnings
    locking/selftest: Support queued rwlock
    locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
    locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
    locking/Documentation: Update locking/mutex-design.txt disadvantages
    locking/Documentation: Move locking related docs into Documentation/locking/
    locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
    locking/mutexes: Refactor optimistic spinning code
    locking/mcs: Remove obsolete comment
    locking/mutexes: Document quick lock release when unlocking
    locking/mutexes: Standardize arguments in lock/unlock slowpaths
    locking: Remove deprecated smp_mb__() barriers

    Linus Torvalds
     

10 Oct, 2014

1 commit

  • 1. vma_policy_mof(task) is simply not safe unless task == current,
    it can race with do_exit()->mpol_put(). Remove this arg and update
    its single caller.

    2. vma can not be NULL, remove this check and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Oct, 2014

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - fix an NFSv4.1 state renewal regression
    - fix open/lock state recovery error handling
    - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    - fix statd when reconnection fails
    - don't wake tasks during connection abort
    - don't start reboot recovery if lease check fails
    - fix duplicate proc entries

    Features:
    - pNFS block driver fixes and clean ups from Christoph
    - More code cleanups from Anna
    - Improve mmap() writeback performance
    - Replace use of PF_TRANS with a more generic mechanism for avoiding
    deadlocks in nfs_release_page"

    * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
    NFSv4.1: Fix an NFSv4.1 state renewal regression
    NFSv4: fix open/lock state recovery error handling
    NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    NFS: Fabricate fscache server index key correctly
    SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
    NFSv3: Fix missing includes of nfs3_fs.h
    NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
    NFS: avoid waiting at all in nfs_release_page when congested.
    NFS: avoid deadlocks with loop-back mounted NFS filesystems.
    MM: export page_wakeup functions
    SCHED: add some "wait..on_bit...timeout()" interfaces.
    NFS: don't use STABLE writes during writeback.
    NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
    rpc: Add -EPERM processing for xs_udp_send_request()
    rpc: return sent and err from xs_sendpages()
    lockd: Try to reconnect if statd has moved
    SUNRPC: Don't wake tasks during connection abort
    Fixing lease renewal
    nfs: fix duplicate proc entries
    pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
    ...

    Linus Torvalds
     

03 Oct, 2014

4 commits

  • rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it
    is not enough. We should use either rcu_read_lock_sched() or preempt_disable().

    Reported-by: Sasha Levin
    Suggested-by: Peter Zijlstra
    Signed-off-by: Kirill Tkhai
    Fixes: 66339c31bc39 "sched: Use dl_bw_of() under RCU read lock"
    Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
    if this is necessary.

    Furthermore, a task of a higher priority class may be current on the dest
    rq; we shouldn't disturb it.

    Signed-off-by: Kirill Tkhai
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • On 32 bit systems cmpxchg cannot handle 64 bit values, so
    some additional magic is required to allow a 32 bit system
    with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.

    Make sure the correct cmpxchg function is used when doing
    an atomic swap of a cputime_t.
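
    For illustration, the kind of operation involved is an atomic 'never go
    backwards' update of a 64-bit time value. The C11 sketch below works
    even on 32-bit targets because the language runtime supplies the wide
    compare-exchange; the kernel has to pick the right cmpxchg variant
    explicitly, which is what this build fix is about. Names are
    illustrative.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint64_t vtime;

    static void advance_to(uint64_t now)
    {
            uint64_t old = atomic_load(&vtime);

            /* Only move forward; a failed CAS reloads 'old' and we retry. */
            while (old < now &&
                   !atomic_compare_exchange_weak(&vtime, &old, now))
                    ;
    }

    int main(void)
    {
            advance_to(100);
            advance_to(50);         /* must not move time backwards */
            advance_to(250);
            printf("vtime = %llu\n", (unsigned long long)atomic_load(&vtime));
            return 0;
    }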

    Reported-by: Arnd Bergmann
    Signed-off-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: linux390@de.ibm.com
    Cc: linux-arch@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...")
    sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
    but is only more loaded than others. This change has been introduced to ensure
    a better load balance in system that are not overloaded but as a side effect,
    it can also generate useless active migration between groups.

    Let's take the example of 3 tasks on a quad-core system. We will always have an
    idle core so the load balance will find a busiest group (core) whenever an ILB
    is triggered and it will force an active migration (once above the
    nr_balance_failed threshold) so the idle core becomes busy but another core
    will become idle. With the next ILB, the freshly idle core will try to pull the
    task of a busy CPU.
    The number of spurious active migrations is not so huge in a quad-core system
    because the ILB is not triggered so much. But it becomes significant as soon as
    you have more than one sched_domain level, like on a dual cluster of quad cores
    where the ILB is triggered every tick when you have more than 1 busy_cpu.

    We need to ensure that the migration generates a real improvement and will not
    only move the avg_load imbalance onto another CPU.

    Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such a use
    case was ensured by the following test in f_b_g:

      if ((local->idle_cpus < busiest->idle_cpus) &&
          busiest->sum_nr_running <= busiest->group_weight)

    This patch modifies the condition to take into account the situation where the
    busiest group is not overloaded: if the diff between the number of idle cpus in
    the two groups is less than or equal to 1 and the busiest group is not
    overloaded, moving a task will not improve the load balance but just move it.
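
    Translated directly into a small standalone predicate (field and
    function names are illustrative, not the kernel's), the new filtering
    condition looks roughly like this:

    #include <stdbool.h>
    #include <stdio.h>

    struct group_stats {
            int idle_cpus;
            int sum_nr_running;
            int group_weight;
    };

    static bool migration_is_useless(const struct group_stats *local,
                                     const struct group_stats *busiest)
    {
            /* Busiest group not overloaded and (almost) as many idle CPUs as
             * the local group: moving a task only shifts the imbalance. */
            return busiest->sum_nr_running <= busiest->group_weight &&
                   local->idle_cpus - busiest->idle_cpus <= 1;
    }

    int main(void)
    {
            struct group_stats local   = { .idle_cpus = 2, .sum_nr_running = 2, .group_weight = 4 };
            struct group_stats busiest = { .idle_cpus = 1, .sum_nr_running = 3, .group_weight = 4 };

            printf("skip active migration: %s\n",
                   migration_is_useless(&local, &busiest) ? "yes" : "no");
            return 0;
    }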

    A test with sysbench on a dual cluster of quad cores gives the following
    results:

    command: sysbench --test=cpu --num-threads=5 --max-time=5 run

    HZ is 200, which means that 1000 ticks have fired during the test.

    With Mainline, perf gives the following figures:

    Samples: 727 of event 'sched:sched_migrate_task'
    Event count (approx.): 727
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    12.52% migration/1 [unknown] [.] 00000000
    12.52% migration/5 [unknown] [.] 00000000
    12.52% migration/7 [unknown] [.] 00000000
    12.10% migration/6 [unknown] [.] 00000000
    11.83% migration/0 [unknown] [.] 00000000
    11.83% migration/3 [unknown] [.] 00000000
    11.14% migration/4 [unknown] [.] 00000000
    10.87% migration/2 [unknown] [.] 00000000
    2.75% sysbench [unknown] [.] 00000000
    0.83% swapper [unknown] [.] 00000000
    0.55% ktps65090charge [unknown] [.] 00000000
    0.41% mmcqd/1 [unknown] [.] 00000000
    0.14% perf [unknown] [.] 00000000

    With this patch, perf gives the following figures

    Samples: 20 of event 'sched:sched_migrate_task'
    Event count (approx.): 20
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    80.00% sysbench [unknown] [.] 00000000
    10.00% swapper [unknown] [.] 00000000
    5.00% ktps65090charge [unknown] [.] 00000000
    5.00% migration/1 [unknown] [.] 00000000

    Signed-off-by: Vincent Guittot
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

25 Sep, 2014

1 commit

  • In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
    sched: Allow wait_on_bit_action() functions to support a timeout

    I suggested that a "wait_on_bit_timeout()" interface would not meet my
    need. This isn't true - I was just over-engineering.

    Including a 'private' field in wait_bit_key instead of a focused
    "timeout" field was just premature generalization. If some other
    use is ever found, it can be generalized or added later.

    So this patch renames "private" to "timeout", with the meaning "stop
    waiting when 'jiffies' reaches or passes 'timeout'",
    and adds two of the many possible wait..bit..timeout() interfaces:

    wait_on_page_bit_killable_timeout(), which is the one I want to use,
    and out_of_line_wait_on_bit_timeout() which is a reasonably general
    example. Others can be added as needed.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: NeilBrown
    Acked-by: Ingo Molnar
    Signed-off-by: Trond Myklebust

    NeilBrown
     

24 Sep, 2014

12 commits

    Some time ago PREEMPT_NEED_RESCHED was implemented,
    so the rescheduling technique is a little more involved now.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Probability of use-after-free isn't zero in this place.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v3.14+
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    Nothing is locked there, so the label's name only confuses a reader.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
    Probability of use-after-free isn't zero here.

    Also add lockdep assert into dl_bw_cpus().

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v3.14+
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    Combine the two branches which do the same thing.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Kirill found that there's a subtle race in the
    __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the
    entire exception because neither arch that uses it seems to actually
    still require it.

    Boot tested on mips64el (qemu) only.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Kirill Tkhai
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Fenghua Yu
    Cc: James Hogan
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul Burton
    Cc: Qais Yousef
    Cc: Ralf Baechle
    Cc: Tony Luck
    Cc: oleg@redhat.com
    Cc: linux@roeck-us.net
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do
    not need to disable irqs, and they are already disabled by the caller.

    And afaics this lock buys nothing, we can rely on rcu_read_lock().
    In this case it makes sense to also move rcu_read_lock/unlock from
    the caller to print_rq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. read_lock(tasklist_lock) does not need to disable irqs.

    2. ->mm != NULL is a common mistake, use PF_KTHREAD.

    3. The second ->mm check can be simply removed.

    4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) +
    __task_rq_lock().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • tg_has_rt_tasks() wants to find an RT task in this task_group, but
    task_rq(p)->rt.tg wrongly checks the root rt_rq.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
    The code in find_idlest_cpu() looks for the CPU with the smallest load.
    However, if multiple CPUs are idle, the first idle CPU is selected
    irrespective of the depth of its idle state.

    Among the idle CPUs we should pick the one with the shallowest idle
    state or, if all idle CPUs are in the same state, the latest to have gone
    idle. The latter applies even when cpuidle is configured out.
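
    A toy selection loop matching that criterion, with made-up structures:
    among idle CPUs, prefer the smallest exit latency (shallowest state)
    and, on a tie, the most recent idle timestamp.

    #include <stdint.h>
    #include <stdio.h>

    struct cpu_idle_info {
            int      cpu;
            unsigned exit_latency_us;       /* depth of the idle state */
            uint64_t idle_stamp;            /* when the CPU went idle */
    };

    static int pick_idle_cpu(const struct cpu_idle_info *cpus, int n)
    {
            int best = -1;

            for (int i = 0; i < n; i++) {
                    if (best < 0 ||
                        cpus[i].exit_latency_us < cpus[best].exit_latency_us ||
                        (cpus[i].exit_latency_us == cpus[best].exit_latency_us &&
                         cpus[i].idle_stamp > cpus[best].idle_stamp))
                            best = i;
            }
            return best < 0 ? -1 : cpus[best].cpu;
    }

    int main(void)
    {
            const struct cpu_idle_info idle_cpus[] = {
                    { .cpu = 1, .exit_latency_us = 200, .idle_stamp = 1000 },
                    { .cpu = 2, .exit_latency_us =  10, .idle_stamp =  500 },
                    { .cpu = 3, .exit_latency_us =  10, .idle_stamp =  900 },
            };

            /* Picks CPU 3: shallow state and most recently idle. */
            printf("chosen CPU: %d\n", pick_idle_cpu(idle_cpus, 3));
            return 0;
    }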

    This patch doesn't cover the following issues:

    - The idle exit latency of a CPU might be larger than the time needed
    to migrate the waking task to an already running CPU with sufficient
    capacity, and therefore performance would benefit from task packing
    in such case (in most cases task packing is about power saving).

    - Some idle states have a non-negligible and non-abortable entry latency
    which needs to run to completion before the exit latency can start.
    A concurrent patch series is making this info available to the cpuidle
    core. Once available, the entry latency with the idle timestamp could
    determine when the exit latency may be effective.

    Those issues will be handled in due course. In the mean time, what
    is implemented here should improve things already compared to the current
    state of affairs.

    Based on an initial patch from Daniel Lezcano.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Daniel Lezcano
    Cc: "Rafael J. Wysocki"
    Cc: Linus Torvalds
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • When the cpu enters idle, it stores the cpuidle state pointer in its
    struct rq instance which in turn could be used to make a better decision
    when balancing tasks.

    As soon as the cpu exits its idle state, the struct rq reference is
    cleared.

    There are a couple of situations where the idle state pointer could be changed
    while it is being consulted:

    1. For x86/acpi with dynamic c-states, when a laptop switches from battery
    to AC, that could result in removing the deeper idle state. The acpi driver
    triggers:
      'acpi_processor_cst_has_changed'
        'cpuidle_pause_and_lock'
          'cpuidle_uninstall_idle_handler'
            'kick_all_cpus_sync'

    All cpus will exit their idle state and the pointed object will be set to
    NULL.

    2. The cpuidle driver is unloaded. Logically that could happen but not
    in practice because the drivers are always compiled in and 95% of them are
    not coded to unregister themselves. In any case, the unloading code must
    call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
    leading to 'kick_all_cpus_sync' as mentioned above.

    A race can happen if we use the pointer and then one of these two scenarios
    occurs at the same moment.

    In order to be safe, the idle state pointer stored in the rq must be
    used inside a rcu_read_lock section where we are protected with the
    'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
    idle_get_state() and idle_put_state() accessors should be used to that
    effect.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
    Users can perform clustered scheduling using the cpuset facility.
    After an exclusive cpuset is created, task migrations happen only
    between CPUs belonging to the same cpuset. Inter-cpuset migrations
    can only happen when the user requests so, by moving a task between
    different cpusets. This behaviour is broken in SCHED_DEADLINE, as
    currently spurious inter-cpuset migrations may happen without user
    intervention.

    This patch fixes the problem (and shuffles the code a bit to improve
    clarity).

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: raistlin@linux.it
    Cc: michael@amarulasolutions.com
    Cc: fchecconi@gmail.com
    Cc: daniel.wagner@bmw-carit.de
    Cc: vincent@legout.info
    Cc: luca.abeni@unitn.it
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli