11 Dec, 2014

1 commit

  • rcu_read_lock() cannot protect p->real_parent if release_task(p) was
    already called, so change sched_show_task() to check pid_alive() like
    other users do.

    Note: we need some helpers to clean up code like this. And it seems
    that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe either.
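
    A minimal sketch of the intended pattern (hedged; the exact surrounding
    code in sched_show_task() may differ):

        unsigned long ppid = 0;

        rcu_read_lock();
        /* release_task() may already have run, so guard the parent lookup */
        if (pid_alive(p))
                ppid = task_pid_nr(rcu_dereference(p->real_parent));
        rcu_read_unlock();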

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Acked-by: Peter Zijlstra (Intel)
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

10 Dec, 2014

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

    This instruments might_sleep() checks to catch places that nest
    blocking primitives - such as mutex usage in a wait loop. Such
    bugs can result in hard to debug races/hangs.

    Another category of invalid nesting that this facility will detect
    is the calling of blocking functions from within schedule() ->
    sched_submit_work() -> blk_schedule_flush_plug().

    There's some potential for false positives (if secondary blocking
    primitives themselves are not ready yet for this facility), but the
    kernel will warn once about such bugs per bootup, so the warning
    isn't much of a nuisance.

    This feature comes with a number of fixes, for problems uncovered
    with it, so no messages are expected normally.

    - Another round of sched/numa optimizations and refinements, for
    CONFIG_NUMA_BALANCING=y.

    - Another round of sched/dl fixes and refinements.

    Plus various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched: Add missing rcu protection to wake_up_all_idle_cpus
    sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
    sched/numa: Init numa balancing fields of init_task
    sched/deadline: Remove unnecessary definitions in cpudeadline.h
    sched/cpupri: Remove unnecessary definitions in cpupri.h
    sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
    sched/fair: Fix stale overloaded status in the busiest group finding logic
    sched: Move p->nr_cpus_allowed check to select_task_rq()
    sched/completion: Document when to use wait_for_completion_io_*()
    sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
    sched/fair: Kill task_struct::numa_entry and numa_group::task_list
    sched: Refactor task_struct to use numa_faults instead of numa_* pointers
    sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
    sched/deadline: Reschedule from switched_from_dl() after a successful pull
    sched/deadline: Push task away if the deadline is equal to curr during wakeup
    sched/deadline: Add deadline rq status print
    sched/deadline: Fix artificial overrun introduced by yield_task_dl()
    sched/rt: Clean up check_preempt_equal_prio()
    sched/core: Use dl_bw_of() under rcu_read_lock_sched()
    sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "These are the main changes in this cycle:

    - Streamline RCU's use of per-CPU variables, shifting from "cpu"
    arguments to functions to "this_"-style per-CPU variable
    accessors.

    - signal-handling RCU updates.

    - real-time updates.

    - torture-test updates.

    - miscellaneous fixes.

    - documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    rcu: Fix FIXME in rcu_tasks_kthread()
    rcu: More info about potential deadlocks with rcu_read_unlock()
    rcu: Optimize cond_resched_rcu_qs()
    rcu: Add sparse check for RCU_INIT_POINTER()
    documentation: memory-barriers.txt: Correct example for reorderings
    documentation: Add atomic_long_t to atomic_ops.txt
    documentation: Additional restriction for control dependencies
    documentation: Document RCU self test boot params
    rcutorture: Fix rcu_torture_cbflood() memory leak
    rcutorture: Remove obsolete kversion param in kvm.sh
    rcutorture: Remove stale test configurations
    rcutorture: Enable RCU self test in configs
    rcutorture: Add early boot self tests
    torture: Run Linux-kernel binary out of results directory
    cpu: Avoid puts_pending overflow
    rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
    rcu: Remove "cpu" argument to rcu_prepare_for_idle()
    rcu: Remove "cpu" argument to rcu_needs_cpu()
    rcu: Remove "cpu" argument to rcu_note_context_switch()
    rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
    ...

    Linus Torvalds
     

08 Dec, 2014

1 commit

  • Locklessly doing is_idle_task(rq->curr) is only okay because of
    RCU protection. The older variant of the broken code checked
    rq->curr == rq->idle instead and therefore didn't need RCU.

    Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
    Signed-off-by: Andy Lutomirski
    Reviewed-by: Chuansheng Liu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

04 Dec, 2014

1 commit

  • It appears that some SCHEDULE_USER (asm for schedule_user) callers
    in arch/x86/kernel/entry_64.S are called from RCU kernel context,
    and schedule_user will return in RCU user context. This causes RCU
    warnings and possible failures.

    This is intended to be a minimal fix suitable for 3.18.

    Reported-and-tested-by: Dave Jones
    Cc: Oleg Nesterov
    Cc: Frédéric Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Nov, 2014

1 commit

  • Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime():

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                      update_rq_clock(rq);
                      p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N', and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris' observation that this happens faster when he runs the fork bomb
    makes sense, as the fork bomb kicks the migration threads more often, so
    the probability of hitting the issue increases.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc we have
    other means to hit the idle case.
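
    A hedged sketch of the shape of the fix (an empty callback wired into
    the stop class; idle_task gets the same treatment):

        static void update_curr_stop(struct rq *rq)
        {
        }

        const struct sched_class stop_sched_class = {
                /* ... existing callbacks ... */
                .update_curr            = update_curr_stop,
        };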

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

20 Nov, 2014

1 commit


16 Nov, 2014

12 commits

  • Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
    the fair class.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Actually, cpudl_set() and cpudl_init() can never be used without
    CONFIG_SMP.

    Signed-off-by: pang.xunlei
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Steven Rostedt
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.org
    Signed-off-by: Ingo Molnar

    pang.xunlei
     
  • Actually, cpupri_set() and cpupri_init() can never be used without
    CONFIG_SMP.

    Signed-off-by: pang.xunlei
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Steven Rostedt
    Cc: Juri Lelli
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org
    Signed-off-by: Ingo Molnar

    pang.xunlei
     
  • Do not call dequeue_pushable_dl_task() when failing to push an eligible
    task, as it remains pushable, merely not at this particular moment.

    This mirrors the behavior of commit 311e800e16f6 ("sched, rt: Fix
    rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
    'true' on a busier sd") changes groups to be ranked in the order of
    overloaded > imbalance > other, and busiest group is picked according
    to this order.

    sgs->group_capacity_factor is used to check if the group is overloaded.

    When the child domain prefers tasks to go to siblings first, the
    sgs->group_capacity_factor will be set lower than one in order to
    move all the excess tasks away.

    However, the group overloaded status is not updated when
    sgs->group_capacity_factor is set to lower than one, which can cause us
    to miss the busiest group.

    This patch fixes it by updating the group overloaded status when the sg
    capacity factor is set to one, so that the busiest group is found
    accurately.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Vincent Guittot
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
    [ Fixed the changelog. ]
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
    This change will make fair.c, rt.c, and deadline.c all start with the
    same logic.
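
    A hedged sketch of the consolidated check (function signature assumed
    from kernel/sched/core.c of that era):

        static inline int
        select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
        {
                if (p->nr_cpus_allowed > 1)
                        cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);

                /* ... fall back to an allowed and online cpu, as before ... */
                return cpu;
        }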

    Suggested-and-Acked-by: Steven Rostedt
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • As discussed in [1], accounting IO is meant for blkio only. Document that,
    so driver authors won't use it for device I/O.

    [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470

    Signed-off-by: Wolfram Sang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: One Thousand Gnomes
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
    Signed-off-by: Ingo Molnar

    Wolfram Sang
     
  • Nobody iterates over numa_group::task_list; it just confuses readers.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in the Y time being smaller than the X time.

    A reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because, at start, thread_group_cputimer() initializes
    the cputimer's sum_exec_runtime with thread runtime that has not yet been
    accounted, and then the threads' runtime is added to the running cputimer
    again on the scheduler tick, making its sum_exec_runtime bigger than the
    actual threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which assures we have updated
    stats for the running threads, so on the next scheduler tick we account
    only the runtime that elapsed since the cputimer started. That also
    assures a consistent state between the cpu times of the individual
    threads and the cpu time of the process consisting of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
    ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1)
                    ;

            return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
            clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

            return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
            int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
            int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

            return after_i - before_i;
    }

    int main(void)
    {
            int result = 0;
            pthread_t th;

            pthread_barrier_init(&barrier, NULL, 2);

            if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
                    perror("pthread_create");
                    return 1;
            }

            pthread_barrier_wait(&barrier);

            /* The test. */
            struct timespec before, after, sleeptimeabs;
            int64_t sleepdiff, diffabs;
            const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

            /* The relative nanosleep. Not sure why this is needed, but its presence
               seems to make it easier to reproduce the problem. */
            if (do_nanosleep(0, &sleeptime) != 0) {
                    perror("clock_nanosleep");
                    return 1;
            }

            /* Get the current time. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
                    perror("clock_gettime[2]");
                    return 1;
            }

            /* Compute the absolute sleep time based on the current time. */
            uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
            sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
            sleeptimeabs.tv_nsec = nsec % 1000000000;

            /* Sleep for the computed time. */
            if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
                    perror("absolute clock_nanosleep");
                    return 1;
            }

            /* Get the time after the sleep. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
                    perror("clock_gettime[3]");
                    return 1;
            }

            /* The time after sleep should always be equal to or after the absolute
               sleep time passed to clock_nanosleep. */
            sleepdiff = tsdiff(&sleeptimeabs, &after);
            if (sleepdiff < 0) {
                    printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
                    result = 1;

                    printf("Before %llu.%09llu\n",
                           (unsigned long long)before.tv_sec,
                           (unsigned long long)before.tv_nsec);
                    printf("After %llu.%09llu\n",
                           (unsigned long long)after.tv_sec,
                           (unsigned long long)after.tv_nsec);
                    printf("Sleep %llu.%09llu\n",
                           (unsigned long long)sleeptimeabs.tv_sec,
                           (unsigned long long)sleeptimeabs.tv_nsec);
            }

            /* The difference between the timestamps taken before and after the
               clock_nanosleep call should be equal to or more than the duration of
               the sleep. */
            diffabs = tsdiff(&before, &after);
            if (diffabs < sleeptime.tv_nsec) {
                    printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
                    result = 1;
            }

            pthread_cancel(th);

            return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.
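
    A hedged sketch of the resulting sample for the CPUCLOCK_SCHED case
    (the extra task_delta_exec() term is simply dropped):

        case CPUCLOCK_SCHED:
                /* thread_group_cputime() already folded in the caller's
                 * in-flight runtime via task_sched_runtime() */
                *sample = cputime.sum_exec_runtime;
                break;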

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive) we can end up migrating and selecting
    ourselves as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    Zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access on this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
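
    A hedged sketch of the early exit (assuming 'level' counts the distinct
    NUMA distances discovered):

        /*
         * Nothing found (e.g. a non-NUMA box): bail out before the code
         * below indexes sched_domains_numa_distance[level - 1].
         */
        if (!level)
                return;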

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

04 Nov, 2014

14 commits

  • This patch simplifies task_struct by removing the four numa_* pointers
    that pointed into the same array and replacing them with the array
    pointer itself. By doing this, the size of task_struct is reduced by
    three ulong pointers (24 bytes on x86_64).

    A new parameter is added to the task_faults_idx function so that it can return
    an index to the correct offset, corresponding with the old precalculated
    pointers.

    All of the code in sched/ that depended on task_faults_idx and numa_* was
    changed in order to match the new logic.

    Signed-off-by: Iulia Manda
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: dave@stgolabs.net
    Cc: riel@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
    Signed-off-by: Ingo Molnar

    Iulia Manda
     
  • There are both UP and SMP versions of pull_dl_task(), so there is no
    need to check CONFIG_SMP in switched_from_dl().

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • In switched_from_dl() we have to issue a resched if we successfully
    pulled some task from other cpus. This patch also aligns the behavior
    with -rt.

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch pushes the task away if its deadline is equal to current's
    during wakeup, the same behavior as the rt class.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch adds a deadline rq status printout.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • The yield semantic of the deadline class is to reduce the remaining
    runtime to zero, after which update_curr_dl() will stop the task.
    However, the consumed bandwidth is subtracted from the yielding task's
    budget again even though it has already been set to zero, which leads to
    an artificial overrun. This patch fixes it by making sure we don't steal
    any more time from the task that yielded in update_curr_dl().
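
    A hedged sketch of the idea in update_curr_dl() (the dl_yielded flag
    name is assumed):

        /* a task that yielded has already given up the rest of its budget;
         * don't charge the elapsed time against it a second time */
        dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;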

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch checks in advance whether current can be pushed/pulled
    somewhere else, to make the logic clear; this is the same behavior as
    the dl class.

    - If current can't be migrated, rescheduling is useless; let's hope the
    task can move out.
    - If the task is migratable, let's not reschedule it and instead see if
    it can be pushed or pulled somewhere else.
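
    A hedged sketch of the resulting flow, using cpupri_find() as the
    "can it run somewhere else?" test:

        /* Current can't be migrated: rescheduling is useless, hope p moves out. */
        if (rq->curr->nr_cpus_allowed == 1 ||
            !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
                return;

        /* p is migratable: let the push/pull logic place it elsewhere. */
        if (p->nr_cpus_allowed != 1 &&
            cpupri_find(&rq->rd->cpupri, p, NULL))
                return;

        /* Otherwise requeue p and reschedule to try to push current away. */
        requeue_task_rt(rq, p, 1);
        resched_curr(rq);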

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • As per commit f10e00f4bf36 ("sched/dl: Use dl_bw_of() under
    rcu_read_lock_sched()"), dl_bw_of() has to be protected by
    rcu_read_lock_sched().

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • An idle cpu is idler than a non-idle cpu, so we needn't search for the
    least_loaded_cpu once we have found an idle cpu.
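
    A hedged sketch of the search loop (helper names assumed from
    kernel/sched/fair.c):

        for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
                if (idle_cpu(i)) {
                        /* ... track the shallowest idle cpu ... */
                } else if (shallowest_idle_cpu == -1) {
                        /* only bother with load if no idle cpu was seen */
                        load = weighted_cpuload(i);
                        if (load < min_load) {
                                min_load = load;
                                least_loaded_cpu = i;
                        }
                }
        }

        return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;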

    Signed-off-by: Yao Dongdong
    Reviewed-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
    Signed-off-by: Ingo Molnar

    Yao Dongdong
     
  • Currently used hrtimer_try_to_cancel() is racy:

    raw_spin_lock(&rq->lock)
    ... dl_task_timer raw_spin_lock(&rq->lock)
    ... raw_spin_lock(&rq->lock) ...
    switched_from_dl() ... ...
    hrtimer_try_to_cancel() ... ...
    switched_to_fair() ... ...
    ... ... ...
    ... ... ...
    raw_spin_unlock(&rq->lock) ... (acquired)
    ... ... ...
    ... ... ...
    do_exit() ... ...
    schedule() ... ...
    raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock)
    ... ... ...
    raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock)
    ... ... (acquired)
    put_task_struct() ... ...
    free_task_struct() ... ...
    ... ... raw_spin_unlock(&rq->lock)
    ... (acquired) ...
    ... ... ...
    ... (use after free) ...

    So, let's implement a 100% guaranteed way to cancel the timer and let's
    be sure we are safe even in very unlikely situations.

    rq unlocking does not limit the area of switched_from_dl() use, because
    this has already been possible in pull_dl_task() below.

    Let's consider the safety of this unlocking. The new code in the patch
    comes into play when hrtimer_try_to_cancel() fails. This means the
    callback is running. In this case hrtimer_cancel() just waits until the
    callback is finished. Two cases are possible:

    1) Since we are in switched_from_dl(), the new class is not dl_sched_class
    and the new prio is not less than MAX_DL_PRIO. So, the callback returns
    early; it's right after the !dl_task() check. After that hrtimer_cancel()
    returns too.

    The above is:

    raw_spin_lock(rq->lock); ...
    ... dl_task_timer()
    ... raw_spin_lock(rq->lock);
    switched_from_dl() ...
    hrtimer_try_to_cancel() ...
    raw_spin_unlock(rq->lock); ...
    hrtimer_cancel() ...
    ... raw_spin_unlock(rq->lock);
    ... return HRTIMER_NORESTART;
    ... ...
    raw_spin_lock(rq->lock); ...

    2) But the below is also possible:
    dl_task_timer()
    raw_spin_lock(rq->lock);
    ...
    raw_spin_unlock(rq->lock);
    raw_spin_lock(rq->lock); ...
    switched_from_dl() ...
    hrtimer_try_to_cancel() ...
    ... return HRTIMER_NORESTART;
    raw_spin_unlock(rq->lock); ...
    hrtimer_cancel(); ...
    raw_spin_lock(rq->lock); ...

    In this case hrtimer_cancel() returns immediately. Very unlikely case,
    just to mention.

    Nobody can manipulate the task, because check_class_changed() is
    always called with pi_lock locked. Nobody can force the task to
    participate in (concurrent) priority inheritance schemes (the same reason).

    All concurrent task operations require pi_lock, which is held by us.
    No deadlocks with dl_task_timer() are possible, because it returns
    right after !dl_task() check (it does nothing).

    If a new dl task arrives while the rq is unlocked, we simply no longer
    have to do pull_dl_task() in switched_from_dl().
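
    A hedged sketch of the guaranteed-cancel sequence described above:

        if (hrtimer_try_to_cancel(&p->dl.dl_timer) == -1) {
                /* The callback is running: drop rq->lock so it can finish,
                 * wait for it, then retake the lock. */
                raw_spin_unlock(&rq->lock);
                hrtimer_cancel(&p->dl.dl_timer);
                raw_spin_lock(&rq->lock);
        }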

    Signed-off-by: Kirill Tkhai
    [ Added comments]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • In some cases this can trigger a true flood of output.

    Requested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a race between kthread_stop() and the new wait_woken() that
    can result in a lack of progress.

    CPU 0 | CPU 1
    |
    rfcomm_run() | kthread_stop()
    ... |
    if (!test_bit(KTHREAD_SHOULD_STOP)) |
    | set_bit(KTHREAD_SHOULD_STOP)
    | wake_up_process()
    wait_woken() | wait_for_completion()
    set_current_state(INTERRUPTIBLE) |
    if (!WQ_FLAG_WOKEN) |
    schedule_timeout() |
    |

    After which both tasks will wait.. forever.

    Fix this by having wait_woken() check for kthread_should_stop() but
    only for kthreads (obviously).

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Hurley
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do not
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.
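
    A hedged sketch of the resulting lookup in sched_move_task():

        /* All callers hold task_rq_lock(), so the tg cannot change under
         * us; pass "true" to task_css_check() to silence lockdep. */
        tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
                          struct task_group, css);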

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • The "cpu" argument to rcu_note_context_switch() is always the current
    CPU, so drop it. This in turn allows the "cpu" argument to
    rcu_preempt_note_context_switch() to be removed, which allows the sole
    use of "cpu" in both functions to be replaced with a this_cpu_ptr().
    Again, the anticipated cross-CPU uses of these functions have been
    replaced by NO_HZ_FULL.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     

28 Oct, 2014

6 commits

  • cond_resched() is a preemption point, not strictly a blocking
    primitive, so exclude it from the ->state test.

    In particular, preemption preserves task_struct::state.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: tglx@linutronix.de
    Cc: ilya.dryomov@inktank.com
    Cc: umgwanakikbuti@gmail.com
    Cc: oleg@redhat.com
    Cc: Alex Elder
    Cc: Andrew Morton
    Cc: Axel Lin
    Cc: Daniel Borkmann
    Cc: Dave Jones
    Cc: Jason Baron
    Cc: Linus Torvalds
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Validate we call might_sleep() with TASK_RUNNING, which catches places
    where we nest blocking primitives, e.g. mutex usage in a wait loop.

    Since all blocking is arranged through task_struct::state, nesting
    this will cause the inner primitive to set TASK_RUNNING and the outer
    will thus not block.

    Another observed problem is calling a blocking function from
    schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
    then destroy the task state for the actual __schedule() call that
    comes after it.
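
    A hypothetical illustration of the nesting this check catches (a
    blocking primitive used inside a wait loop):

        set_current_state(TASK_UNINTERRUPTIBLE);
        while (!done) {
                mutex_lock(&lock);      /* blocks; we come back TASK_RUNNING */
                /* ... inspect or update shared state ... */
                mutex_unlock(&lock);
                schedule();             /* the outer wait may no longer block */
                set_current_state(TASK_UNINTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);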

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: tglx@linutronix.de
    Cc: ilya.dryomov@inktank.com
    Cc: umgwanakikbuti@gmail.com
    Cc: oleg@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There are a few places that call blocking primitives from wait loops,
    provide infrastructure to support this without the typical
    task_struct::state collision.

    We record the wakeup in wait_queue_t::flags which leaves
    task_struct::state free to be used by others.
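
    A hedged sketch of the intended usage (woken_wake_function() records the
    wakeup in wait->flags instead of relying on task_struct::state):

        DEFINE_WAIT_FUNC(wait, woken_wake_function);

        add_wait_queue(&wq, &wait);
        while (!condition)
                /* returns early if WQ_FLAG_WOKEN has been set by the waker */
                wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
        remove_wait_queue(&wq, &wait);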

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: tglx@linutronix.de
    Cc: ilya.dryomov@inktank.com
    Cc: umgwanakikbuti@gmail.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Use nr_cpus_allowed to bail from select_task_rq() when only one cpu
    can be used; this saves some cycles for pinned tasks.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • There is no need to balance during fork since SCHED_DEADLINE
    tasks can't fork. This patch avoids the SD_BALANCE_FORK check.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • How we deal with updates to exclusive cpusets is currently broken.
    As an example, suppose we have an exclusive cpuset composed of
    two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE tasks to it
    up to the allowed bandwidth. If we now want to modify cpusetA's
    cpumask, we have to check that removing a cpu's amount of
    bandwidth doesn't break AC guarantees. This isn't checked in the
    current code.

    This patch fixes the problem above, denying an update if the
    new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
    that are currently active.

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli