16 Jan, 2015

3 commits

  • commit fd7de1e8d5b2b2b35e71332fafb899f584597150 upstream.

    Locklessly doing is_idle_task(rq->curr) is only okay because of
    RCU protection. The older variant of the broken code checked
    rq->curr == rq->idle instead and therefore didn't need RCU.

    Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
    Signed-off-by: Andy Lutomirski
    Reviewed-by: Chuansheng Liu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 269ad8015a6b2bb1cf9e684da4921eb6fa0a0c88 upstream.

    The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is <= 0.
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     
  • commit 6a503c3be937d275113b702e0421e5b0720abe8a upstream.

    According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration is performed by doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80 + 40/80 + 120/170 = 1.5809 < 2.

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.
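
    As a quick sanity check on the utilization figure quoted above, the
    following standalone snippet just sums the (runtime, period) pairs of the
    example taskset; it is plain arithmetic, not kernel code.

    #include <stdio.h>

    int main(void)
    {
            /* (runtime, period) pairs of the example taskset above */
            const double tasks[][2] = { { 30, 80 }, { 40, 80 }, { 120, 170 } };
            const int ncpus = 2;
            double u = 0.0;

            for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
                    u += tasks[i][0] / tasks[i][1];

            /* Prints ~1.5809: below the number of CPUs, so global EDF should
             * be able to keep response times bounded for this taskset. */
            printf("total utilization = %.4f (ncpus = %d)\n", u, ncpus);
            return 0;
    }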

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     

04 Dec, 2014

1 commit

  • It appears that some SCHEDULE_USER (asm for schedule_user) callers
    in arch/x86/kernel/entry_64.S are called from RCU kernel context,
    and schedule_user will return in RCU user context. This causes RCU
    warnings and possible failures.

    This is intended to be a minimal fix suitable for 3.18.

    Reported-and-tested-by: Dave Jones
    Cc: Oleg Nesterov
    Cc: Frédéric Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Nov, 2014

1 commit

    Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime():

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                      update_rq_clock(rq);
                      p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N' and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris' observation that this happens faster when he runs the fork bomb
    makes sense, as the fork bomb will kick migration threads more often, so
    the probability of hitting the issue increases.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc, we have
    other means to hit the idle case.
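
    For illustration only, the failure mode boils down to an ops table with
    an optional callback left NULL that a caller invokes unconditionally;
    the no-op entry is the shape of the fix. The structures and names below
    are made up, this is not the kernel's sched_class code.

    #include <stdio.h>

    struct class_ops {
            void (*update_curr)(void);      /* optional callback */
    };

    static void update_curr_fair(void) { puts("fair: stats updated"); }
    static void update_curr_stop(void) { /* nothing to do, but must exist */ }

    static void read_stats(const struct class_ops *ops)
    {
            ops->update_curr();             /* crashes if the callback is NULL */
    }

    int main(void)
    {
            struct class_ops fair = { .update_curr = update_curr_fair };
            struct class_ops stop = { .update_curr = update_curr_stop };  /* no-op instead of NULL */

            read_stats(&fair);
            read_stats(&stop);
            return 0;
    }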

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

16 Nov, 2014

3 commits

    Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in the Y time being smaller than the X time.

    A reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because at start, in thread_group_cputimer(), we
    initialize the cputimer's sum_exec_runtime with the threads' runtime not
    yet accounted, and then add the threads' runtime to the running cputimer
    again on the scheduler tick, making its sum_exec_runtime bigger than the
    actual threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which assures we will have
    updated stats of the running threads, and on the next scheduler tick we
    will account only the runtime that elapsed since the cputimer start.
    That also assures we have a consistent state between the cpu times of
    individual threads and the cpu time of the process consisting of those
    threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <time.h>
    #include <pthread.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
            ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1) ;

            return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
            clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

            return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
            int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
            int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

            return after_i - before_i;
    }

    int main(void)
    {
            int result = 0;
            pthread_t th;

            pthread_barrier_init(&barrier, NULL, 2);

            if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
                    perror("pthread_create");
                    return 1;
            }

            pthread_barrier_wait(&barrier);

            /* The test. */
            struct timespec before, after, sleeptimeabs;
            int64_t sleepdiff, diffabs;
            const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

            /* The relative nanosleep. Not sure why this is needed, but its presence
               seems to make it easier to reproduce the problem. */
            if (do_nanosleep(0, &sleeptime) != 0) {
                    perror("clock_nanosleep");
                    return 1;
            }

            /* Get the current time. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
                    perror("clock_gettime[2]");
                    return 1;
            }

            /* Compute the absolute sleep time based on the current time. */
            uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
            sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
            sleeptimeabs.tv_nsec = nsec % 1000000000;

            /* Sleep for the computed time. */
            if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
                    perror("absolute clock_nanosleep");
                    return 1;
            }

            /* Get the time after the sleep. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
                    perror("clock_gettime[3]");
                    return 1;
            }

            /* The time after sleep should always be equal to or after the absolute sleep
               time passed to clock_nanosleep. */
            sleepdiff = tsdiff(&sleeptimeabs, &after);
            if (sleepdiff < 0) {
                    printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
                    result = 1;

                    printf("Before %llu.%09llu\n",
                           (unsigned long long)before.tv_sec,
                           (unsigned long long)before.tv_nsec);
                    printf("After %llu.%09llu\n",
                           (unsigned long long)after.tv_sec,
                           (unsigned long long)after.tv_nsec);
                    printf("Sleep %llu.%09llu\n",
                           (unsigned long long)sleeptimeabs.tv_sec,
                           (unsigned long long)sleeptimeabs.tv_nsec);
            }

            /* The difference between the timestamps taken before and after the
               clock_nanosleep call should be equal to or more than the duration of the
               sleep. */
            diffabs = tsdiff(&before, &after);
            if (diffabs < sleeptime.tv_nsec) {
                    printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
                    result = 1;
            }

            pthread_cancel(th);

            return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
    While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

      cpu_timer_sample_group()
        thread_group_cputimer()
          thread_group_cputime()
            times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    which would make the sample run ahead, making the sleep short.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive), we can end up migrating and selecting
    ourselves as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.
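
    The 'same lock twice' failure can be seen in miniature with an
    error-checking pthread mutex (a plain non-recursive lock would simply
    deadlock on the second acquisition). This is only an illustration of the
    failure mode, not the scheduler code; build with gcc -pthread.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
            pthread_mutex_t lock;
            pthread_mutexattr_t attr;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
            pthread_mutex_init(&lock, &attr);

            pthread_mutex_lock(&lock);
            /* "Selecting oneself as a swap target": take the same lock again. */
            int ret = pthread_mutex_lock(&lock);
            if (ret == EDEADLK)
                    puts("second lock of the same mutex rejected (EDEADLK)");

            pthread_mutex_unlock(&lock);
            pthread_mutex_destroy(&lock);
            pthread_mutexattr_destroy(&attr);
            return 0;
    }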

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access in this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
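
    A minimal sketch of the guard pattern being described, with made-up
    names: an index computed as 'level - 1' is only valid once at least one
    distance has been recorded, so bail out early while level is still zero.

    #include <stdio.h>

    #define MAX_LEVELS 8

    static int distances[MAX_LEVELS];

    static void init_numa_demo(int level)
    {
            if (!level)             /* non-NUMA system: nothing to index */
                    return;

            /* Safe now: level >= 1, so level - 1 is a valid index. */
            printf("max distance = %d\n", distances[level - 1]);
    }

    int main(void)
    {
            distances[0] = 10;
            init_numa_demo(0);      /* would have read distances[-1] without the guard */
            init_numa_demo(1);
            return 0;
    }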

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

04 Nov, 2014

1 commit

  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do not
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

28 Oct, 2014

8 commits

    1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower priority task. But we must always
    preempt a task of other classes.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl) several times and even fixed
    this function before, but only found this now... I suppose some performance
    tests may work better after this.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • preempt_schedule_context() does preempt_enable_notrace() at the end
    and this can call the same function again; exception_exit() is heavy
    and it is quite possible that need-resched is true again.

    1. Change this code to dec preempt_count() and check need_resched()
    by hand.

    2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
    the enable/disable dance around __schedule(). But in this case
    we need to move this code into sched/core.c.

    3. Cosmetic, but x86 forgets to declare this function. This doesn't
    really matter because it is only called by asm helpers; still, it
    makes sense to add the declaration into asm/preempt.h to match
    preempt_schedule().

    Reported-by: Sasha Levin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces the problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix race in task_scan_min (it depends on compiler behaviour).
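
    A toy model of the shape of the fix, with illustrative names and an
    illustrative formula: read the tunable once into a local (so the
    compiler cannot re-read it between the check and the division) and clamp
    it, so a concurrent write of zero can no longer cause a divide error.

    #include <stdio.h>

    static volatile unsigned int scan_size_mb = 256;   /* stands in for the sysctl */

    static unsigned int scan_period_demo(unsigned int windows)
    {
            unsigned int scan_size = scan_size_mb;     /* single read */

            if (scan_size == 0)
                    scan_size = 1;                     /* clamp: no division by zero */

            return windows * 1000 / scan_size;
    }

    int main(void)
    {
            scan_size_mb = 0;                          /* what "echo 0 > ..." did */
            printf("period = %u\n", scan_period_demo(3));
            return 0;
    }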

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    While offlining a node by hot-removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When a page fault occurs and the page is allocated from node 1,
    task_struct->numa_faults_buffer_memory[] of node 1 is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

       o numa_faults_buffer_memory[]          o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
              |     0     |     1     |
       ----------------------------------     ----------------------
       node 0 |     0     |     0     |       remote |     0     |
       node 1 |     0     |     1     |       local  |     1     |
       ----------------------------------     ----------------------

    2. node 1 is offlined by hot removing memory.

    3. When page fault occurs, fault_types[] is calculated by using
    p->numa_faults_buffer_memory[] of all online nodes in
    task_numa_placement(). But node 1 was offline by step 2. So
    the fault_types[] is calculated by using only
    p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
    are set to 0.

    4. The values (both 0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1. So the following division is
    calculated.

    static void update_task_scan_period(struct task_struct *p,
                                        unsigned long shared, unsigned long private)
    {
            ...
            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are set to 0, so a divide error
    occurs here.

    The divide error is a rare case because the trigger is node offlining.
    This patch always increments the denominator to avoid the divide error.
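
    A toy version of what 'always increments the denominator' means for the
    DIV_ROUND_UP() expression shown above; NUMA_PERIOD_SLOTS is given an
    illustrative value here and the inputs model the post-offline case from
    step 6.

    #include <stdio.h>

    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    #define NUMA_PERIOD_SLOTS   10          /* illustrative value */

    int main(void)
    {
            unsigned long shared = 0, private = 0;     /* both zero after the node went offline */

            /* The original expression divided by (private + shared) == 0.
             * Always adding 1 to the denominator keeps the division defined. */
            unsigned long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
                                               private + shared + 1);

            printf("ratio = %lu\n", ratio);
            return 0;
    }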

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
    Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If the curr task is exiting, this may be a cause of use-after-free:

    task_numa_compare()                     do_exit()
        ...                                     current->flags |= PF_EXITING;
        ...                                     release_task()
        ...                                         ~~delayed_put_task_struct()~~
        ...                                     schedule()
        rcu_read_lock()                         ...
        cur = ACCESS_ONCE(dst_rq->curr)         ...
        ...                                     rq->curr = next;
        ...                                         context_switch()
        ...                                             finish_task_switch()
        ...                                                 put_task_struct()
        ...                                                     __put_task_struct()
        ...                                                         free_task_struct()
        task_numa_assign()                      ...
            get_task_struct()                   ...

    As noted by Oleg, rcu_read_lock() does not help here: the final
    put_task_struct() in finish_task_switch() is not RCU-deferred, so the
    last reference can go away while task_numa_compare() is still looking
    at the task.
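
    The usual defence against this class of race is the 'take a reference
    only if the refcount is still non-zero' pattern. The sketch below shows
    that pattern with C11 atomics and made-up names; it is an illustration
    of the idea, not the kernel's actual fix.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct obj {
            atomic_int refcount;
    };

    static bool obj_tryget(struct obj *o)
    {
            int old = atomic_load(&o->refcount);

            do {
                    if (old == 0)
                            return false;   /* last reference already dropped */
            } while (!atomic_compare_exchange_weak(&o->refcount, &old, old + 1));

            return true;
    }

    int main(void)
    {
            struct obj live, dead;

            atomic_init(&live.refcount, 1);
            atomic_init(&dead.refcount, 0);

            printf("live: %s, dead: %s\n",
                   obj_tryget(&live) ? "got ref" : "skipped",
                   obj_tryget(&dead) ? "got ref" : "skipped");
            return 0;
    }
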
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may experience a race condition against an
    enqueue_dl_entity() called from rt_mutex_setprio(). With his own
    words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task()
    throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
    The race may happen when somebody is changing the task_group of a forking
    task. The child's cgroup is the same as the parent's after dup_task_struct()
    (it's just a memory copy). Also, cfs_rq and rt_rq are the same as the
    parent's.

    But if the parent changes its task_group before cgroup_post_fork() is
    called, we do not reflect this situation on the child. The child's cfs_rq
    and rt_rq remain the same, while the child's task_group changes in
    cgroup_post_fork().

    To fix this we introduce a fork() method, which calls sched_move_task()
    directly. This function changes sched_task_group appropriately (its logic
    also has no problem with freshly created tasks, so we don't need to
    introduce anything special; we can just use it).

    Possibly, this also solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

13 Oct, 2014

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     
  • Pull core locking updates from Ingo Molnar:
    "The main updates in this cycle were:

    - mutex MCS refactoring finishing touches: improve comments, refactor
    and clean up code, reduce debug data structure footprint, etc.

    - qrwlock finishing touches: remove old code, self-test updates.

    - small rwsem optimization

    - various smaller fixes/cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/lockdep: Revert qrwlock recusive stuff
    locking/rwsem: Avoid double checking before try acquiring write lock
    locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
    locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
    locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
    locking/semaphore: Resolve some shadow warnings
    locking/selftest: Support queued rwlock
    locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
    locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
    locking/Documentation: Update locking/mutex-design.txt disadvantages
    locking/Documentation: Move locking related docs into Documentation/locking/
    locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
    locking/mutexes: Refactor optimistic spinning code
    locking/mcs: Remove obsolete comment
    locking/mutexes: Document quick lock release when unlocking
    locking/mutexes: Standardize arguments in lock/unlock slowpaths
    locking: Remove deprecated smp_mb__() barriers

    Linus Torvalds
     

10 Oct, 2014

1 commit

  • 1. vma_policy_mof(task) is simply not safe unless task == current,
    it can race with do_exit()->mpol_put(). Remove this arg and update
    its single caller.

    2. vma can not be NULL, remove this check and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Oct, 2014

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - fix an NFSv4.1 state renewal regression
    - fix open/lock state recovery error handling
    - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    - fix statd when reconnection fails
    - don't wake tasks during connection abort
    - don't start reboot recovery if lease check fails
    - fix duplicate proc entries

    Features:
    - pNFS block driver fixes and clean ups from Christoph
    - More code cleanups from Anna
    - Improve mmap() writeback performance
    - Replace use of PF_TRANS with a more generic mechanism for avoiding
    deadlocks in nfs_release_page"

    * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
    NFSv4.1: Fix an NFSv4.1 state renewal regression
    NFSv4: fix open/lock state recovery error handling
    NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    NFS: Fabricate fscache server index key correctly
    SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
    NFSv3: Fix missing includes of nfs3_fs.h
    NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
    NFS: avoid waiting at all in nfs_release_page when congested.
    NFS: avoid deadlocks with loop-back mounted NFS filesystems.
    MM: export page_wakeup functions
    SCHED: add some "wait..on_bit...timeout()" interfaces.
    NFS: don't use STABLE writes during writeback.
    NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
    rpc: Add -EPERM processing for xs_udp_send_request()
    rpc: return sent and err from xs_sendpages()
    lockd: Try to reconnect if statd has moved
    SUNRPC: Don't wake tasks during connection abort
    Fixing lease renewal
    nfs: fix duplicate proc entries
    pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
    ...

    Linus Torvalds
     

03 Oct, 2014

4 commits

  • rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it
    is not enough. We should use either rcu_read_lock_sched() or preempt_disable().

    Reported-by: Sasha Levin
    Suggested-by: Peter Zijlstra
    Signed-off-by: Kirill Tkhai
    Fixes: 66339c31bc39 "sched: Use dl_bw_of() under RCU read lock"
    Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
    if this is necessary.

    Furthermore, a task of a higher priority class may be current on the dest
    rq; we shouldn't disturb it.

    Signed-off-by: Kirill Tkhai
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • On 32 bit systems cmpxchg cannot handle 64 bit values, so
    some additional magic is required to allow a 32 bit system
    with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.

    Make sure the correct cmpxchg function is used when doing
    an atomic swap of a cputime_t.
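
    For illustration, the kind of operation involved is an atomic 'never go
    backwards' update of a 64-bit time value. The C11 sketch below works
    even on 32-bit targets because the language runtime supplies the wide
    compare-exchange; the kernel has to pick the right cmpxchg variant
    explicitly, which is what this build fix is about. Names are
    illustrative.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint64_t vtime;

    static void advance_to(uint64_t now)
    {
            uint64_t old = atomic_load(&vtime);

            /* Only move forward; a failed CAS reloads 'old' and we retry. */
            while (old < now &&
                   !atomic_compare_exchange_weak(&vtime, &old, now))
                    ;
    }

    int main(void)
    {
            advance_to(100);
            advance_to(50);         /* must not move time backwards */
            advance_to(250);
            printf("vtime = %llu\n", (unsigned long long)atomic_load(&vtime));
            return 0;
    }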

    Reported-by: Arnd Bergmann
    Signed-off-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: linux390@de.ibm.com
    Cc: linux-arch@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...")
    sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
    but is only more loaded than others. This change has been introduced to ensure
    a better load balance in system that are not overloaded but as a side effect,
    it can also generate useless active migration between groups.

    Let's take the example of 3 tasks on a quad-core system. We will always have an
    idle core so the load balance will find a busiest group (core) whenever an ILB
    is triggered and it will force an active migration (once above the
    nr_balance_failed threshold) so the idle core becomes busy but another core
    will become idle. With the next ILB, the freshly idle core will try to pull the
    task of a busy CPU.
    The number of spurious active migrations is not so huge in a quad-core system
    because the ILB is not triggered so much. But it becomes significant as soon as
    you have more than one sched_domain level, like on a dual cluster of quad cores
    where the ILB is triggered every tick when you have more than 1 busy_cpu.

    We need to ensure that the migration generates a real improvement and will not
    only move the avg_load imbalance onto another CPU.

    Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such a use
    case was ensured by the following test in f_b_g:

      if ((local->idle_cpus < busiest->idle_cpus) &&
          busiest->sum_nr_running <= busiest->group_weight)

    This patch modifies the condition to take into account the situation where the
    busiest group is not overloaded: if the diff between the number of idle cpus in
    the two groups is less than or equal to 1 and the busiest group is not
    overloaded, moving a task will not improve the load balance but just move it.
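
    Translated directly into a small standalone predicate (field and
    function names are illustrative, not the kernel's), the new filtering
    condition looks roughly like this:

    #include <stdbool.h>
    #include <stdio.h>

    struct group_stats {
            int idle_cpus;
            int sum_nr_running;
            int group_weight;
    };

    static bool migration_is_useless(const struct group_stats *local,
                                     const struct group_stats *busiest)
    {
            /* Busiest group not overloaded and (almost) as many idle CPUs as
             * the local group: moving a task only shifts the imbalance. */
            return busiest->sum_nr_running <= busiest->group_weight &&
                   local->idle_cpus - busiest->idle_cpus <= 1;
    }

    int main(void)
    {
            struct group_stats local   = { .idle_cpus = 2, .sum_nr_running = 2, .group_weight = 4 };
            struct group_stats busiest = { .idle_cpus = 1, .sum_nr_running = 3, .group_weight = 4 };

            printf("skip active migration: %s\n",
                   migration_is_useless(&local, &busiest) ? "yes" : "no");
            return 0;
    }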

    A test with sysbench on a dual cluster of quad cores gives the following
    results:

    command: sysbench --test=cpu --num-threads=5 --max-time=5 run

    HZ is 200, which means that 1000 ticks have fired during the test.

    With Mainline, perf gives the following figures:

    Samples: 727 of event 'sched:sched_migrate_task'
    Event count (approx.): 727
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    12.52% migration/1 [unknown] [.] 00000000
    12.52% migration/5 [unknown] [.] 00000000
    12.52% migration/7 [unknown] [.] 00000000
    12.10% migration/6 [unknown] [.] 00000000
    11.83% migration/0 [unknown] [.] 00000000
    11.83% migration/3 [unknown] [.] 00000000
    11.14% migration/4 [unknown] [.] 00000000
    10.87% migration/2 [unknown] [.] 00000000
    2.75% sysbench [unknown] [.] 00000000
    0.83% swapper [unknown] [.] 00000000
    0.55% ktps65090charge [unknown] [.] 00000000
    0.41% mmcqd/1 [unknown] [.] 00000000
    0.14% perf [unknown] [.] 00000000

    With this patch, perf gives the following figures

    Samples: 20 of event 'sched:sched_migrate_task'
    Event count (approx.): 20
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    80.00% sysbench [unknown] [.] 00000000
    10.00% swapper [unknown] [.] 00000000
    5.00% ktps65090charge [unknown] [.] 00000000
    5.00% migration/1 [unknown] [.] 00000000

    Signed-off-by: Vincent Guittot
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

25 Sep, 2014

1 commit

  • In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
    sched: Allow wait_on_bit_action() functions to support a timeout

    I suggested that a "wait_on_bit_timeout()" interface would not meet my
    need. This isn't true - I was just over-engineering.

    Including a 'private' field in wait_bit_key instead of a focused
    "timeout" field was just premature generalization. If some other
    use is ever found, it can be generalized or added later.

    So this patch renames "private" to "timeout", with the meaning "stop
    waiting when 'jiffies' reaches or passes 'timeout'",
    and adds two of the many possible wait..bit..timeout() interfaces:

    wait_on_page_bit_killable_timeout(), which is the one I want to use,
    and out_of_line_wait_on_bit_timeout() which is a reasonably general
    example. Others can be added as needed.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: NeilBrown
    Acked-by: Ingo Molnar
    Signed-off-by: Trond Myklebust

    NeilBrown
     

24 Sep, 2014

12 commits

    Some time ago PREEMPT_NEED_RESCHED was implemented,
    so the rescheduling technique is a little more involved now.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Probability of use-after-free isn't zero in this place.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v3.14+
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    Nothing is locked there, so the label's name only confuses a reader.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
    Probability of use-after-free isn't zero here.

    Also add lockdep assert into dl_bw_cpus().

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v3.14+
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
    Combine the two branches which do the same thing.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Kirill found that there's a subtle race in the
    __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the
    entire exception because neither arch that uses it seems to actually
    still require it.

    Boot tested on mips64el (qemu) only.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Kirill Tkhai
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Fenghua Yu
    Cc: James Hogan
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul Burton
    Cc: Qais Yousef
    Cc: Ralf Baechle
    Cc: Tony Luck
    Cc: oleg@redhat.com
    Cc: linux@roeck-us.net
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do
    not need to disable irqs, and they are already disabled by the caller.

    And afaics this lock buys nothing, we can rely on rcu_read_lock().
    In this case it makes sense to also move rcu_read_lock/unlock from
    the caller to print_rq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. read_lock(tasklist_lock) does not need to disable irqs.

    2. ->mm != NULL is a common mistake, use PF_KTHREAD.

    3. The second ->mm check can be simply removed.

    4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) +
    __task_rq_lock().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • tg_has_rt_tasks() wants to find an RT task in this task_group, but
    task_rq(p)->rt.tg wrongly checks the root rt_rq.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
    The code in find_idlest_cpu() looks for the CPU with the smallest load.
    However, if multiple CPUs are idle, the first idle CPU is selected
    irrespective of the depth of its idle state.

    Among the idle CPUs we should pick the one with the shallowest idle
    state or, if all idle CPUs are in the same state, the latest to have gone
    idle. The latter applies even when cpuidle is configured out.
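
    A toy selection loop matching that criterion, with made-up structures:
    among idle CPUs, prefer the smallest exit latency (shallowest state)
    and, on a tie, the most recent idle timestamp.

    #include <stdint.h>
    #include <stdio.h>

    struct cpu_idle_info {
            int      cpu;
            unsigned exit_latency_us;       /* depth of the idle state */
            uint64_t idle_stamp;            /* when the CPU went idle */
    };

    static int pick_idle_cpu(const struct cpu_idle_info *cpus, int n)
    {
            int best = -1;

            for (int i = 0; i < n; i++) {
                    if (best < 0 ||
                        cpus[i].exit_latency_us < cpus[best].exit_latency_us ||
                        (cpus[i].exit_latency_us == cpus[best].exit_latency_us &&
                         cpus[i].idle_stamp > cpus[best].idle_stamp))
                            best = i;
            }
            return best < 0 ? -1 : cpus[best].cpu;
    }

    int main(void)
    {
            const struct cpu_idle_info idle_cpus[] = {
                    { .cpu = 1, .exit_latency_us = 200, .idle_stamp = 1000 },
                    { .cpu = 2, .exit_latency_us =  10, .idle_stamp =  500 },
                    { .cpu = 3, .exit_latency_us =  10, .idle_stamp =  900 },
            };

            /* Picks CPU 3: shallow state and most recently idle. */
            printf("chosen CPU: %d\n", pick_idle_cpu(idle_cpus, 3));
            return 0;
    }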

    This patch doesn't cover the following issues:

    - The idle exit latency of a CPU might be larger than the time needed
    to migrate the waking task to an already running CPU with sufficient
    capacity, and therefore performance would benefit from task packing
    in such case (in most cases task packing is about power saving).

    - Some idle states have a non-negligible and non-abortable entry latency
    which needs to run to completion before the exit latency can start.
    A concurrent patch series is making this info available to the cpuidle
    core. Once available, the entry latency with the idle timestamp could
    determine when the exit latency may be effective.

    Those issues will be handled in due course. In the mean time, what
    is implemented here should improve things already compared to the current
    state of affairs.

    Based on an initial patch from Daniel Lezcano.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Daniel Lezcano
    Cc: "Rafael J. Wysocki"
    Cc: Linus Torvalds
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • When the cpu enters idle, it stores the cpuidle state pointer in its
    struct rq instance which in turn could be used to make a better decision
    when balancing tasks.

    As soon as the cpu exits its idle state, the struct rq reference is
    cleared.

    There are a couple of situations where the idle state pointer could be changed
    while it is being consulted:

    1. For x86/acpi with dynamic c-states, when a laptop switches from battery
    to AC, that could result in removing the deeper idle state. The acpi driver
    triggers:
      'acpi_processor_cst_has_changed'
        'cpuidle_pause_and_lock'
          'cpuidle_uninstall_idle_handler'
            'kick_all_cpus_sync'

    All cpus will exit their idle state and the pointed object will be set to
    NULL.

    2. The cpuidle driver is unloaded. Logically that could happen but not
    in practice because the drivers are always compiled in and 95% of them are
    not coded to unregister themselves. In any case, the unloading code must
    call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
    leading to 'kick_all_cpus_sync' as mentioned above.

    A race can happen if we use the pointer and then one of these two scenarios
    occurs at the same moment.

    In order to be safe, the idle state pointer stored in the rq must be
    used inside a rcu_read_lock section where we are protected with the
    'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
    idle_get_state() and idle_put_state() accessors should be used to that
    effect.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
    Users can perform clustered scheduling using the cpuset facility.
    After an exclusive cpuset is created, task migrations happen only
    between CPUs belonging to the same cpuset. Inter-cpuset migrations
    can only happen when the user requests so, by moving a task between
    different cpusets. This behaviour is broken in SCHED_DEADLINE, as
    currently spurious inter-cpuset migrations may happen without user
    intervention.

    This patch fixes the problem (and shuffles the code a bit to improve
    clarity).

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: raistlin@linux.it
    Cc: michael@amarulasolutions.com
    Cc: fchecconi@gmail.com
    Cc: daniel.wagner@bmw-carit.de
    Cc: vincent@legout.info
    Cc: luca.abeni@unitn.it
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli