09 Jan, 2015

2 commits

  • When alloc_fair_sched_group() in sched_create_group() fails,
    free_sched_group() is called, which in turn calls free_fair_sched_group().
    Since free_fair_sched_group() then calls destroy_cfs_bandwidth() even
    though init_cfs_bandwidth() was never called, an RCU stall occurs in
    hrtimer_cancel():

    INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0)
    Task dump for CPU 1:
    (fprintd) R running task 0 6249 1 0x00000088
    ...
    Call Trace:
    [] sched_show_task+0xa8/0x110
    [] dump_cpu_task+0x3d/0x50
    [] rcu_dump_cpu_stacks+0x90/0xd0
    [] rcu_check_callbacks+0x491/0x700
    [] update_process_times+0x4b/0x80
    [] tick_sched_handle.isra.20+0x36/0x50
    [] tick_sched_timer+0x42/0x70
    [] __run_hrtimer+0x69/0x1a0
    [] ? tick_sched_handle.isra.20+0x50/0x50
    [] hrtimer_interrupt+0xef/0x230
    [] local_apic_timer_interrupt+0x3b/0x70
    [] smp_apic_timer_interrupt+0x45/0x60
    [] apic_timer_interrupt+0x6d/0x80
    [] ? lock_hrtimer_base.isra.23+0x18/0x50
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_try_to_cancel+0x22/0xd0
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_cancel+0x22/0x30
    [] free_fair_sched_group+0x25/0xd0
    [] free_sched_group+0x16/0x40
    [] sched_create_group+0x4b/0x80
    [] sched_autogroup_create_attach+0x43/0x1c0
    [] sys_setsid+0x7c/0x110
    [] system_call_fastpath+0x12/0x17

    Check whether init_cfs_bandwidth() was called before calling
    destroy_cfs_bandwidth().
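
    A minimal user-space model of that guard (per the bracketed note below the
    check lives in destroy_cfs_bandwidth() itself; the structure and function
    names here are illustrative stand-ins, not the kernel's code): the destroy
    path detects that init never ran and returns early instead of cancelling a
    timer that was never set up.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-ins for cfs_bandwidth, init_cfs_bandwidth() and
     * destroy_cfs_bandwidth(); the hrtimer is modelled by a flag. */
    struct list_node { struct list_node *next; };

    struct bandwidth {
        struct list_node throttled_list;
        int timer_armed;
    };

    static void init_bandwidth(struct bandwidth *b)
    {
        b->throttled_list.next = &b->throttled_list;  /* marks "init ran" */
        b->timer_armed = 1;                           /* stands in for hrtimer init/start */
    }

    static void destroy_bandwidth(struct bandwidth *b)
    {
        /* init_bandwidth() was never called: nothing to cancel, bail out. */
        if (!b->throttled_list.next)
            return;

        b->timer_armed = 0;                           /* stands in for hrtimer_cancel() */
    }

    int main(void)
    {
        struct bandwidth *b = calloc(1, sizeof(*b));  /* zero-filled, like kzalloc() */

        if (!b)
            return 1;

        destroy_bandwidth(b);   /* error path: allocation failed later, init never ran */

        init_bandwidth(b);
        destroy_bandwidth(b);   /* normal teardown */

        free(b);
        return 0;
    }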

    Signed-off-by: Tetsuo Handa
    [ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul Turner
    Cc: Ben Segall
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     
  • In effective_load() we have (long w * unsigned long tg->shares) / long W;
    when w is negative, it is converted to unsigned long and hence the product
    becomes insanely large. Fix this by casting tg->shares to long.
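
    A standalone illustration of the conversion at fault (the variables below
    only mimic the operand types involved; this is not the kernel code). On a
    64-bit build, mixing a negative long with an unsigned long promotes the
    whole expression to unsigned arithmetic, so the division produces a huge
    bogus value instead of a small negative one:

    #include <stdio.h>

    int main(void)
    {
        long w = -512;               /* negative weight delta */
        unsigned long shares = 1024; /* tg->shares is unsigned long */
        long W = 4096;

        /* w is converted to unsigned long, so both the product and the
         * division are done in unsigned arithmetic. */
        long bad = (w * shares) / W;

        /* Casting shares to long keeps everything signed, as the fix does. */
        long good = (w * (long)shares) / W;

        printf("bad  = %ld\n", bad);    /* roughly 4.5e15 */
        printf("good = %ld\n", good);   /* -128 */
        return 0;
    }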

    Reported-by: Sasha Levin
    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dave Jones
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     

16 Nov, 2014

6 commits

  • Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
    'true' on a busier sd") changes groups to be ranked in the order of
    overloaded > imbalance > other, and busiest group is picked according
    to this order.

    sgs->group_capacity_factor is used to check if the group is overloaded.

    When the child domain prefers tasks to go to siblings first,
    sgs->group_capacity_factor is lowered to one in order to move all the
    excess tasks away.

    However, the group's overloaded status is not updated when
    sgs->group_capacity_factor is lowered, which can cause us to miss the
    busiest group.

    This patch fixes that by updating the group's overloaded status when the
    sg capacity factor is lowered to one, so that the busiest group is found
    accurately.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Vincent Guittot
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
    [ Fixed the changelog. ]
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
    This change will make fair.c, rt.c, and deadline.c all start with the
    same logic.
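
    A compilable user-space sketch of that consolidation (the types and
    signatures are heavily simplified, so treat the shapes below as
    assumptions rather than the kernel's API): the nr_cpus_allowed check is
    done once in the generic wrapper instead of in every class callback.

    #include <stdio.h>

    struct task;

    struct sched_class {
        int (*select_task_rq)(struct task *p, int cpu, int flags);
    };

    struct task {
        int nr_cpus_allowed;
        const struct sched_class *sched_class;
    };

    static int fair_select_rq(struct task *p, int cpu, int flags)
    {
        (void)p; (void)cpu; (void)flags;
        return 2;   /* stands in for the fair-class placement logic */
    }

    static const struct sched_class fair_class = {
        .select_task_rq = fair_select_rq,
    };

    static int select_task_rq(struct task *p, int cpu, int flags)
    {
        /* The check hoisted out of fair.c, rt.c and deadline.c: */
        if (p->nr_cpus_allowed > 1)
            cpu = p->sched_class->select_task_rq(p, cpu, flags);

        return cpu;
    }

    int main(void)
    {
        struct task pinned   = { .nr_cpus_allowed = 1, .sched_class = &fair_class };
        struct task unpinned = { .nr_cpus_allowed = 4, .sched_class = &fair_class };

        printf("pinned   -> CPU %d\n", select_task_rq(&pinned, 0, 0));   /* stays on 0 */
        printf("unpinned -> CPU %d\n", select_task_rq(&unpinned, 0, 0)); /* class picks 2 */
        return 0;
    }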

    Suggested-and-Acked-by: Steven Rostedt
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Nobody iterates over numa_group::task_list; it just confuses readers.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in the Y time being smaller than the X time.

    The reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when run on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because, on start, thread_group_cputimer() initializes
    the cputimer's sum_exec_runtime with thread runtime that has not yet been
    accounted, and then adds that thread runtime to the running cputimer again
    on the scheduler tick, making its sum_exec_runtime bigger than the actual
    thread runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which ensures we have up-to-date
    stats for the running threads, so that on the next scheduler tick we
    account only the runtime that elapsed since the cputimer started. It also
    ensures a consistent state between the CPU times of the individual threads
    and the CPU time of the process consisting of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
            ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1) ;

        return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
        clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

        return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
        int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
        int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

        return after_i - before_i;
    }

    int main(void)
    {
        int result = 0;
        pthread_t th;

        pthread_barrier_init(&barrier, NULL, 2);

        if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }

        pthread_barrier_wait(&barrier);

        /* The test. */
        struct timespec before, after, sleeptimeabs;
        int64_t sleepdiff, diffabs;
        const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

        /* The relative nanosleep. Not sure why this is needed, but its presence
           seems to make it easier to reproduce the problem. */
        if (do_nanosleep(0, &sleeptime) != 0) {
            perror("clock_nanosleep");
            return 1;
        }

        /* Get the current time. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
            perror("clock_gettime[2]");
            return 1;
        }

        /* Compute the absolute sleep time based on the current time. */
        uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
        sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
        sleeptimeabs.tv_nsec = nsec % 1000000000;

        /* Sleep for the computed time. */
        if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
            perror("absolute clock_nanosleep");
            return 1;
        }

        /* Get the time after the sleep. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
            perror("clock_gettime[3]");
            return 1;
        }

        /* The time after sleep should always be equal to or after the absolute sleep
           time passed to clock_nanosleep. */
        sleepdiff = tsdiff(&sleeptimeabs, &after);
        if (sleepdiff < 0) {
            printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
            result = 1;

            printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
            printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
            printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
        }

        /* The difference between the timestamps taken before and after the
           clock_nanosleep call should be equal to or more than the duration of the
           sleep. */
        diffabs = tsdiff(&before, &after);
        if (diffabs < sleeptime.tv_nsec) {
            printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
            result = 1;
        }

        pthread_cancel(th);

        return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Because the whole NUMA task selection stuff runs with preemption
    enabled (it's long and expensive), we can end up migrating and selecting
    ourselves as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Nov, 2014

2 commits

  • This patch simplifies task_struct by removing the four numa_* pointers
    into the same array and replacing them with a single array pointer. By
    doing this, the size of task_struct is reduced by 3 ulong pointers
    (24 bytes on x86_64).

    A new parameter is added to the task_faults_idx function so that it can return
    an index to the correct offset, corresponding with the old precalculated
    pointers.

    All of the code in sched/ that depended on task_faults_idx and numa_* was
    changed in order to match the new logic.
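
    A standalone model of the data-structure change (the region names, node
    count and layout below are assumptions chosen for illustration; the
    kernel's actual indexing may differ): four per-node statistics regions
    that used to be reached through four separate pointers become one
    allocation, and the index helper gains a parameter selecting the region.

    #include <stdio.h>
    #include <stdlib.h>

    enum stats_region { FAULTS_MEM, FAULTS_CPU, FAULTS_MEMBUF, FAULTS_CPUBUF, NR_REGIONS };

    #define NR_NODES 4   /* nodes in this toy example */

    /* Was: four precalculated pointers; now: one base pointer plus this
     * helper, which takes the extra 'region' parameter. */
    static int stats_idx(enum stats_region region, int nid, int priv)
    {
        return (region * NR_NODES + nid) * 2 + priv;
    }

    int main(void)
    {
        unsigned long *stats = calloc(NR_REGIONS * NR_NODES * 2, sizeof(*stats));

        if (!stats)
            return 1;

        stats[stats_idx(FAULTS_MEM, 1, 0)]++;   /* roughly: the old numa_faults_memory[] */
        printf("%lu\n", stats[stats_idx(FAULTS_MEM, 1, 0)]);

        free(stats);
        return 0;
    }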

    Signed-off-by: Iulia Manda
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: dave@stgolabs.net
    Cc: riel@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
    Signed-off-by: Ingo Molnar

    Iulia Manda
     
  • An idle CPU is always idler than a non-idle CPU, so we needn't search for
    the least-loaded CPU once we have found an idle one.

    Signed-off-by: Yao Dongdong
    Reviewed-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
    Signed-off-by: Ingo Molnar

    Yao Dongdong
     

28 Oct, 2014

7 commits

  • In pseudo-interleaved numa_groups, all tasks try to relocate to
    the group's preferred_nid. When a group is spread across multiple
    NUMA nodes, this can lead to tasks swapping their location with
    other tasks inside the same group, instead of swapping location with
    tasks from other NUMA groups. This can keep NUMA groups from converging.

    Examining all nodes, when dealing with a task in a pseudo-interleaved
    NUMA group, avoids this problem. Note that only CPUs in nodes that
    improve the task or group score are examined, so the loop isn't too
    bad.

    Tested-by: Vinod Chegu
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Vinod Chegu"
    Cc: mgorman@suse.de
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • On systems with complex NUMA topologies, the node scoring is adjusted
    to allow workloads to converge on nodes that are near each other.

    The way a task group's preferred nid is determined needs to be adjusted,
    in order for the preferred_nid to be consistent with group_weight scoring.
    This ensures that we actually try to converge workloads on adjacent nodes.

    Signed-off-by: Rik van Riel
    Tested-by: Chegu Vinod
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • In order to do task placement on systems with complex NUMA topologies,
    it is necessary to count the faults on nodes nearby the node that is
    being examined for a potential move.

    In case of a system with a backplane interconnect, we are dealing with
    groups of NUMA nodes; each of the nodes within a group is the same number
    of hops away from nodes in other groups in the system. Optimal placement
    on this topology is achieved by counting all nearby nodes equally. When
    comparing nodes A and B at distance N, nearby nodes are those at distances
    smaller than N from nodes A or B.

    Placement strategy on a system with a glueless mesh NUMA topology needs
    to be different, because there are no natural groups of nodes determined
    by the hardware. Instead, when dealing with two nodes A and B at distance
    N, N >= 2, there will be intermediate nodes at distance < N from both nodes
    A and B. Good placement can be achieved by right shifting the faults on
    nearby nodes by the number of hops from the node being scored. In this
    context, a nearby node is any node less than the maximum distance in the
    system away from the node; nodes at the maximum distance are skipped for
    efficiency reasons, there is no real policy reason to do so.

    Placement policy on directly connected NUMA systems is not affected.

    Signed-off-by: Rik van Riel
    Tested-by: Chegu Vinod
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Preparatory patch for adding NUMA placement on systems with
    complex NUMA topology. Also fix a potential divide-by-zero
    in group_weight().

    Signed-off-by: Rik van Riel
    Tested-by: Chegu Vinod
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • The file /proc/sys/kernel/numa_balancing_scan_size_mb allows zero to be
    written.

    This bash command reproduces the problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix race in task_scan_min (it depends on compiler behaviour).

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • While offlining a node by hot-removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When a page fault occurs and a page is allocated from node 1,
    node 1's entry in task_struct->numa_faults_buffer_memory[] is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

           o numa_faults_buffer_memory[]           o numa_faults_locality[]
                NR_NUMA_HINT_FAULT_TYPES
           |     0     |     1     |
    ---------------------------------------       ----------------------
    node 0 |     0     |     0     |               remote |     0     |
    node 1 |     0     |     1     |               local  |     1     |
    ---------------------------------------       ----------------------

    2. node 1 is offlined by hot removing memory.

    3. When a page fault occurs, fault_types[] is calculated in
    task_numa_placement() using p->numa_faults_buffer_memory[] of all
    online nodes. But node 1 was offlined in step 2, so fault_types[]
    is calculated using only node 0's p->numa_faults_buffer_memory[],
    and both entries of fault_types[] end up as 0.

    4. The values (0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1, so the following division is
    performed:

    static void update_task_scan_period(struct task_struct *p,
                unsigned long shared, unsigned long private)
    {
        ...
        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are 0, so a divide error
    occurs here.

    The divide error is a rare case because the trigger is node offlining.
    This patch always increments the denominator to avoid the divide error.
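
    A standalone rendering of that change (NUMA_PERIOD_SLOTS' value and the
    surrounding program are assumptions for illustration; the point is only
    the "+ 1" in the denominator): with both fault counts at zero the
    division stays well defined.

    #include <stdio.h>

    #define NUMA_PERIOD_SLOTS 10
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    static unsigned long scan_ratio(unsigned long private, unsigned long shared)
    {
        /* The old form, DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared),
         * divides by zero when private == shared == 0. */
        return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared + 1);
    }

    int main(void)
    {
        printf("%lu\n", scan_ratio(0, 0));   /* 0, instead of a divide error */
        printf("%lu\n", scan_ratio(3, 1));   /* 6 */
        return 0;
    }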

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If the curr task is exiting, this can lead to a use-after-free:

    task_numa_compare()                      do_exit()
        ...                                      current->flags |= PF_EXITING;
        ...                                      release_task()
        ...                                          ~~delayed_put_task_struct()~~
        ...                                      schedule()
        rcu_read_lock()                          ...
        cur = ACCESS_ONCE(dst_rq->curr)          ...
        ...                                      rq->curr = next;
        ...                                          context_switch()
        ...                                              finish_task_switch()
        ...                                                  put_task_struct()
        ...                                                      __put_task_struct()
        ...                                                          free_task_struct()
        task_numa_assign()                       ...
            get_task_struct()                    ...

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

10 Oct, 2014

1 commit

  • 1. vma_policy_mof(task) is simply not safe unless task == current;
    it can race with do_exit()->mpol_put(). Remove this arg and update
    its single caller.

    2. vma cannot be NULL; remove this check and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

03 Oct, 2014

2 commits

  • We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
    if this is necessary.

    Furthermore, a higher priority class task may be current on the dest rq;
    we shouldn't disturb it.

    Signed-off-by: Kirill Tkhai
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...")
    sd_pick_busiest() can return a group that is neither imbalanced nor overloaded
    but merely more loaded than others. This change was introduced to ensure
    better load balancing in systems that are not overloaded, but as a side effect
    it can also generate useless active migrations between groups.

    Take the example of 3 tasks on a quad core system. We will always have an
    idle core, so the load balancer will find a busiest group (core) whenever an
    ILB is triggered and will force an active migration (once above the
    nr_balance_failed threshold), so the idle core becomes busy but another core
    becomes idle. With the next ILB, the freshly idle core will try to pull a
    task from a busy CPU.

    The number of spurious active migrations is not so large on a quad core system
    because the ILB is not triggered that often. But it becomes significant as soon
    as you have more than one sched_domain level, like on a dual cluster of quad
    cores where the ILB is triggered every tick whenever there is more than one
    busy CPU.

    We need to ensure that the migration generates a real improvement and does not
    just move the avg_load imbalance onto another CPU.

    Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such use
    cases was ensured by the following test in f_b_g():

    if ((local->idle_cpus < busiest->idle_cpus) &&
        busiest->sum_nr_running <= busiest->group_weight)

    This patch modifies the condition to take into account the situation where the
    busiest group is not overloaded: if the difference between the number of idle
    CPUs in the two groups is less than or equal to 1 and the busiest group is not
    overloaded, moving a task will not improve the load balance but merely move it.
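
    A standalone sketch of the new filter as described above (the struct and
    function are invented for the example; the field names follow the
    changelog): when the busiest group is not overloaded and the idle-CPU
    difference is at most one, balancing is skipped because it would only
    move the imbalance around.

    #include <stdbool.h>
    #include <stdio.h>

    struct sg_stats {
        unsigned int idle_cpus;
        unsigned int sum_nr_running;
        unsigned int group_weight;
    };

    static bool worth_balancing(const struct sg_stats *local,
                                const struct sg_stats *busiest)
    {
        bool overloaded = busiest->sum_nr_running > busiest->group_weight;

        /* Not overloaded and at most one idle CPU of difference: moving a
         * task would only shift the imbalance to another CPU, so skip it. */
        if (!overloaded && local->idle_cpus <= busiest->idle_cpus + 1)
            return false;

        return true;
    }

    int main(void)
    {
        struct sg_stats local   = { .idle_cpus = 2, .sum_nr_running = 2, .group_weight = 4 };
        struct sg_stats busiest = { .idle_cpus = 1, .sum_nr_running = 3, .group_weight = 4 };

        printf("%s\n", worth_balancing(&local, &busiest) ? "balance" : "skip");  /* skip */
        return 0;
    }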

    A test with sysbench on a dual clusters of quad cores gives the following
    results:

    command: sysbench --test=cpu --num-threads=5 --max-time=5 run

    The HZ is 200, which means that 1000 ticks have fired during the test.

    With Mainline, perf gives the following figures:

    Samples: 727 of event 'sched:sched_migrate_task'
    Event count (approx.): 727
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    12.52% migration/1 [unknown] [.] 00000000
    12.52% migration/5 [unknown] [.] 00000000
    12.52% migration/7 [unknown] [.] 00000000
    12.10% migration/6 [unknown] [.] 00000000
    11.83% migration/0 [unknown] [.] 00000000
    11.83% migration/3 [unknown] [.] 00000000
    11.14% migration/4 [unknown] [.] 00000000
    10.87% migration/2 [unknown] [.] 00000000
    2.75% sysbench [unknown] [.] 00000000
    0.83% swapper [unknown] [.] 00000000
    0.55% ktps65090charge [unknown] [.] 00000000
    0.41% mmcqd/1 [unknown] [.] 00000000
    0.14% perf [unknown] [.] 00000000

    With this patch, perf gives the following figures

    Samples: 20 of event 'sched:sched_migrate_task'
    Event count (approx.): 20
    Overhead Command Shared Object Symbol
    ........ ............... ............. ..............
    80.00% sysbench [unknown] [.] 00000000
    10.00% swapper [unknown] [.] 00000000
    5.00% ktps65090charge [unknown] [.] 00000000
    5.00% migration/1 [unknown] [.] 00000000

    Signed-off-by: Vincent Guittot
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

24 Sep, 2014

3 commits

  • Combine two branches which do the same thing.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • The code in find_idlest_cpu() looks for the CPU with the smallest load.
    However, if multiple CPUs are idle, the first idle CPU is selected
    irrespective of the depth of its idle state.

    Among the idle CPUs we should pick the one with the shallowest idle
    state, or the latest to have gone idle if all idle CPUs are in the same
    state. The latter applies even when cpuidle is configured out.
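
    A standalone model of that selection rule (the structures and fields are
    invented for the example; the kernel's cpuidle integration is far more
    involved): among idle CPUs prefer the shallowest idle state, break ties
    by the most recent idle timestamp, and fall back to load comparison only
    when no CPU is idle.

    #include <limits.h>
    #include <stdio.h>

    struct cpu_info {
        int idle;                        /* 1 if the CPU is idle */
        unsigned int exit_latency;       /* cost of leaving its idle state */
        unsigned long long idle_stamp;   /* when the CPU went idle */
        unsigned long load;              /* only used if no CPU is idle */
    };

    static int pick_cpu(const struct cpu_info *cpu, int n)
    {
        unsigned int min_exit = UINT_MAX;
        unsigned long long latest = 0;
        unsigned long min_load = ULONG_MAX;
        int best_idle = -1, least_loaded = -1;

        for (int i = 0; i < n; i++) {
            if (cpu[i].idle) {
                if (cpu[i].exit_latency < min_exit ||
                    (cpu[i].exit_latency == min_exit && cpu[i].idle_stamp > latest)) {
                    min_exit = cpu[i].exit_latency;
                    latest = cpu[i].idle_stamp;
                    best_idle = i;
                }
            } else if (best_idle < 0 && cpu[i].load < min_load) {
                min_load = cpu[i].load;     /* least-loaded fallback */
                least_loaded = i;
            }
        }
        return best_idle >= 0 ? best_idle : least_loaded;
    }

    int main(void)
    {
        struct cpu_info cpus[] = {
            { 1, 500, 100, 0 },   /* deep idle state */
            { 1,  10, 120, 0 },   /* shallow idle state: preferred */
            { 0,   0,   0, 42 },  /* busy */
        };
        printf("picked CPU %d\n", pick_cpu(cpus, 3));   /* prints 1 */
        return 0;
    }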

    This patch doesn't cover the following issues:

    - The idle exit latency of a CPU might be larger than the time needed
    to migrate the waking task to an already running CPU with sufficient
    capacity, and therefore performance would benefit from task packing
    in such case (in most cases task packing is about power saving).

    - Some idle states have a non negligible and non abortable entry latency
    which needs to run to completion before the exit latency can start.
    A concurrent patch series is making this info available to the cpuidle
    core. Once available, the entry latency with the idle timestamp could
    determine when the exit latency may be effective.

    Those issues will be handled in due course. In the meantime, what
    is implemented here should already improve things compared to the current
    state of affairs.

    Based on an initial patch from Daniel Lezcano.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Daniel Lezcano
    Cc: "Rafael J. Wysocki"
    Cc: Linus Torvalds
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • current->state == TASK_DEAD means that the task is doing its
    last schedule(); a page fault is obviously impossible at this
    stage.

    Signed-off-by: Oleg Nesterov
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

21 Sep, 2014

1 commit


19 Sep, 2014

7 commits

  • Currently the task always wakes affine on this_cpu if the latter is idle.
    Now, before waking up the task on this_cpu, we check that this_cpu's capacity
    is not significantly reduced because of RT tasks or irq activity.

    Use cases where the number of irqs and/or the time spent under irq is
    significant will benefit from this, because a task that is woken up by an irq
    or softirq will not use the same CPU as the irq (and softirq) but an idle one.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: preeti@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: Morten.Rasmussen@arm.com
    Cc: efault@gmx.de
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • 'capacity_orig' is only changed for systems with an SMT sched_domain level, in
    order to reflect the lower capacity of CPUs. Heterogeneous systems also have to
    reflect an original capacity that is different from the default value.

    Create a more generic function arch_scale_cpu_capacity that can be also used by
    non SMT platforms to set capacity_orig.

    The __weak implementation of arch_scale_cpu_capacity() is the previous SMT variant,
    in order to keep backward compatibility with the use of capacity_orig.

    arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed as
    they were not used elsewhere than in arch_scale_cpu_capacity().

    Signed-off-by: Vincent Guittot
    Reviewed-by: Kamalesh Babulal
    Reviewed-by: Preeti U. Murthy
    [ Added default_scale_cpu_capacity() back. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: riel@redhat.com
    Cc: Morten.Rasmussen@arm.com
    Cc: efault@gmx.de
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • The computation of avg_load and avg_load_per_task should only take into
    account the number of CFS tasks. The non-CFS tasks are already taken into
    account by decreasing the CPU's capacity and they will be tracked in the
    CPU's utilization (group_utilization) of the next patches.

    Reviewed-by: Preeti U Murthy
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: riel@redhat.com
    Cc: Morten.Rasmussen@arm.com
    Cc: efault@gmx.de
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • In wake_affine() I have tried to understand the meaning of the condition:

    (this_load <= load &&
     this_load + target_load(prev_cpu, idx) <= tl_per_task)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: preeti@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: Morten.Rasmussen@arm.com
    Cc: efault@gmx.de
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • The imbalance flag can stay set even though there is no imbalance.

    Let's assume that we have 3 tasks running on a dual core / dual cluster system.
    We will have some idle load balances triggered during the tick.
    Unfortunately, the tick is also used to queue background work, so we can reach
    a situation where short work has been queued on a CPU which already runs a
    task. The load balancer will detect this imbalance (2 tasks on 1 CPU and an
    idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting
    task is a worker thread that is pinned to a CPU, so an imbalance due to a
    pinned task is detected and the imbalance flag is set.

    Then we will not be able to clear the flag, because we have at most 1 task on
    each CPU, but the imbalance flag will trigger useless active load balancing
    between the idle CPU and the busy CPU.

    We need to reset the imbalance flag as soon as we have reached a balanced
    state. If all tasks are pinned, we don't consider that a balanced state and
    leave the imbalance flag set.

    Signed-off-by: Vincent Guittot
    Reviewed-by: Preeti U Murthy
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: riel@redhat.com
    Cc: Morten.Rasmussen@arm.com
    Cc: efault@gmx.de
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • new_cpu is reassigned below, so we do not need this here.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529276.3569.24.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • The code in task_numa_compare() will examine at most one idle CPU per node,
    because they all have the same score. However, some idle CPUs are better
    candidates than others, due to busy or idle SMT siblings, etc.

    The scheduler has logic to find the best CPU within an LLC to place a
    task. The NUMA code should probably use it.

    This seems to reduce the standard deviation for single instance SPECjbb2005
    with a low warehouse count on my 4 node test system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140904163530.189d410a@cuia.bos.redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

09 Sep, 2014

1 commit

  • When running workloads on 2+ socket systems, based on perf profiles, the
    update_cfs_rq_blocked_load() function often shows up as taking up a
    noticeable % of run time.

    Much of the contention is in __update_cfs_rq_tg_load_contrib() when we
    update the tg load contribution stats. However, it turns out that in many
    cases, they don't need to be updated and "tg_contrib" is 0.

    This patch adds a check in __update_cfs_rq_tg_load_contrib() to skip updating
    the tg load contribution stats when nothing needs to be updated. This avoids
    unnecessary cacheline contention.
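
    A standalone illustration of the idea (the atomic below stands in for the
    shared per-task-group load sum; this is not the kernel's accounting):
    when the computed delta is zero, skipping the read-modify-write avoids
    touching the contended cacheline at all.

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long shared_load;          /* shared across CPUs/sockets */

    static void update_contrib(long new_contrib, long *cached_contrib)
    {
        long delta = new_contrib - *cached_contrib;

        if (!delta)                          /* nothing changed: no shared write */
            return;

        atomic_fetch_add(&shared_load, delta);
        *cached_contrib = new_contrib;
    }

    int main(void)
    {
        long cached = 0;

        update_contrib(100, &cached);        /* writes the shared counter */
        update_contrib(100, &cached);        /* no-op: skips the atomic entirely */

        printf("%ld\n", (long)atomic_load(&shared_load));   /* 100 */
        return 0;
    }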

    Reviewed-by: Ben Segall
    Reviewed-by: Waiman Long
    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Cc: jason.low2@hp.com
    Cc: Yuyang Du
    Cc: Aswin Chandramouleeswaran
    Cc: Chegu Vinod
    Cc: Scott J Norton
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409643684.19197.15.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     

07 Sep, 2014

1 commit

  • An overrun could happen in function start_hrtick_dl()
    when a task with SCHED_DEADLINE runs in the microseconds
    range.

    For example, if a task with SCHED_DEADLINE has the following parameters:

    Task runtime deadline period
    P1 200us 500us 500us

    The deadline and period from task P1 are less than 1ms.

    In order to achieve microsecond precision, we need to enable HRTICK feature
    by the next command:

    PC#echo "HRTICK" > /sys/kernel/debug/sched_features
    PC#trace-cmd record -e sched_switch &
    PC#./schedtool -E -t 200000:500000:500000 -e ./test

    The binary test is in an endless while(1) loop here.
    Some pieces of trace.dat are as follows:

    -0 157.603157: sched_switch: :R ==> 2481:4294967295: test
    test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2
    -0 157.605657: sched_switch: :R ==> 2481:4294967295: test
    test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd
    trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test

    We can get the runtime of P1 from the information above:

    runtime = 157.608183 - 157.605657
    runtime = 0.002526(2.526ms)

    The correct runtime should be less than or equal to 200us at some point.

    The problem is caused by the conditional check "delta > 10000"
    in start_hrtick_dl().

    Because of it, no hrtimer is started to control the remaining runtime
    when the remaining runtime is less than 10us.

    So the process continues to run until the next tick arrives.

    Move the code enforcing the minimum time slice
    from hrtick_start_fair() to hrtick_start(), because the
    EDF scheduling class also needs it in start_hrtick_dl().

    To fix this problem, we call hrtick_start() unconditionally in
    start_hrtick_dl(), and make sure the scheduling slice won't be smaller
    than 10us in hrtick_start().
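
    A minimal standalone sketch of that clamping (the function and constant
    names are invented; only the 10us floor comes from the changelog): the
    tick timer is always armed, but never for less than the minimum slice.

    #include <stdio.h>

    #define MIN_SLICE_NS 10000ULL   /* 10us lower bound */

    static unsigned long long hrtick_delay(unsigned long long remaining_ns)
    {
        return remaining_ns < MIN_SLICE_NS ? MIN_SLICE_NS : remaining_ns;
    }

    int main(void)
    {
        printf("%llu\n", hrtick_delay(2500));     /* clamped to 10000 */
        printf("%llu\n", hrtick_delay(200000));   /* unchanged */
        return 0;
    }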

    Signed-off-by: Xiaofeng Yan
    Reviewed-by: Li Zefan
    Acked-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
    [ Massaged the changelog and the code. ]
    Signed-off-by: Ingo Molnar

    xiaofeng.yan
     

05 Sep, 2014

1 commit

  • The use of "rcu_assign_pointer()" is NULLing out the pointer.
    According to RCU_INIT_POINTER()'s block comment:

    "1. This use of RCU_INIT_POINTER() is NULLing out the pointer"

    it is better to use it instead of rcu_assign_pointer() because it has a
    smaller overhead.

    The following Coccinelle semantic patch was used:
    @@
    @@

    - rcu_assign_pointer
    + RCU_INIT_POINTER
    (..., NULL)

    Signed-off-by: Andreea-Cristina Bernat
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: paulmck@linux.vnet.ibm.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140822145043.GA580@ada
    Signed-off-by: Ingo Molnar

    Andreea-Cristina Bernat
     

28 Aug, 2014

1 commit

  • __get_cpu_var can paper over differences in the definitions of
    cpumask_var_t and either use the address of the cpumask variable
    directly or perform a fetch of the address of the struct cpumask
    allocated elsewhere. This is important particularly when using per cpu
    cpumask_var_t declarations because in one case we have an offset into
    a per cpu area to handle and in the other case we need to fetch a
    pointer from the offset.

    This patch introduces a new macro

    this_cpu_cpumask_var_ptr()

    that is defined where cpumask_var_t is defined and performs the proper
    actions. All use cases where __get_cpu_var is used with cpumask_var_t
    are converted to the use of this_cpu_cpumask_var_ptr().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

20 Aug, 2014

3 commits

  • Avoid double_rq_lock() and use TASK_ON_RQ_MIGRATING for
    load_balance(). The advantage is (obviously) not holding two
    rq->lock's at the same time and thereby increasing parallelism.

    Further note that if there was no task to migrate we will not
    have acquired the second rq->lock at all.

    The important point to note is that because we acquire dst->lock
    immediately after releasing src->lock the potential wait time of
    task_rq_lock() callers on TASK_ON_RQ_MIGRATING is not longer
    than it would have been in the double rq lock scenario.

    Signed-off-by: Kirill Tkhai
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Oleg Nesterov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Kirill Tkhai
    Cc: Tim Chen
    Cc: Nicolas Pitre
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1408528109.23412.94.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Avoid double_rq_lock() and use the TASK_ON_RQ_MIGRATING state for
    active_load_balance_cpu_stop(). The advantage is (obviously) not
    holding two 'rq->lock's at the same time and thereby increasing
    parallelism.

    Further note that if there was no task to migrate we will not
    have acquired the second rq->lock at all.

    The important point to note is that because we acquire dst->lock
    immediately after releasing src->lock the potential wait time of
    task_rq_lock() callers on TASK_ON_RQ_MIGRATING is not longer
    than it would have been in the double rq lock scenario.

    Signed-off-by: Kirill Tkhai
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Oleg Nesterov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Kirill Tkhai
    Cc: Tim Chen
    Cc: Nicolas Pitre
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1408528081.23412.92.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Implement task_on_rq_queued() and use it everywhere instead of
    the on_rq check. No functional changes.

    The only exception is that we do not use the wrapper in
    check_for_tasks(), because it would require exporting
    task_on_rq_queued() in global header files. The next patch in the series
    will bring it back, so we avoid moving it back and forth.

    Signed-off-by: Kirill Tkhai
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Oleg Nesterov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Kirill Tkhai
    Cc: Tim Chen
    Cc: Nicolas Pitre
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai