16 Nov, 2014

11 commits

  • Actually, cpupri_set() and cpupri_init() can never be used without
    CONFIG_SMP.

    Signed-off-by: pang.xunlei
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Steven Rostedt
    Cc: Juri Lelli
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org
    Signed-off-by: Ingo Molnar

    pang.xunlei
     
  • Do not call dequeue_pushable_dl_task() when failing to push an eligible
    task, as it remains pushable, merely not at this particular moment.

    Actually, this is the same behavior as commit 311e800e16f6 ("sched, rt:
    Fix rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
    'true' on a busier sd") changes groups to be ranked in the order of
    overloaded > imbalance > other, and busiest group is picked according
    to this order.

    sgs->group_capacity_factor is used to check if the group is overloaded.

    When the child domain prefers tasks to go to siblings first, the
    sgs->group_capacity_factor will be set lower than one in order to
    move all the excess tasks away.

    However, the group overloaded status is not updated when
    sgs->group_capacity_factor is set to lower than one, which can cause the
    busiest group to be missed.

    This patch fixes it by updating the group overloaded status in that case,
    so that the busiest group is found accurately.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Vincent Guittot
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
    [ Fixed the changelog. ]
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
    This change will make fair.c, rt.c, and deadline.c all start with the
    same logic.
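
    To illustrate the pattern, here is a plain userspace sketch (the struct
    and function names are invented for illustration; this is not the kernel
    code): the common caller does the nr_cpus_allowed check once, so the
    per-class callbacks no longer need it.

    #include <stdio.h>

    struct task { int nr_cpus_allowed; int cpu; };

    /* Per-class callback, now free of the common guard. */
    static int fair_select_cpu(const struct task *p)
    {
        return p->cpu + 1;
    }

    /* Common caller: performs the guard once for every class. */
    static int select_cpu(const struct task *p)
    {
        if (p->nr_cpus_allowed == 1)    /* pinned task: nothing to choose */
            return p->cpu;
        return fair_select_cpu(p);
    }

    int main(void)
    {
        struct task pinned  = { .nr_cpus_allowed = 1, .cpu = 2 };
        struct task movable = { .nr_cpus_allowed = 4, .cpu = 2 };

        printf("pinned  -> CPU %d\n", select_cpu(&pinned));   /* stays on 2 */
        printf("movable -> CPU %d\n", select_cpu(&movable));  /* class decides */
        return 0;
    }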

    Suggested-and-Acked-by: Steven Rostedt
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • As discussed in [1], accounting IO is meant for blkio only. Document that
    so driver authors won't use them for device io.

    [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470

    Signed-off-by: Wolfram Sang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: One Thousand Gnomes
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
    Signed-off-by: Ingo Molnar

    Wolfram Sang
     
  • Remove question mark:

    s/New utsname group?/New utsname namespace

    Unified style for IPC:

    s/New ipcs/New ipc namespace

    Signed-off-by: Chen Hanxiao
    Acked-by: Serge E. Hallyn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jiri Kosina
    Cc: Linus Torvalds
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/1415091082-15093-1-git-send-email-chenhanxiao@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Chen Hanxiao
     
  • Nobody iterates over numa_group::task_list; it just confuses readers.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") fixes one
    glibc test case at the cost of breaking another one. After that commit,
    calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y)
    can result in Y being smaller than X.

    A reproducer/tester can be found further below; it can be compiled and
    run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because, on start, thread_group_cputimer() initializes
    the cputimer's sum_exec_runtime with thread runtime that is not yet
    accounted, and then adds the threads' runtime to the running cputimer
    again on the scheduler tick, making its sum_exec_runtime bigger than the
    actual threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which assures we have updated
    stats for the running threads, so on the next scheduler tick we account
    only the runtime that elapsed since the cputimer start. That also
    assures a consistent state between the cpu times of individual threads
    and the cpu time of the process consisting of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
            ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1)
            ;

        return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
        clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

        return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
        int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
        int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

        return after_i - before_i;
    }

    int main(void)
    {
        int result = 0;
        pthread_t th;

        pthread_barrier_init(&barrier, NULL, 2);

        if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }

        pthread_barrier_wait(&barrier);

        /* The test. */
        struct timespec before, after, sleeptimeabs;
        int64_t sleepdiff, diffabs;
        const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

        /* The relative nanosleep.  Not sure why this is needed, but its
           presence seems to make it easier to reproduce the problem. */
        if (do_nanosleep(0, &sleeptime) != 0) {
            perror("clock_nanosleep");
            return 1;
        }

        /* Get the current time. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
            perror("clock_gettime[2]");
            return 1;
        }

        /* Compute the absolute sleep time based on the current time. */
        uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
        sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
        sleeptimeabs.tv_nsec = nsec % 1000000000;

        /* Sleep for the computed time. */
        if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
            perror("absolute clock_nanosleep");
            return 1;
        }

        /* Get the time after the sleep. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
            perror("clock_gettime[3]");
            return 1;
        }

        /* The time after sleep should always be equal to or after the
           absolute sleep time passed to clock_nanosleep. */
        sleepdiff = tsdiff(&sleeptimeabs, &after);
        if (sleepdiff < 0) {
            printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
            result = 1;

            printf("Before %llu.%09llu\n",
                   (unsigned long long)before.tv_sec, (unsigned long long)before.tv_nsec);
            printf("After %llu.%09llu\n",
                   (unsigned long long)after.tv_sec, (unsigned long long)after.tv_nsec);
            printf("Sleep %llu.%09llu\n",
                   (unsigned long long)sleeptimeabs.tv_sec, (unsigned long long)sleeptimeabs.tv_nsec);
        }

        /* The difference between the timestamps taken before and after the
           clock_nanosleep call should be equal to or more than the duration
           of the sleep. */
        diffabs = tsdiff(&before, &after);
        if (diffabs < sleeptime.tv_nsec) {
            printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
            result = 1;
        }

        pthread_cancel(th);

        return result;
    }
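
    Not part of the original report: a small companion check of the
    consistency property mentioned above, assuming Linux/glibc (compile with
    -pthread). The process CPU clock, read last, should never be behind the
    sum of the per-thread CPU clocks read just before it.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void *spin(void *arg)
    {
        for (;;)
            ;                       /* burn CPU so the clocks advance */
        return NULL;
    }

    static double clock_sec(clockid_t clk)
    {
        struct timespec ts;
        clock_gettime(clk, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        pthread_t th;
        clockid_t th_clock, my_clock;

        if (pthread_create(&th, NULL, spin, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        sleep(1);

        pthread_getcpuclockid(th, &th_clock);
        pthread_getcpuclockid(pthread_self(), &my_clock);

        /* Read the per-thread clocks first, the process clock last. */
        double threads = clock_sec(th_clock) + clock_sec(my_clock);
        double process = clock_sec(CLOCK_PROCESS_CPUTIME_ID);

        printf("sum of thread clocks: %f\n", threads);
        printf("process clock:        %f\n", process);
        printf("%s\n", process >= threads ? "consistent" : "INCONSISTENT");
        return 0;
    }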

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.
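
    A toy rendering of that double accounting, with made-up numbers (not
    kernel code):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long accounted = 1000;  /* runtime already accounted (ns) */
        unsigned long long delta     = 250;   /* ran since the last accounting  */

        /* task_sched_runtime()-style sum already includes the delta. */
        unsigned long long sum_exec_runtime = accounted + delta;

        /* Adding task_delta_exec() on top counts the delta twice. */
        unsigned long long sample_old = sum_exec_runtime + delta;  /* 1500, ahead   */
        unsigned long long sample_new = sum_exec_runtime;          /* 1250, correct */

        printf("old sample: %llu, new sample: %llu\n", sample_old, sample_new);
        return 0;
    }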

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive) we can end up migrating and selecting
    ourselves as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access on this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
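
    A minimal sketch of the failure mode and the early exit, with invented
    names and values (not the kernel code):

    #include <stdio.h>

    #define MAX_LEVELS 8

    static int distance[MAX_LEVELS];

    static void init_numa(int level)
    {
        if (level == 0)         /* non-NUMA system: nothing to derive */
            return;             /* the early exit the patch adds      */

        /* Safe only for level >= 1; level == 0 would read distance[-1]. */
        printf("max distance: %d\n", distance[level - 1]);
    }

    int main(void)
    {
        distance[0] = 10;
        distance[1] = 20;

        init_numa(0);           /* returns early instead of reading [-1] */
        init_numa(2);           /* prints 20 */
        return 0;
    }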

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

04 Nov, 2014

23 commits

  • This patch simplifies task_struct by removing the four separate numa_*
    pointers and replacing them with a single array pointer. By doing this,
    the size of task_struct is reduced by 3 ulong pointers (24 bytes on
    x86_64).

    A new parameter is added to the task_faults_idx function so that it can return
    an index to the correct offset, corresponding with the old precalculated
    pointers.

    All of the code in sched/ that depended on task_faults_idx and numa_* was
    changed in order to match the new logic.
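
    A rough userspace sketch of the layout change (the enum, sizes and exact
    helper signature below are illustrative, not the kernel definitions):
    several per-kind pointers become one array plus an index helper.

    #include <stdio.h>
    #include <stdlib.h>

    enum faults_kind { FAULTS_MEM, FAULTS_CPU, FAULTS_MEMBUF, FAULTS_CPUBUF, FAULTS_NR };

    #define NR_NODES 4

    /* One allocation instead of four separate pointers. */
    static unsigned long *numa_faults;

    /* Map a (kind, node, private-vs-shared) triple to a flat index. */
    static int task_faults_idx(enum faults_kind kind, int nid, int priv)
    {
        return kind * NR_NODES * 2 + nid * 2 + priv;
    }

    int main(void)
    {
        numa_faults = calloc(FAULTS_NR * NR_NODES * 2, sizeof(*numa_faults));
        if (!numa_faults)
            return 1;

        numa_faults[task_faults_idx(FAULTS_MEM, 1, 0)] += 3;  /* shared, node 1  */
        numa_faults[task_faults_idx(FAULTS_CPU, 2, 1)] += 1;  /* private, node 2 */

        printf("mem/node1/shared:  %lu\n", numa_faults[task_faults_idx(FAULTS_MEM, 1, 0)]);
        printf("cpu/node2/private: %lu\n", numa_faults[task_faults_idx(FAULTS_CPU, 2, 1)]);

        free(numa_faults);
        return 0;
    }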

    Signed-off-by: Iulia Manda
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mgorman@suse.de
    Cc: dave@stgolabs.net
    Cc: riel@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
    Signed-off-by: Ingo Molnar

    Iulia Manda
     
  • There are both UP and SMP versions of pull_dl_task(), so there is no
    need to check CONFIG_SMP in switched_from_dl().

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • In switched_from_dl() we have to issue a resched if we successfully
    pulled some task from other cpus. This patch also aligns the behavior
    with -rt.

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch pushes the task away if its deadline is equal to current's
    during wakeup; this is the same behavior as the rt class.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch adds printing of the deadline rq status.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • The yield semantics of the deadline class are to reduce the remaining
    runtime to zero, and update_curr_dl() will then stop the task. However,
    the consumed bandwidth is subtracted from the yielding task's budget
    again, even though it has already been set to zero, which leads to an
    artificial overrun. This patch fixes it by making sure update_curr_dl()
    does not steal any more time from a task that has yielded.
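
    A toy of the accounting problem, with made-up numbers (not the kernel
    code):

    #include <stdio.h>

    int main(void)
    {
        long long runtime    = 3000;   /* remaining budget, in microseconds   */
        long long delta_exec = 1200;   /* time consumed since the last update */
        int dl_yielded       = 1;      /* the task called sched_yield()       */

        if (dl_yielded)
            runtime = 0;               /* yield: give up whatever is left     */
        else
            runtime -= delta_exec;     /* normal accounting                   */

        /* The bug was doing both: zeroing the runtime and then subtracting
         * delta_exec as well, driving it negative and faking an overrun. */
        printf("remaining runtime: %lld\n", runtime);
        return 0;
    }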

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch checks in advance whether current can be pushed/pulled
    somewhere else, to make the logic clear; this is the same behavior as
    the dl class.

    - If current can't be migrated, rescheduling is useless; let's hope the
    task can be moved out.
    - If the task is migratable, let's not reschedule it and instead see if
    it can be pushed or pulled somewhere else.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • As per commit f10e00f4bf36 ("sched/dl: Use dl_bw_of() under
    rcu_read_lock_sched()"), dl_bw_of() has to be protected by
    rcu_read_lock_sched().

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • An idle cpu is always idler than a non-idle cpu, so there is no need to
    keep searching for the least_loaded_cpu once an idle cpu has been found.
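
    A tiny sketch of the shortcut with invented per-cpu loads (not kernel
    code): once an idle cpu is seen, the least-loaded scan can stop.

    #include <stdio.h>

    int main(void)
    {
        int load[] = { 7, 3, 0, 5 };    /* per-cpu load; cpu 2 is idle */
        int ncpus = 4, best = 0;

        for (int cpu = 0; cpu < ncpus; cpu++) {
            if (load[cpu] == 0) {       /* idle: cannot do better     */
                best = cpu;
                break;                  /* stop searching immediately */
            }
            if (load[cpu] < load[best])
                best = cpu;
        }

        printf("chosen cpu: %d\n", best);  /* prints 2 */
        return 0;
    }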

    Signed-off-by: Yao Dongdong
    Reviewed-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
    Signed-off-by: Ingo Molnar

    Yao Dongdong
     
  • The currently used hrtimer_try_to_cancel() is racy:

    raw_spin_lock(&rq->lock)
    ...                             dl_task_timer                 raw_spin_lock(&rq->lock)
    ...                               raw_spin_lock(&rq->lock)    ...
    switched_from_dl()              ...                           ...
      hrtimer_try_to_cancel()       ...                           ...
      switched_to_fair()            ...                           ...
    ...                             ...                           ...
    ...                             ...                           ...
    raw_spin_unlock(&rq->lock)      ...                           (acquired)
    ...                             ...                           ...
    ...                             ...                           ...
    do_exit()                       ...                           ...
      schedule()                    ...                           ...
      raw_spin_lock(&rq->lock)      ...                           raw_spin_unlock(&rq->lock)
      ...                           ...                           ...
      raw_spin_unlock(&rq->lock)    ...                           raw_spin_lock(&rq->lock)
      ...                           ...                           (acquired)
      put_task_struct()             ...                           ...
        free_task_struct()          ...                           ...
      ...                           ...                           raw_spin_unlock(&rq->lock)
      ...                           (acquired)                    ...
      ...                           ...                           ...
      ...                           (use after free)              ...

    So, let's implement a 100% guaranteed way to cancel the timer and be
    sure we are safe even in very unlikely situations.

    rq unlocking does not limit the area of switched_from_dl() use, because
    this has already been possible in pull_dl_task() below.

    Let's consider the safety of this unlocking. The new code in the patch
    comes into play when hrtimer_try_to_cancel() fails. This means the
    callback is running. In this case hrtimer_cancel() just waits until the
    callback has finished. Two cases are possible:

    1) Since we are in switched_from_dl(), the new class is not
    dl_sched_class and the new prio is not less than MAX_DL_PRIO. So the
    callback returns early, right after the !dl_task() check, and after
    that hrtimer_cancel() returns too.

    The above is:

    raw_spin_lock(rq->lock);            ...
    ...                                 dl_task_timer()
    ...                                   raw_spin_lock(rq->lock);
    switched_from_dl()                  ...
      hrtimer_try_to_cancel()           ...
    raw_spin_unlock(rq->lock);          ...
    hrtimer_cancel()                    ...
    ...                                   raw_spin_unlock(rq->lock);
    ...                                   return HRTIMER_NORESTART;
    ...                                 ...
    raw_spin_lock(rq->lock);            ...

    2) But the below is also possible:

                                        dl_task_timer()
                                          raw_spin_lock(rq->lock);
                                          ...
                                          raw_spin_unlock(rq->lock);
    raw_spin_lock(rq->lock);            ...
    switched_from_dl()                  ...
      hrtimer_try_to_cancel()           ...
    ...                                 return HRTIMER_NORESTART;
    raw_spin_unlock(rq->lock);          ...
    hrtimer_cancel();                   ...
    raw_spin_lock(rq->lock);            ...

    In this case hrtimer_cancel() returns immediately. Very unlikely case,
    just to mention.

    Nobody can manipulate the task, because check_class_changed() is
    always called with pi_lock locked. Nobody can force the task to
    participate in (concurrent) priority inheritance schemes (the same reason).

    All concurrent task operations require pi_lock, which is held by us.
    No deadlocks with dl_task_timer() are possible, because it returns
    right after !dl_task() check (it does nothing).

    If a new dl_task arrives while the rq is unlocked, we simply no longer
    need to do the pull_dl_task() in switched_from_dl().

    Signed-off-by: Kirill Tkhai
    [ Added comments]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • In some cases this can trigger a true flood of output.

    Requested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a
    wait loop. The wait loop relies on current->state to function, but so
    does mutex_lock(); nesting them causes the inner one to destroy the
    outer one's state.

    Fix this using the new wait_woken() bits.

    Reported-by: Fengguang Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: David S. Miller
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: David Gibson
    Cc: Eric Biederman
    Cc: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: Jerry Chu
    Cc: Jiri Pirko
    Cc: John Fastabend
    Cc: Linus Torvalds
    Cc: Nicolas Dichtel
    Cc: sfeldma@cumulusnetworks.com
    Cc: stephen hemminger
    Cc: Tom Gundersen
    Cc: Tom Herbert
    Cc: Veaceslav Falico
    Cc: Vlad Yasevich
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rfcomm_run() is a tad broken in that it has a nested wait loop. One
    cannot rely on p->state for the outer wait because the inner wait will
    overwrite it.

    Fix this using the new wait_woken() facility.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Hurley
    Cc: Alexander Holler
    Cc: David S. Miller
    Cc: Gustavo Padovan
    Cc: Joe Perches
    Cc: Johan Hedberg
    Cc: Libor Pechacek
    Cc: Linus Torvalds
    Cc: Marcel Holtmann
    Cc: Seung-Woo Kim
    Cc: Vignesh Raman
    Cc: linux-bluetooth@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The kauditd_thread wait loop is a bit iffy; it has a number of problems:

    - it calls try_to_freeze() before schedule(); you typically want the
    thread to re-evaluate the sleep condition when unfreezing, and
    freeze_task() also issues a wakeup.

    - it unconditionally does the {add,remove}_wait_queue(), even when the
    sleep condition is false.

    Use wait_event_freezable() that does the right thing.

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eric Paris
    Cc: oleg@redhat.com
    Cc: Eric Paris
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is no user.. make it go away.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: oleg@redhat.com
    Cc: Rafael Wysocki
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Pavel Machek
    Cc: linux-pm@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • Provide better implementations of wait_event_freezable() APIs.

    The problem is with freezer_do_not_count(): it hides the thread from
    the freezer, even though the thread might not actually freeze or sleep
    at all.

    Cc: oleg@redhat.com
    Cc: Rafael Wysocki

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Pavel Machek
    Cc: Rafael J. Wysocki
    Cc: linux-pm@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-d86fz1jmso9wjxa8jfpinp8o@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a race between kthread_stop() and the new wait_woken() that
    can result in a lack of progress.

    CPU 0                                     | CPU 1
                                              |
    rfcomm_run()                              | kthread_stop()
      ...                                     |
      if (!test_bit(KTHREAD_SHOULD_STOP))     |
                                              |   set_bit(KTHREAD_SHOULD_STOP)
                                              |   wake_up_process()
      wait_woken()                            |   wait_for_completion()
        set_current_state(INTERRUPTIBLE)      |
        if (!WQ_FLAG_WOKEN)                   |
          schedule_timeout()                  |
                                              |

    After which both tasks will wait.. forever.

    Fix this by having wait_woken() check for kthread_should_stop() but
    only for kthreads (obviously).

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Hurley
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do no
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Pull pin-control fixes from Linus Walleij:
    "This kernel cycle has been calm for both pin control and GPIO so far
    but here are three pin control patches for you anyway, only really
    dealing with Baytrail:

    - Two fixes for the Baytrail driver affecting IRQs and output state
    in sysfs
    - Use the linux-gpio mailing list also for pinctrl patches"

    * tag 'pinctrl-v3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
    pinctrl: baytrail: show output gpio state correctly on Intel Baytrail
    pinctrl: use linux-gpio mailing list
    pinctrl: baytrail: Clear DIRECT_IRQ bit

    Linus Torvalds
     
  • Pull CMA and DMA-mapping fixes from Marek Szyprowski:
    "This contains important fixes for recently introduced highmem support
    for default contiguous memory region used for dma-mapping subsystem"

    * 'fixes-for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
    mm, cma: make parameters order consistent in func declaration and definition
    mm: cma: Use %pa to print physical addresses
    mm: cma: Ensure that reservations never cross the low/high mem boundary
    mm: cma: Always consider a 0 base address reservation as dynamic
    mm: cma: Don't crash on allocation if CMA area can't be activated

    Linus Torvalds
     
  • Pull ceph fixes from Sage Weil:
    "There is a GFP flag fix from Mike Christie, an error code fix from
    Jan, and fixes for two unnecessary allocations (kmalloc and workqueue)
    from Ilya. All are well tested.

    Ilya has one other fix on the way but it didn't get tested in time"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    libceph: eliminate unnecessary allocation in process_one_ticket()
    rbd: Fix error recovery in rbd_obj_read_sync()
    libceph: use memalloc flags for net IO
    rbd: use a single workqueue for all devices

    Linus Torvalds
     
  • Pull m68k update from Geert Uytterhoeven.

    Just wiring up the bpf system call.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
    m68k: Wire up bpf

    Linus Torvalds
     
  • Pull ARM SoC fixes from Olof Johansson:
    "A surprisingly small batch of fixes for -rc3. Suspiciously small, I'd
    say.

    Anyway, most of this are a few defconfig updates. Some for omap to
    deal with kernel binary size (moving ipv6 to module, etc). A larger
    one for socfpga that refreshes with some churn, but also turns on a
    few options that makes the newly-added board in my bootfarm usable for
    testing.

    OMAP3 will also now warn when booted with legacy (non-DT) boot
    protocols, hopefully encouraging those who still care about some of
    those platforms to submit DT support and report bugs where needed.
    Nothing stops working though, this is just to warn for future
    deprecation.

    Beyond this, very few actual bugfixes: a PXA fix for DEBUG_LL boot
    hangs, a missing terminating entry in a dt_match array on RealView, and
    an MTD fix on OMAP with NAND"

    [ Obviously missed rc3, will make rc4 instead ;) ]

    * tag 'armsoc-for-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    MAINTAINERS: drop list entry for davinci
    ARM: OMAP2+: Warn about deprecated legacy booting mode
    ARM: omap2plus_defconfig: Fix errors with NAND BCH
    ARM: multi_v7_defconfig: fix support for APQ8084
    soc: versatile: Add terminating entry for realview_soc_of_match
    ARM: ixp4xx: remove compilation warnings in io.h
    MAINTAINERS: Add Soren as reviewer for Zynq
    ARM: omap2plus_defconfig: Fix bloat caused by having ipv6 built-in
    ARM: socfpga_defconfig: Update defconfig for SoCFPGA
    ARM: pxa: fix hang on startup with DEBUG_LL

    Linus Torvalds
     

03 Nov, 2014

5 commits

  • Linus Torvalds
     
  • Pull MTD fixes from Brian Norris:
    "Three main MTD fixes for 3.18:

    - A regression from 3.16 which was noticed in 3.17. With the
    restructuring of the m25p80.c driver and the SPI NOR library
    framework, we omitted proper listing of the SPI device IDs. This
    means m25p80.c wouldn't auto-load (modprobe) properly when built as
    a module. For now, we duplicate the device IDs into both modules.

    - The OMAP / ELM modules were depending on an implicit link ordering.
    Use deferred probing so that the new link order (in 3.18-rc) can
    still allow for successful probing.

    - Fix suspend/resume support for LH28F640BF NOR flash"

    * tag 'for-linus-20141102' of git://git.infradead.org/linux-mtd:
    mtd: cfi_cmdset_0001.c: fix resume for LH28F640BF chips
    mtd: omap: fix mtd devices not showing up
    mtd: m25p80,spi-nor: Fix module aliases for m25p80
    mtd: spi-nor: make spi_nor_scan() take a chip type name, not spi_device_id
    mtd: m25p80: get rid of spi_get_device_id

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "This is a set of six patches consisting of:
    - two MAINTAINER updates
    - two scsi-mq fixs for the old parallel interface (not every request
    is tagged and we need to set the right flags to populate the SPI
    tag message)
    - a fix for a memory leak in scatterlist traversal caused by a
    preallocation update in 3.17
    - an ipv6 fix for cxgbi"

    [ The scatterlist fix also came in separately through the block layer tree ]

    * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    MAINTAINERS: ufs - remove self
    MAINTAINERS: change hpsa and cciss maintainer
    libcxgbi : support ipv6 address host_param
    scsi: set REQ_QUEUE for the blk-mq case
    Revert "block: all blk-mq requests are tagged"
    lib/scatterlist: fix memory leak with scsi-mq

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "Nothing too astounding or major: radeon, i915, vmwgfx, armada and
    exynos.

    Biggest ones:
    - vmwgfx has one big locking regression fix
    - i915 has some displayport fixes
    - radeon has some stability and a memory alloc failure
    - armada and exynos have some vblank fixes"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (24 commits)
    drm/exynos: correct connector->dpms field before resuming
    drm/exynos: enable vblank after DPMS on
    drm/exynos: init kms poll at the end of initialization
    drm/exynos: propagate plane initialization errors
    drm/exynos: vidi: fix build warning
    drm/exynos: remove explicit encoder/connector de-initialization
    drm/exynos: init vblank with real number of crtcs
    drm/vmwgfx: Filter out modes those cannot be supported by the current VRAM size.
    drm/vmwgfx: Fix hash key computation
    drm/vmwgfx: fix lock breakage
    drm/i915/dp: only use training pattern 3 on platforms that support it
    drm/radeon: remove some buggy dead code
    drm/i915: Ignore VBT backlight check on Macbook 2, 1
    drm/radeon: remove invalid pci id
    drm/radeon: dpm fixes for asrock systems
    radeon: clean up coding style differences in radeon_get_bios()
    drm/radeon: Use drm_malloc_ab instead of kmalloc_array
    drm/radeon/dpm: disable ulv support on SI
    drm/i915: Fix GMBUSFREQ on vlv/chv
    drm/i915: Ignore long hpds on eDP ports
    ...

    Linus Torvalds
     
  • …/git/tmlind/linux-omap into fixes

    Merge "omap fixes against v3.18-rc2" from Tony Lindgren:

    A few fixes for omaps: enable NAND BCH so devices won't
    produce errors when booted with omap2plus_defconfig, and
    reduce bloat by making IPV6 a loadable module.

    Also let's add a warning about legacy boot being deprecated
    for omap3.

    We now have things working with device tree, and only omap3 is
    still booting in legacy mode. So hopefully this warning will
    help move the remaining legacy mode users to boot with device
    tree.

    As the total reduction of code and static data is somewhere
    around 20000 lines of code once we remove omap3 legacy mode
    booting, we really do want to make omap3 to boot also in
    device tree mode only over the next few merge cycles.

    * tag 'fixes-against-v3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: (407 commits)
    ARM: OMAP2+: Warn about deprecated legacy booting mode
    ARM: omap2plus_defconfig: Fix errors with NAND BCH
    ARM: omap2plus_defconfig: Fix bloat caused by having ipv6 built-in
    + Linux 3.18-rc2

    Signed-off-by: Olof Johansson <olof@lixom.net>

    Olof Johansson