24 Nov, 2014

2 commits

  • x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set, but
    not on non-paranoid returns. I suspect that this is a mistake and that
    the code only works because int3 is paranoid.

    Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
    for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
    from the uprobes code.

    Reported-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Borislav Petkov
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime():

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                update_rq_clock(rq);
                p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N' and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris's observation that this happens faster when he runs the fork bomb
    makes sense as the fork bomb will kick migration threads more often so
    the probability to hit the issue will increase.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc we have
    other means to hit the idle case.

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
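
The crash described in the task_sched_runtime() commit above comes down to calling through a function pointer that one scheduling class never filled in. The following is a minimal user-space sketch of that pattern, not kernel code: struct rq and struct sched_class here are simplified stand-ins, and call_update_curr() is a hypothetical helper that reports the NULL case instead of crashing, which the real kernel call site does not do.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's runqueue (assumption). */
struct rq {
    unsigned long long curr_updates;    /* how often update_curr ran */
};

/* Simplified stand-in for the scheduling class ops table (assumption). */
struct sched_class {
    const char *name;
    void (*update_curr)(struct rq *rq); /* NULL if the class has none */
};

static void update_curr_fair(struct rq *rq)
{
    rq->curr_updates++;
}

/* Shape of the fix: an empty callback instead of a NULL pointer. */
static void update_curr_stop(struct rq *rq)
{
    (void)rq;
}

static const struct sched_class fair_class = { "fair", update_curr_fair };
static const struct sched_class stop_class_buggy = { "stop", NULL };
static const struct sched_class stop_class_fixed = { "stop", update_curr_stop };

/*
 * Hypothetical helper modelling the call in task_sched_runtime().
 * Returns 0 if the callback was invoked, -1 if the unconditional call
 * would have gone through a NULL pointer.
 */
static int call_update_curr(const struct sched_class *sc, struct rq *rq)
{
    if (sc->update_curr == NULL)
        return -1;
    sc->update_curr(rq);
    return 0;
}
```

With stop_class_buggy the helper returns -1, which corresponds to the oops; with stop_class_fixed the call succeeds and does nothing, which is all the stop and idle classes need.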
     

22 Nov, 2014

2 commits


16 Nov, 2014

4 commits

  • Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in Y being smaller than X.

    The reproducer/tester can be found further below; it can be compiled and
    run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because on start, in thread_group_cputimer(), we
    initialize the cputimer's sum_exec_runtime with thread runtime that is
    not yet accounted, and then add that runtime to the running cputimer
    again on the scheduler tick, making its sum_exec_runtime bigger than the
    actual runtime of the threads.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which ensures that the stats of
    running threads are up to date, so that on the next scheduler tick we
    account only the runtime that elapsed since the cputimer started. That
    also ensures a consistent state between the CPU times of individual
    threads and the CPU time of the process that consists of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
        ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1) ;

        return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
        clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

        return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
        int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
        int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

        return after_i - before_i;
    }

    int main(void)
    {
        int result = 0;
        pthread_t th;

        pthread_barrier_init(&barrier, NULL, 2);

        if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }

        pthread_barrier_wait(&barrier);

        /* The test. */
        struct timespec before, after, sleeptimeabs;
        int64_t sleepdiff, diffabs;
        const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

        /* The relative nanosleep. Not sure why this is needed, but its presence
           seems to make it easier to reproduce the problem. */
        if (do_nanosleep(0, &sleeptime) != 0) {
            perror("clock_nanosleep");
            return 1;
        }

        /* Get the current time. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
            perror("clock_gettime[2]");
            return 1;
        }

        /* Compute the absolute sleep time based on the current time. */
        uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
        sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
        sleeptimeabs.tv_nsec = nsec % 1000000000;

        /* Sleep for the computed time. */
        if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
            perror("absolute clock_nanosleep");
            return 1;
        }

        /* Get the time after the sleep. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
            perror("clock_gettime[3]");
            return 1;
        }

        /* The time after sleep should always be equal to or after the absolute sleep
           time passed to clock_nanosleep. */
        sleepdiff = tsdiff(&sleeptimeabs, &after);
        if (sleepdiff < 0) {
            printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
            result = 1;

            /* Cast explicitly so the arguments match the %llu conversions. */
            printf("Before %llu.%09llu\n",
                   (unsigned long long) before.tv_sec,
                   (unsigned long long) before.tv_nsec);
            printf("After %llu.%09llu\n",
                   (unsigned long long) after.tv_sec,
                   (unsigned long long) after.tv_nsec);
            printf("Sleep %llu.%09llu\n",
                   (unsigned long long) sleeptimeabs.tv_sec,
                   (unsigned long long) sleeptimeabs.tv_nsec);
        }

        /* The difference between the timestamps taken before and after the
           clock_nanosleep call should be equal to or more than the duration of the
           sleep. */
        diffabs = tsdiff(&before, &after);
        if (diffabs < sleeptime.tv_nsec) {
            printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
            result = 1;
        }

        pthread_cancel(th);

        return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
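
The double accounting described in the cpu-timer commit above can be seen with plain arithmetic. The sketch below is a user-space model under simplifying assumptions (a single-thread group, and a hypothetical struct cputime_model that splits runtime into an already accounted part and a pending delta); the function names are illustrative and are not the kernel's.

```c
#include <stdint.h>

/* Hypothetical model of one running task's runtime bookkeeping. */
struct cputime_model {
    uint64_t accounted_ns;  /* runtime already folded into the counters */
    uint64_t pending_ns;    /* runtime since the last accounting point */
};

/* Plays the role of task_sched_runtime(): accounted plus pending. */
static uint64_t model_sched_runtime(const struct cputime_model *t)
{
    return t->accounted_ns + t->pending_ns;
}

/* Buggy sample: the pending delta is added to a sum that includes it. */
static uint64_t sample_buggy(const struct cputime_model *t)
{
    return model_sched_runtime(t) + t->pending_ns;
}

/* Fixed sample: the pending delta is counted exactly once. */
static uint64_t sample_fixed(const struct cputime_model *t)
{
    return model_sched_runtime(t);
}
```

For a task with 1000 ns accounted and 50 ns pending, the buggy sample reads 1100 ns instead of 1050 ns, so a timer compared against it appears to have run longer than it really has and the sleep ends early.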
     
  • Because the whole NUMA task selection stuff runs with preemption
    enabled (it's long and expensive) we can end up migrating and selecting
    oneself as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
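
As a rough illustration of the self-swap problem in the NUMA commit above, the sketch below models candidate selection in user space and skips the task that is doing the selecting. struct numa_task, its score field, and pick_swap_target() are hypothetical names introduced for this example; the actual kernel change is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a task considered for a swap. */
struct numa_task {
    int pid;
    int score;  /* higher means a better swap candidate (assumption) */
};

/*
 * Return the best-scoring candidate that is not the selecting task
 * itself, or NULL if no such candidate exists.
 */
static const struct numa_task *
pick_swap_target(const struct numa_task *self,
                 const struct numa_task *cands, size_t n)
{
    const struct numa_task *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Swapping with oneself would take the same lock twice. */
        if (cands[i].pid == self->pid)
            continue;
        if (best == NULL || cands[i].score > best->score)
            best = &cands[i];
    }
    return best;
}
```

The important line is the early `continue`: without it, a task that migrated during the preemptible scan could show up in its own candidate list and win.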
     
  • When a CPU is hotplugged out, we call perf_remove_from_context() (via
    perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
    context, but leave siblings grouped together. Freeing of these events is
    left to the mercy of the usual refcounting.

    When a CPU-bound event's refcount drops to zero we cross-call to
    __perf_remove_from_context() to clean it up, detaching grouped siblings.

    This works when the relevant CPU is online, but will fail if the CPU is
    currently offline, and we won't detach the event from its siblings
    before freeing the event, leaving the sibling list corrupt. If the
    sibling list is later walked (e.g. because the CPU came online again
    before a remaining sibling's refcount drops to zero), we will walk the
    now corrupted siblings list, potentially dereferencing garbage values.

    Given that the events should never be scheduled again (as we removed
    them from their context), we can simply detach siblings when the CPU
    goes down in the first place. If the CPU comes back online, the
    redundant call to __perf_remove_from_context() is safe.

    Reported-by: Drew Richardson
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent.weaver@maine.edu
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

15 Nov, 2014

1 commit

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are three regression fixes, two recent (generic power domains,
    suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
    one more machine having problems with Windows 8 compatibility, a minor
    cpufreq driver fix (cpufreq-dt) and a fixup for new callback
    definitions (generic power domains).

    Specifics:

    - Fix a crash in the suspend-to-idle code path introduced by a recent
    commit that forgot to check a pointer against NULL before
    dereferencing it (Dmitry Eremin-Solenikov).

    - Fix a boot crash on Exynos5 introduced by a recent commit making
    that platform use generic Device Tree bindings for power domains
    which exposed a weakness in the generic power domains framework
    leading to that crash (Ulf Hansson).

    - Fix a crash during system resume on systems where cpufreq depends
    on Operating Performance Points (OPP) for functionality, but
    CONFIG_OPP is not set. This leads the cpufreq driver registration
    to fail, but the resume code attempts to restore the pre-suspend
    cpufreq configuration (which does not exist) nevertheless and
    crashes. From Geert Uytterhoeven.

    - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
    problems if it is reported as Windows 8 compatible to the BIOS
    (Adam Lee).

    - Fix swapped arguments in an error message in the cpufreq-dt driver
    (Abhilash Kesavan).

    - Fix up the prototypes of new callbacks in struct generic_pm_domain
    to make them more useful. Users of those callbacks will be added
    in 3.19 and it's better for them to be based on the correct struct
    definition in mainline from the start. From Ulf Hansson and Kevin
    Hilman"

    * tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / Domains: Fix initial default state of the need_restore flag
    PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
    PM / Domains: Change prototype for the attach and detach callbacks
    cpufreq: Avoid crash in resume on SMP without OPP
    cpufreq: cpufreq-dt: Fix arguments in clock failure error message
    ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546

    Linus Torvalds
     

14 Nov, 2014

2 commits

  • Commit 69361eef9056 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
    but failed to update the comments for print_tainted(). So, update the
    comments.

    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Pull audit fixes from Paul Moore:
    "After he sent the initial audit pull request for 3.18, Eric asked me
    to take over the management of the audit tree, hence this pull request
    to fix a couple of problems with audit.

    As you can see below, the changes are minimal: adding some whitespace
    to a string so userspace parses it correctly, and fixing a problem
    with audit's usage of fsnotify that was causing audit watch rules to
    be lost. Neither of these patches was very controversial on the
    mailing lists and they fix real problems; getting them into 3.18 would
    be a good thing"

    * 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
    audit: keep inode pinned
    audit: AUDIT_FEATURE_CHANGE message format missing delimiting space

    Linus Torvalds
     

12 Nov, 2014

1 commit

  • Audit rules disappear when an inode they watch is evicted from the cache.
    This is likely not what we want.

    The guilty commit is "fsnotify: allow marks to not pin inodes in core",
    which didn't take into account that audit_tree adds watches with a zero
    mask.

    Adding any mask should fix this.

    Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org # 2.6.36+
    Signed-off-by: Paul Moore

    Miklos Szeredi
     

11 Nov, 2014

2 commits

  • If the read loop in tracing_buffers_splice_read() keeps failing due to
    memory allocation failures without reading even a single page then this
    function will keep busy looping.

    Remove the risk for that by exiting the function if memory allocation
    failures are seen.

    Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in

    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     
  • On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
    lockup:

    # trace-cmd record -e raw_syscalls:* -F false
    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
    ...
    Call Trace:
    [] ? __wake_up_common+0x90/0x90
    [] wait_on_pipe+0x35/0x40
    [] tracing_buffers_splice_read+0x2e3/0x3c0
    [] ? tracing_stats_read+0x2a0/0x2a0
    [] ? _raw_spin_unlock+0x2b/0x40
    [] ? do_read_fault+0x21b/0x290
    [] ? handle_mm_fault+0x2ba/0xbd0
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] ? trace_buffer_lock_reserve+0x22/0x60
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] do_splice_to+0x6d/0x90
    [] SyS_splice+0x7c1/0x800
    [] tracesys_phase2+0xd3/0xd8

    The problem is this: tracing_buffers_splice_read() calls
    ring_buffer_wait() to wait for data in the ring buffers. The buffers
    are not empty so ring_buffer_wait() returns immediately. But
    tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
    meaning it only wants to read a full page. When the full page is not
    available, tracing_buffers_splice_read() tries to wait again with
    ring_buffer_wait(), which again returns immediately, and so on.

    Fix this by adding a "full" argument to ring_buffer_wait() which will
    make ring_buffer_wait() wait until the writer has left the reader's
    page, i.e. until full-page reads will succeed.

    Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in

    Cc: stable@vger.kernel.org # 3.16+
    Fixes: b1169cc69ba9 ("tracing: Remove mock up poll wait function")
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
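
The wait condition described in the ring_buffer_wait() commit above can be sketched as a small predicate. This is a user-space model only: struct rb_state and wait_done() are hypothetical names, and the two booleans stand in for state the kernel derives from the ring buffer itself.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one CPU's ring buffer state. */
struct rb_state {
    bool empty;                  /* no data in the buffer at all */
    bool writer_on_reader_page;  /* writer has not left the reader's page */
};

/*
 * Decide whether a waiter may stop waiting.
 * full == false: any data is enough.
 * full == true:  the caller only accepts full pages, so keep waiting
 *                while the writer is still on the reader's page.
 */
static bool wait_done(const struct rb_state *s, bool full)
{
    if (s->empty)
        return false;
    if (full && s->writer_on_reader_page)
        return false;
    return true;
}
```

Before the fix, the wait behaved like `full == false` while the read demanded a full page, so a non-empty buffer with a partially written page made the caller spin between the two.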
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access in this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
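
The out-of-bounds read in the sched_init_numa() commit above is an instance of indexing with `level - 1` when `level` is zero. A minimal sketch of the guard, assuming a hypothetical helper max_numa_distance() and a -1 return value to signal "no NUMA levels" (neither is taken from the kernel):

```c
#include <stddef.h>

/*
 * Return the distance stored for the highest NUMA level, or -1 when
 * there are no levels. Checking first avoids reading distances[-1].
 */
static int max_numa_distance(const int *distances, int level)
{
    if (level <= 0)
        return -1;
    return distances[level - 1];
}
```

Returning early mirrors the approach named in the commit message, which exits sched_init_numa() before the indexed read is reached.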
     

09 Nov, 2014

1 commit


04 Nov, 2014

1 commit

  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do not
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

01 Nov, 2014

8 commits

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are fixes received after my previous pull request plus one that
    has been in the works for quite a while, but its previous version
    caused problems to happen, so it's been deferred till now.

    Fixed are two recent regressions (MFD enumeration and cpufreq-dt),
    ACPI EC regression introduced in 3.17, system suspend error code path
    regression introduced in 3.15, an older bug related to recovery from
    failing resume from hibernation and a cpufreq-dt driver issue related
    to operating performance points.

    Specifics:

    - Fix a crash on r8a7791/koelsch during resume from system suspend
    caused by a recent cpufreq-dt commit (Geert Uytterhoeven).

    - Fix an MFD enumeration problem introduced by a recent commit adding
    ACPI support to the MFD subsystem that exposed a weakness in the
    ACPI core causing ACPI enumeration to be applied to all devices
    associated with one ACPI companion object, although it should be
    used for one of them only (Mika Westerberg).

    - Fix an ACPI EC regression introduced during the 3.17 cycle causing
    some Samsung laptops to misbehave as a result of a workaround
    targeted at some Acer machines. That includes a revert of a commit
    that went too far and a quirk for the Acer machines in question.
    From Lv Zheng.

    - Fix a regression in the system suspend error code path introduced
    during the 3.15 cycle that causes it to fail to take errors from
    asynchronous execution of "late" suspend callbacks into account
    (Imre Deak).

    - Fix a long-standing bug in the hibernation resume error code path
    that fails to roll back everything correctly on "freeze" callback
    errors and leaves some devices in a "suspended" state causing more
    breakage to happen subsequently (Imre Deak).

    - Make the cpufreq-dt driver disable operating performance points
    that are not supported by the VR connected to the CPU voltage plane
    with acceptable tolerance instead of constantly failing voltage
    scaling later on (Lucas Stach)"

    * tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
    Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
    cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
    PM / Sleep: fix recovery during resuming from hibernation
    PM / Sleep: fix async suspend_late/freeze_late error handling
    ACPI: Use ACPI companion to match only the first physical device
    cpufreq: cpufreq-dt: disable unsupported OPPs

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A bit has accumulated, but it's been a week or so since my last batch
    of post-merge-window fixes, so...

    1) Missing module license in netfilter reject module, from Pablo.
    Lots of people ran into this.

    2) Off by one in mac80211 baserate calculation, from Karl Beldan.

    3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
    op, which broke use of it with bonding. From Ian Morgan.

    4) Checking of skb_gso_segment()'s return value was not all
    encompassing, it can return an SKB pointer, a pointer error, or
    NULL. Fix from Florian Westphal.

    This is crummy, and longer term will be fixed to just return error
    pointers or a real SKB.

    6) Encapsulation offloads not being handled by
    skb_gso_transport_seglen(). From Florian Westphal.

    7) Fix deadlock in TIPC stack, from Ying Xue.

    8) Fix performance regression from using rhashtable for netlink
    sockets. The problem was the synchronize_net() invoked for every
    socket destroy. From Thomas Graf.

    9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
    on NET. From Alexei Starovoitov.

    10) In qdisc_create(), use the correct interface to allocate
    ->cpu_bstats, otherwise the u64_stats_sync member isn't
    initialized properly. From Sabrina Dubroca.

    11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.

    12) nf_tables_newchain() was erroneously expecting error pointers from
    netdev_alloc_pcpu_stats(). It only returns a valid pointer or
    NULL. From Sabrina Dubroca.

    13) Fix use-after-free in _decode_session6(), from Li RongQing.

    14) When we set the TX flow hash on a socket, we mistakenly do so
    before we've nailed down the final source port. Move the setting
    deeper to fix this. From Sathya Perla.

    15) NAPI budget accounting in amd-xgbe driver was counting descriptors
    instead of full packets, fix from Thomas Lendacky.

    16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
    Zhang.

    17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
    Mehrtens.

    18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
    that something that ends up being vmalloc memory can't be passed
    to the crypto hash routines via scatter-gather lists. From Eric
    Dumazet.

    19) Fix regression in promiscuous mode enabling in cdc-ether, from
    Olivier Blin.

    20) Bucket eviction and frag entry killing can race with each other,
    causing an unlink of the object from the wrong list. Fix from
    Nikolay Aleksandrov.

    21) Missing initialization of spinlock in cxgb4 driver, from Anish
    Bhatt.

    22) Do not cache ipv4 routing failures, otherwise if the sysctl for
    forwarding is subsequently enabled this won't be seen. From
    Nicolas Cavallari"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
    drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
    drivers: net: cpsw: Fix broken loop condition in switch mode
    net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
    stmmac: pci: set default of the filter bins
    net: smc91x: Fix gpios for device tree based booting
    mpls: Allow mpls_gso to be built as module
    mpls: Fix mpls_gso handler.
    r8152: stop submitting intr for -EPROTO
    netfilter: nft_reject_bridge: restrict reject to prerouting and input
    netfilter: nft_reject_bridge: don't use IP stack to reject traffic
    netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
    netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
    netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
    drivers/net: macvtap and tun depend on INET
    drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
    drivers/net: Disable UFO through virtio
    net: skb_fclone_busy() needs to detect orphaned skb
    gre: Use inner mac length when computing tunnel length
    mlx4: Avoid leaking steering rules on flow creation error flow
    net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various scheduler fixes all over the place: three SCHED_DL fixes,
    three sched/numa fixes, two generic race fixes and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/dl: Fix preemption checks
    sched: Update comments for CLONE_NEWNS
    sched: stop the unbound recursion in preempt_schedule_context()
    sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
    sched/fair: Care divide error in update_task_scan_period()
    sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
    sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
    sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
    sched: Fix race between task_group and sched_task_group

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus on the kernel side:

    - a revert for a newly introduced PMU driver which isn't complete yet
    and where we ran out of time with fixes (to be tried again in
    v3.19) - this makes up for a large chunk of the diffstat.

    - compilation warning fixes

    - a printk message fix

    - event_idx usage fixes/cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf probe: Trivial typo fix for --demangle
    perf tools: Fix report -F dso_from for data without branch info
    perf tools: Fix report -F dso_to for data without branch info
    perf tools: Fix report -F symbol_from for data without branch info
    perf tools: Fix report -F symbol_to for data without branch info
    perf tools: Fix report -F mispredict for data without branch info
    perf tools: Fix report -F in_tx for data without branch info
    perf tools: Fix report -F abort for data without branch info
    perf tools: Make CPUINFO_PROC an array to support different kernel versions
    perf callchain: Use global caching provided by libunwind
    perf/x86/intel: Revert incomplete and undocumented Broadwell client support
    perf/x86: Fix compile warnings for intel_uncore
    perf: Fix typos in sample code in the perf_event.h header
    perf: Fix and clean up initialization of pmu::event_idx
    perf: Fix bogus kernel printk
    perf diff: Add missing hists__init() call at tool start

    Linus Torvalds
     
  • Pull futex fixes from Ingo Molnar:
    "This contains two futex fixes: one fixes a race condition, the other
    clarifies shared/private futex comments"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Fix a race condition between REQUEUE_PI and task death
    futex: Mention key referencing differences between shared and private futexes

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "The tree contains two RCU fixes and a compiler quirk comment fix"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Make rcu_barrier() understand about missing rcuo kthreads
    compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
    rcu: More on deadlock between CPU hotplug and expedited grace periods

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "As you requested in the rc2 release mail the timer department serves
    you a few real bug fixes:

    - Fix the probe logic of the architected arm/arm64 timer
    - Plug a stack info leak in posix-timers
    - Prevent a shift out of bounds issue in the clockevents core"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/ARM64: arch-timer: fix arch_timer_probed logic
    clockevents: Prevent shift out of bounds
    posix-timers: Fix stack info leak in timer_create()

    Linus Torvalds
     
  • …/git/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "ARM has system calls outside the NR_syscalls range, and the generic
    tracing system does not support that; without checks, this can cause
    an oops.

    Rabin Vincent added checks in the return code on syscall events to
    make sure that the system call number is within the range that tracing
    knows about, and if not, simply ignores the system call.

    The system call tracing infrastructure needs to be rewritten to handle
    these cases better, but for now, to keep from oopsing, this patch will
    do"

    * tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/syscalls: Ignore numbers outside NR_syscalls' range

    Linus Torvalds
     

31 Oct, 2014

2 commits

  • ARM has some private syscalls (for example, set_tls(2)) which lie
    outside the range of NR_syscalls. If any of these are called while
    syscall tracing is being performed, out-of-bounds array access will
    occur in the ftrace and perf sys_{enter,exit} handlers.

    # trace-cmd record -e raw_syscalls:* true && trace-cmd report
    ...
    true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
    true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
    true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
    true-653 [000] 384.675988: sys_exit: NR 983045 = 0
    ...

    # trace-cmd record -e syscalls:* true
    [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
    [ 17.289590] pgd = 9e71c000
    [ 17.289696] [aaaaaace] *pgd=00000000
    [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [ 17.290169] Modules linked in:
    [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
    [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
    [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
    [ 17.290866] LR is at syscall_trace_enter+0x124/0x184

    Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

    Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    added the check for less than zero, but it should have also checked
    for greater than NR_syscalls.
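
    As an illustration only, the two-sided range check described above can
    be sketched in standalone C. NR_SYSCALLS_SKETCH and
    syscall_nr_in_range() are hypothetical names made up for this sketch;
    the real NR_syscalls value is per-architecture and the kernel's actual
    handlers are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical table size; the kernel's NR_syscalls differs per architecture. */
#define NR_SYSCALLS_SKETCH 400

/*
 * Return true only when nr can safely index a table with
 * NR_SYSCALLS_SKETCH entries. Checking only "nr < 0" would let
 * large private syscall numbers through.
 */
static bool syscall_nr_in_range(long nr)
{
    return nr >= 0 && nr < NR_SYSCALLS_SKETCH;
}
```

    With such a check, a number like 983045 from the trace above would be
    ignored rather than used as an array index.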

    Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in

    Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    Cc: stable@vger.kernel.org # 2.6.33+
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     
  • Add a space between subj= and feature= fields to make them parsable.

    Signed-off-by: Richard Guy Briggs
    Cc: stable@vger.kernel.org
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

30 Oct, 2014

3 commits

  • …/paulmck/linux-rcu into core/urgent

    Pull two RCU fixes from Paul E. McKenney:

    " - Complete the work of commit dd56af42bd82 (rcu: Eliminate deadlock
    between CPU hotplug and expedited grace periods), which was
    intended to allow synchronize_sched_expedited() to be safely
    used when holding locks acquired by CPU-hotplug notifiers.
    This commit makes the put_online_cpus() avoid the deadlock
    instead of just handling the get_online_cpus().

    - Complete the work of commit 35ce7f29a44a (rcu: Create rcuo
    kthreads only for onlined CPUs), which was intended to allow
    RCU to avoid allocating unneeded kthreads on systems where the
    firmware says that there are more CPUs than are really present.
    This commit makes rcu_barrier() aware of the mismatch, so that
    it doesn't hang waiting for non-existent CPUs. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
    Found this in the message log on an s390 system:

    BUG kmalloc-192 (Not tainted): Poison overwritten
    Disabling lock debugging due to kernel taint
    INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
    INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
    __slab_alloc.isra.47.constprop.56+0x5f6/0x658
    kmem_cache_alloc_trace+0x106/0x408
    call_usermodehelper_setup+0x70/0x128
    call_usermodehelper+0x62/0x90
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
    __slab_free+0x94/0x560
    kfree+0x364/0x3e0
    call_usermodehelper_exec+0x110/0x1b8
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc

    There is a use-after-free bug on the subprocess_info structure allocated
    by the user mode helper. In case do_execve() returns with an error,
    ____call_usermodehelper() stores the error code to sub_info->retval, but
    sub_info can already have been freed.

    Regarding UMH_NO_WAIT, the sub_info structure can be freed by
    __call_usermodehelper() before the worker thread returns from
    do_execve(), allowing memory corruption when do_execve() failed after
    exec_mmap() is called.

    Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
    call_usermodehelper_exec() to continue which then frees sub_info.

    To fix this race the code needs to make sure that the call to
    call_usermodehelper_freeinfo() is always done after the last store to
    sub_info->retval.

    Signed-off-by: Martin Schwidefsky
    Reviewed-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Following up on the ARM testing of gcov, it turns out gcov on ARM64 works
    fine as well. The only change needed is adding ARM64 to the Kconfig depends.

    Tested with qemu and mach-virt

    Signed-off-by: Riku Voipio
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

29 Oct, 2014

2 commits

  • …it/rostedt/linux-trace

    Pull ftrace trampoline accounting fixes from Steven Rostedt:
    "Adding the new code for 3.19, I discovered a couple of minor bugs with
    the accounting of the ftrace_ops trampoline logic.

    One was that the old hash was not updated before calling the modify
    code for an ftrace_ops. The second bug was what let the first bug go
    unnoticed, as the update would check the current hash for all
    ftrace_ops (where it should only check the old hash for modified
    ones). This let things work when only one ftrace_ops was registered
    to a function, but could break if more than one was registered
    depending on the order of the look ups.

    The worst thing that can happen if this bug triggers is that the
    ftrace self checks would find an anomaly and shut itself down"

    * tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
    ftrace: Set ops->old_hash on modifying what an ops hooks to

    Linus Torvalds
     
  • Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
    avoids creating rcuo kthreads for CPUs that never come online. This
    fixes a bug in many instances of firmware: rather than lying about their
    age, these systems lie about the number of CPUs that they have.
    Before commit 35ce7f29a44a, this could result in huge numbers of useless
    rcuo kthreads being created.

    It appears that experience indicates that I should have told the
    people suffering from this problem to fix their broken firmware, but
    I instead produced what turned out to be a partial fix. The missing
    piece supplied by this commit makes sure that rcu_barrier() knows not to
    post callbacks for no-CBs CPUs that have not yet come online, because
    otherwise rcu_barrier() will hang on systems having firmware that lies
    about the number of CPUs.

    It is tempting to simply have rcu_barrier() refuse to post a callback on
    any no-CBs CPU that does not have an rcuo kthread. This unfortunately
    does not work because rcu_barrier() is required to wait for all pending
    callbacks. It is therefore required to wait even for those callbacks
    that cannot possibly be invoked. Even if doing so hangs the system.

    Given that posting a callback to a no-CBs CPU that does not yet have an
    rcuo kthread can hang rcu_barrier(), it is tempting to report an error
    in this case. Unfortunately, this will result in false positives at
    boot time, when it is perfectly legal to post callbacks to the boot CPU
    before the scheduler has started, in other words, before it is legal
    to invoke rcu_barrier().

    So this commit instead has rcu_barrier() avoid posting callbacks to
    CPUs having neither rcuo kthread nor pending callbacks, and has it
    complain bitterly if it finds CPUs having no rcuo kthread but some
    pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
    kthread but pending callbacks, as noted earlier, it has no choice but
    to hang indefinitely.
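
    The three cases described above can be summarized in a small
    standalone C sketch. The enum and function names are hypothetical and
    only model the per-CPU decision for no-CBs CPUs; they are not the
    kernel's data structures or code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the per-CPU decision for a no-CBs CPU. */
enum barrier_action {
    BARRIER_POST,      /* rcuo kthread exists: post the barrier callback */
    BARRIER_SKIP,      /* no kthread and nothing pending: nothing to wait for */
    BARRIER_WARN_POST  /* no kthread but callbacks pending: complain, then wait */
};

static enum barrier_action barrier_action_for_cpu(bool has_rcuo_kthread,
                                                  bool has_pending_cbs)
{
    if (has_rcuo_kthread)
        return BARRIER_POST;
    if (!has_pending_cbs)
        return BARRIER_SKIP;
    /* Callbacks exist but nothing will ever invoke them. */
    return BARRIER_WARN_POST;
}
```

    Only the last case can hang, and it is the one the commit message says
    is reported loudly.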

    Reported-by: Yanko Kaneti
    Reported-by: Jay Vosburgh
    Reported-by: Meelis Roos
    Reported-by: Eric B Munson
    Signed-off-by: Paul E. McKenney
    Tested-by: Eric B Munson
    Tested-by: Jay Vosburgh
    Tested-by: Yanko Kaneti
    Tested-by: Kevin Fenzi
    Tested-by: Meelis Roos

    Paul E. McKenney
     

28 Oct, 2014

8 commits

  • Andy reported that the current state of event_idx is rather confused.
    So remove all but the x86_pmu implementation and change the default to
    return 0 (the safe option).
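
    A minimal standalone sketch of the "optional hook with a safe default"
    pattern described above. All struct and function names here are
    hypothetical, and the value returned by event_idx_hw() is an arbitrary
    stand-in for a driver-specific implementation.

```c
#include <assert.h>
#include <stddef.h>

struct event_sketch {
    int hw_idx;
};

struct pmu_sketch {
    /* Optional hook; NULL means "use the default". */
    int (*event_idx)(const struct event_sketch *ev);
};

/* Default: always report 0, the safe option. */
static int event_idx_default(const struct event_sketch *ev)
{
    (void)ev;
    return 0;
}

/* Stand-in for the one remaining driver-specific implementation. */
static int event_idx_hw(const struct event_sketch *ev)
{
    return ev->hw_idx + 1;
}

/* Fill in the default when the driver did not provide a hook. */
static void pmu_init_sketch(struct pmu_sketch *pmu)
{
    if (pmu->event_idx == NULL)
        pmu->event_idx = event_idx_default;
}
```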

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Cody P Schafer
    Cc: Heiko Carstens
    Cc: Hendrik Brueckner
    Cc: Himangi Saraogi
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: sukadev@linux.vnet.ibm.com
    Cc: Thomas Huth
    Cc: Vince Weaver
    Cc: linux390@de.ibm.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • 1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower-priority task. But we must always
    preempt a task of another class.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl) several times and even fixed
    this function, but only noticed this now... I suppose some performance
    tests may work better after this.
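
    To illustrate why a policy check and a priority check can disagree for
    a boosted task, here is a standalone C sketch. The struct, the helper
    names and the constants are assumptions made for this example (a
    deadline policy value of 6 and "prio below 0 means deadline priority");
    it is not the scheduler's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only. */
#define SCHED_NORMAL_SKETCH   0
#define SCHED_DEADLINE_SKETCH 6
#define MAX_DL_PRIO_SKETCH    0

struct task_sketch {
    int policy; /* unchanged by priority inheritance */
    int prio;   /* changed when the task is boosted */
};

/* Policy-based check: misses a task boosted into the deadline range. */
static bool has_dl_policy(const struct task_sketch *t)
{
    return t->policy == SCHED_DEADLINE_SKETCH;
}

/* Priority-based check: also covers boosted tasks. */
static bool has_dl_prio(const struct task_sketch *t)
{
    return t->prio < MAX_DL_PRIO_SKETCH;
}
```

    A task with a normal policy that has been boosted to a deadline
    priority fails the first check but passes the second.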

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • preempt_schedule_context() does preempt_enable_notrace() at the end
    and this can call the same function again; exception_exit() is heavy
    and it is quite possible that need-resched is true again.

    1. Change this code to dec preempt_count() and check need_resched()
    by hand.

    2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
    the enable/disable dance around __schedule(). But in this case
    we need to move into sched/core.c.

    3. Cosmetic, but x86 forgets to declare this function. This doesn't
    really matter because it is only called by asm helpers; still, it
    makes sense to add the declaration into asm/preempt.h to match
    preempt_schedule().

    Reported-by: Sasha Levin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces the problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix a race in task_scan_min() (whether it triggers depends on
    compiler behaviour).
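
    Both parts of the fix can be sketched in standalone C: refuse a zero
    value on the write side, and read the shared value once into a local
    before testing and dividing. The names and the window constant are
    hypothetical, and volatile merely stands in for the kernel's read-once
    accessor in this user-space sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant, chosen for the example. */
#define MAX_SCAN_WINDOW_SKETCH 2560u

/* Shared tunable that another context may rewrite at any time. */
static volatile unsigned int scan_size_mb = 256;

/* Write side: a value of zero is rejected, so it can never be a divisor. */
static bool set_scan_size_mb(unsigned int value)
{
    if (value < 1)
        return false;
    scan_size_mb = value;
    return true;
}

/* Read side: one load into a local, then test and divide the local only. */
static unsigned int scan_windows(void)
{
    unsigned int size = scan_size_mb;
    unsigned int windows = 1;

    if (size < MAX_SCAN_WINDOW_SKETCH)
        windows = MAX_SCAN_WINDOW_SKETCH / size;
    return windows;
}
```

    Without the local copy, a compiler is free to load the tunable twice,
    once for the comparison and once for the division, and a concurrent
    write between the two loads could change the divisor.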

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • While offlining a node by hot-removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When a page fault occurs and the page is allocated from node 1,
    task_struct->numa_faults_buffer_memory[] of node 1 is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

    o numa_faults_buffer_memory[]           o numa_faults_locality[]
               NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
    ----------------------------------      ----------------------
     node 0  |      0     |     0     |      remote |      0     |
     node 1  |      0     |     1     |      local  |      1     |
    ----------------------------------      ----------------------

    2. Node 1 is offlined by hot-removing memory.

    3. When a page fault occurs, fault_types[] is calculated in
    task_numa_placement() from p->numa_faults_buffer_memory[] of all
    online nodes. But node 1 was offlined in step 2, so fault_types[]
    is calculated from p->numa_faults_buffer_memory[] of node 0 only,
    and both entries of fault_types[] are set to 0.

    4. The values (0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1, so the following division is
    performed:

    static void update_task_scan_period(struct task_struct *p,
    unsigned long shared, unsigned long private){
    ...
    ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are 0, so a divide error occurs
    here.

    The divide error is a rare case because the trigger is a node going
    offline. This patch always increments the denominator to avoid the
    divide error.
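
    The effect of incrementing the denominator can be checked with a small
    standalone C sketch. DIV_ROUND_UP is written out here as the usual
    round-up division, and NUMA_PERIOD_SLOTS_SKETCH is given a value of 10
    purely as an assumption for the example; private_ratio() is a
    hypothetical helper, not the kernel function.

```c
#include <assert.h>

/* Usual round-up integer division. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed value for illustration only. */
#define NUMA_PERIOD_SLOTS_SKETCH 10UL

/*
 * Ratio of private faults to all faults, in slots. The "+ 1" keeps
 * the denominator non-zero even when both counters are 0.
 */
static unsigned long private_ratio(unsigned long private_faults,
                                   unsigned long shared_faults)
{
    return DIV_ROUND_UP(private_faults * NUMA_PERIOD_SLOTS_SKETCH,
                        private_faults + shared_faults + 1);
}
```

    With both counters at 0, the expression now evaluates to 0 instead of
    dividing by zero, at the cost of a slightly smaller ratio in all other
    cases.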

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If the curr task is exiting, this may cause a use-after-free:

    task_numa_compare()                 do_exit()
    ...                                 current->flags |= PF_EXITING;
    ...                                 release_task()
    ...                                 ~~delayed_put_task_struct()~~
    ...                                 schedule()
    rcu_read_lock()                     ...
    cur = ACCESS_ONCE(dst_rq->curr)     ...
    ...                                 rq->curr = next;
    ...                                 context_switch()
    ...                                 finish_task_switch()
    ...                                 put_task_struct()
    ...                                 __put_task_struct()
    ...                                 free_task_struct()
    task_numa_assign()                  ...
    get_task_struct()                   ...

    As noted by Oleg:

    <
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may race against an enqueue_dl_entity()
    called from rt_mutex_setprio(). In his own words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task()
    throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.
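
    A rough standalone sketch of the bail-out idea. The text above does not
    spell out the exact condition, so this sketch assumes the timer simply
    returns when the entity is no longer throttled; the struct and function
    names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct dl_entity_sketch {
    bool throttled; /* set when the replenishment timer is armed */
    bool on_rq;     /* already enqueued on the runqueue? */
};

/*
 * Timer handler sketch: returns true if it enqueued the entity,
 * false if it bailed out because another path got there first.
 */
static bool replenish_timer_sketch(struct dl_entity_sketch *se)
{
    if (!se->throttled)
        return false; /* assumed bail-out: someone else reset the flag */

    se->throttled = false;
    se->on_rq = true;
    return true;
}
```

    In the racy sequence quoted above, the flag has already been cleared and
    the entity is already enqueued, so the handler returns without trying to
    enqueue it a second time.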

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli