01 Nov, 2014

8 commits

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are fixes received after my previous pull request plus one that
    has been in the works for quite a while, but its previous version
    caused problems to happen, so it's been deferred till now.

    Fixed are two recent regressions (MFD enumeration and cpufreq-dt), an
    ACPI EC regression introduced in 3.17, a system suspend error code path
    regression introduced in 3.15, an older bug related to recovery from
    failing resume from hibernation, and a cpufreq-dt driver issue related
    to operation performance points.

    Specifics:

    - Fix a crash on r8a7791/koelsch during resume from system suspend
    caused by a recent cpufreq-dt commit (Geert Uytterhoeven).

    - Fix an MFD enumeration problem introduced by a recent commit adding
    ACPI support to the MFD subsystem that exposed a weakness in the
    ACPI core causing ACPI enumeration to be applied to all devices
    associated with one ACPI companion object, although it should be
    used for one of them only (Mika Westerberg).

    - Fix an ACPI EC regression introduced during the 3.17 cycle causing
    some Samsung laptops to misbehave as a result of a workaround
    targeted at some Acer machines. That includes a revert of a commit
    that went too far and a quirk for the Acer machines in question.
    From Lv Zheng.

    - Fix a regression in the system suspend error code path introduced
    during the 3.15 cycle that causes it to fail to take errors from
    asynchronous execution of "late" suspend callbacks into account
    (Imre Deak).

    - Fix a long-standing bug in the hibernation resume error code path
    that fails to roll back everything correctly on "freeze" callback
    errors and leaves some devices in a "suspended" state causing more
    breakage to happen subsequently (Imre Deak).

    - Make the cpufreq-dt driver disable operation performance points
    that are not supported by the voltage regulator (VR) connected to
    the CPU voltage plane with acceptable tolerance, instead of
    constantly failing voltage scaling later on (Lucas Stach)"

    * tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
    Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
    cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
    PM / Sleep: fix recovery during resuming from hibernation
    PM / Sleep: fix async suspend_late/freeze_late error handling
    ACPI: Use ACPI companion to match only the first physical device
    cpufreq: cpufreq-dt: disable unsupported OPPs

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A bit has accumulated, but it's been a week or so since my last batch
    of post-merge-window fixes, so...

    1) Missing module license in netfilter reject module, from Pablo.
    Lots of people ran into this.

    2) Off by one in mac80211 baserate calculation, from Karl Beldan.

    3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
    op, which broke use of it with bonding. From Ian Morgan.

    4) Checking of skb_gso_segment()'s return value was not all
    encompassing; it can return an SKB pointer, an error pointer, or
    NULL. Fix from Florian Westphal.

    This is crummy, and longer term will be fixed to just return error
    pointers or a real SKB.

    6) Encapsulation offloads not being handled by
    skb_gso_transport_seglen(). From Florian Westphal.

    7) Fix deadlock in TIPC stack, from Ying Xue.

    8) Fix performance regression from using rhashtable for netlink
    sockets. The problem was the synchronize_net() invoked for every
    socket destroy. From Thomas Graf.

    9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
    on NET. From Alexei Starovoitov.

    10) In qdisc_create(), use the correct interface to allocate
    ->cpu_bstats, otherwise the u64_stats_sync member isn't
    initialized properly. From Sabrina Dubroca.

    11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.

    12) nf_tables_newchain() was erroneously expecting error pointers from
    netdev_alloc_pcpu_stats(). It only returns a valid pointer or
    NULL. From Sabrina Dubroca.

    13) Fix use-after-free in _decode_session6(), from Li RongQing.

    14) When we set the TX flow hash on a socket, we mistakenly do so
    before we've nailed down the final source port. Move the setting
    deeper to fix this. From Sathya Perla.

    15) NAPI budget accounting in amd-xgbe driver was counting descriptors
    instead of full packets, fix from Thomas Lendacky.

    16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
    Zhang.

    17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
    Mehrtens.

    18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
    that something that ends up being vmalloc memory can't be passed
    to the crypto hash routines via scatter-gather lists. From Eric
    Dumazet.

    19) Fix regression in promiscuous mode enabling in cdc-ether, from
    Olivier Blin.

    20) Bucket eviction and frag entry killing can race with each other,
    causing an unlink of the object from the wrong list. Fix from
    Nikolay Aleksandrov.

    21) Missing initialization of spinlock in cxgb4 driver, from Anish
    Bhatt.

    22) Do not cache ipv4 routing failures, otherwise if the sysctl for
    forwarding is subsequently enabled this won't be seen. From
    Nicolas Cavallari"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
    drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
    drivers: net: cpsw: Fix broken loop condition in switch mode
    net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
    stmmac: pci: set default of the filter bins
    net: smc91x: Fix gpios for device tree based booting
    mpls: Allow mpls_gso to be built as module
    mpls: Fix mpls_gso handler.
    r8152: stop submitting intr for -EPROTO
    netfilter: nft_reject_bridge: restrict reject to prerouting and input
    netfilter: nft_reject_bridge: don't use IP stack to reject traffic
    netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
    netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
    netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
    drivers/net: macvtap and tun depend on INET
    drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
    drivers/net: Disable UFO through virtio
    net: skb_fclone_busy() needs to detect orphaned skb
    gre: Use inner mac length when computing tunnel length
    mlx4: Avoid leaking steering rules on flow creation error flow
    net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various scheduler fixes all over the place: three SCHED_DL fixes,
    three sched/numa fixes, two generic race fixes and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/dl: Fix preemption checks
    sched: Update comments for CLONE_NEWNS
    sched: stop the unbound recursion in preempt_schedule_context()
    sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
    sched/fair: Care divide error in update_task_scan_period()
    sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
    sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
    sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
    sched: Fix race between task_group and sched_task_group

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus on the kernel side:

    - a revert for a newly introduced PMU driver which isn't complete yet
    and where we ran out of time with fixes (to be tried again in
    v3.19) - this makes up for a large chunk of the diffstat.

    - compilation warning fixes

    - a printk message fix

    - event_idx usage fixes/cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf probe: Trivial typo fix for --demangle
    perf tools: Fix report -F dso_from for data without branch info
    perf tools: Fix report -F dso_to for data without branch info
    perf tools: Fix report -F symbol_from for data without branch info
    perf tools: Fix report -F symbol_to for data without branch info
    perf tools: Fix report -F mispredict for data without branch info
    perf tools: Fix report -F in_tx for data without branch info
    perf tools: Fix report -F abort for data without branch info
    perf tools: Make CPUINFO_PROC an array to support different kernel versions
    perf callchain: Use global caching provided by libunwind
    perf/x86/intel: Revert incomplete and undocumented Broadwell client support
    perf/x86: Fix compile warnings for intel_uncore
    perf: Fix typos in sample code in the perf_event.h header
    perf: Fix and clean up initialization of pmu::event_idx
    perf: Fix bogus kernel printk
    perf diff: Add missing hists__init() call at tool start

    Linus Torvalds
     
  • Pull futex fixes from Ingo Molnar:
    "This contains two futex fixes: one fixes a race condition, the other
    clarifies shared/private futex comments"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Fix a race condition between REQUEUE_PI and task death
    futex: Mention key referencing differences between shared and private futexes

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "The tree contains two RCU fixes and a compiler quirk comment fix"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Make rcu_barrier() understand about missing rcuo kthreads
    compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
    rcu: More on deadlock between CPU hotplug and expedited grace periods

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "As you requested in the rc2 release mail the timer department serves
    you a few real bug fixes:

    - Fix the probe logic of the architected arm/arm64 timer
    - Plug a stack info leak in posix-timers
    - Prevent a shift out of bounds issue in the clockevents core"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/ARM64: arch-timer: fix arch_timer_probed logic
    clockevents: Prevent shift out of bounds
    posix-timers: Fix stack info leak in timer_create()

    Linus Torvalds
     
  • …/git/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "ARM has system calls outside the NR_syscalls range, and the generic
    tracing system does not support that; without checks, it can cause
    an oops to be reported.

    Rabin Vincent added checks in the return code on syscall events to
    make sure that the system call number is within the range that tracing
    knows about, and if not, simply ignores the system call.

    The system call tracing infrastructure needs to be rewritten to handle
    these cases better, but for now, to keep from oopsing, this patch will
    do"

    * tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/syscalls: Ignore numbers outside NR_syscalls' range

    Linus Torvalds
     

31 Oct, 2014

1 commit

  • ARM has some private syscalls (for example, set_tls(2)) which lie
    outside the range of NR_syscalls. If any of these are called while
    syscall tracing is being performed, out-of-bounds array access will
    occur in the ftrace and perf sys_{enter,exit} handlers.

    # trace-cmd record -e raw_syscalls:* true && trace-cmd report
    ...
    true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
    true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
    true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
    true-653 [000] 384.675988: sys_exit: NR 983045 = 0
    ...

    # trace-cmd record -e syscalls:* true
    [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
    [ 17.289590] pgd = 9e71c000
    [ 17.289696] [aaaaaace] *pgd=00000000
    [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [ 17.290169] Modules linked in:
    [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
    [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
    [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
    [ 17.290866] LR is at syscall_trace_enter+0x124/0x184

    Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

    Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    added the check for less than zero, but it should have also checked
    for greater than NR_syscalls.
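    The added range check is simple; as a standalone C sketch (the
    NR_syscalls value here is a hypothetical stand-in for the arch's real
    syscall table size):

```c
#include <stdbool.h>

#define NR_syscalls 400  /* hypothetical stand-in for the arch's table size */

/* A syscall number may be traced only if it indexes the syscall table;
 * ARM private syscalls such as 983045 fall outside and are now ignored. */
static bool syscall_nr_in_range(long nr)
{
    return nr >= 0 && nr < NR_syscalls;
}
```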

    Link: http://lkml.kernel.org/r/1414620418-29472-1-git-send-email-rabin@rab.in

    Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    Cc: stable@vger.kernel.org # 2.6.33+
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     

30 Oct, 2014

3 commits

  • …/paulmck/linux-rcu into core/urgent

    Pull two RCU fixes from Paul E. McKenney:

    " - Complete the work of commit dd56af42bd82 (rcu: Eliminate deadlock
    between CPU hotplug and expedited grace periods), which was
    intended to allow synchronize_sched_expedited() to be safely
    used when holding locks acquired by CPU-hotplug notifiers.
    This commit makes the put_online_cpus() avoid the deadlock
    instead of just handling the get_online_cpus().

    - Complete the work of commit 35ce7f29a44a (rcu: Create rcuo
    kthreads only for onlined CPUs), which was intended to allow
    RCU to avoid allocating unneeded kthreads on systems where the
    firmware says that there are more CPUs than are really present.
    This commit makes rcu_barrier() aware of the mismatch, so that
    it doesn't hang waiting for non-existent CPUs. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Found this in the message log on a s390 system:

    BUG kmalloc-192 (Not tainted): Poison overwritten
    Disabling lock debugging due to kernel taint
    INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
    INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
    __slab_alloc.isra.47.constprop.56+0x5f6/0x658
    kmem_cache_alloc_trace+0x106/0x408
    call_usermodehelper_setup+0x70/0x128
    call_usermodehelper+0x62/0x90
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
    __slab_free+0x94/0x560
    kfree+0x364/0x3e0
    call_usermodehelper_exec+0x110/0x1b8
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc

    There is a use-after-free bug on the subprocess_info structure allocated
    by the user mode helper. In case do_execve() returns with an error
    ____call_usermodehelper() stores the error code to sub_info->retval, but
    sub_info can already have been freed.

    Regarding UMH_NO_WAIT, the sub_info structure can be freed by
    __call_usermodehelper() before the worker thread returns from
    do_execve(), allowing memory corruption when do_execve() failed after
    exec_mmap() is called.

    Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
    call_usermodehelper_exec() to continue which then frees sub_info.

    To fix this race the code needs to make sure that the call to
    call_usermodehelper_freeinfo() is always done after the last store to
    sub_info->retval.
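    The ordering invariant can be illustrated with a standalone userspace
    sketch; the struct and helper names mirror the kernel's but are
    stand-ins, not the real implementation:

```c
#include <stdlib.h>

struct subprocess_info { int retval; };

static int retval_seen_at_free;  /* what was visible when the struct was freed */

/* Stand-in for call_usermodehelper_freeinfo(): after this returns,
 * sub_info must not be touched again. */
static void freeinfo(struct subprocess_info *sub_info)
{
    retval_seen_at_free = sub_info->retval;
    free(sub_info);
}

/* Fixed ordering: the last store to retval happens strictly before the
 * structure can be freed, closing the use-after-free window. */
static int report_exec_error(int err)
{
    struct subprocess_info *sub_info = calloc(1, sizeof(*sub_info));

    sub_info->retval = err;  /* store the exec error first... */
    freeinfo(sub_info);      /* ...then release the structure */
    return retval_seen_at_free;
}
```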

    Signed-off-by: Martin Schwidefsky
    Reviewed-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Following up the ARM testing of gcov, it turns out gcov on ARM64 works
    fine as well. The only change needed is adding ARM64 to the Kconfig
    depends.

    Tested with qemu and mach-virt

    Signed-off-by: Riku Voipio
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

29 Oct, 2014

2 commits

  • …it/rostedt/linux-trace

    Pull ftrace trampoline accounting fixes from Steven Rostedt:
    "Adding the new code for 3.19, I discovered a couple of minor bugs with
    the accounting of the ftrace_ops trampoline logic.

    One was that the old hash was not updated before calling the modify
    code for an ftrace_ops. The second bug was what let the first bug go
    unnoticed, as the update would check the current hash for all
    ftrace_ops (where it should only check the old hash for modified
    ones). This let things work when only one ftrace_ops was registered
    to a function, but could break if more than one was registered
    depending on the order of the look ups.

    The worst thing that can happen if this bug triggers is that the
    ftrace self checks would find an anomaly and shut ftrace down"

    * tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
    ftrace: Set ops->old_hash on modifying what an ops hooks to

    Linus Torvalds
     
  • Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
    avoids creating rcuo kthreads for CPUs that never come online. This
    fixes a bug in many instances of firmware: Instead of lying about their
    age, these systems instead lie about the number of CPUs that they have.
    Before commit 35ce7f29a44a, this could result in huge numbers of useless
    rcuo kthreads being created.

    It appears that experience indicates that I should have told the
    people suffering from this problem to fix their broken firmware, but
    I instead produced what turned out to be a partial fix. The missing
    piece supplied by this commit makes sure that rcu_barrier() knows not to
    post callbacks for no-CBs CPUs that have not yet come online, because
    otherwise rcu_barrier() will hang on systems having firmware that lies
    about the number of CPUs.

    It is tempting to simply have rcu_barrier() refuse to post a callback on
    any no-CBs CPU that does not have an rcuo kthread. This unfortunately
    does not work because rcu_barrier() is required to wait for all pending
    callbacks. It is therefore required to wait even for those callbacks
    that cannot possibly be invoked, even if doing so hangs the system.

    Given that posting a callback to a no-CBs CPU that does not yet have an
    rcuo kthread can hang rcu_barrier(), it is tempting to report an error
    in this case. Unfortunately, this will result in false positives at
    boot time, when it is perfectly legal to post callbacks to the boot CPU
    before the scheduler has started, in other words, before it is legal
    to invoke rcu_barrier().

    So this commit instead has rcu_barrier() avoid posting callbacks to
    CPUs having neither rcuo kthread nor pending callbacks, and has it
    complain bitterly if it finds CPUs having no rcuo kthread but some
    pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
    kthread but pending callbacks, as noted earlier, it has no choice but
    to hang indefinitely.
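    The per-CPU decision described above can be sketched as a standalone
    predicate (the real code inspects per-CPU rcu_data state; these names
    are illustrative):

```c
#include <stdbool.h>

/* Should rcu_barrier() post a barrier callback for this no-CBs CPU? */
static bool nocb_cpu_needs_barrier(bool has_rcuo_kthread, bool has_pending_cbs)
{
    if (has_rcuo_kthread)
        return true;    /* normal case: the kthread will invoke it */
    if (has_pending_cbs)
        return true;    /* broken state: complain bitterly, must still wait */
    return false;       /* CPU never came online: safe to skip */
}
```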

    Reported-by: Yanko Kaneti
    Reported-by: Jay Vosburgh
    Reported-by: Meelis Roos
    Reported-by: Eric B Munson
    Signed-off-by: Paul E. McKenney
    Tested-by: Eric B Munson
    Tested-by: Jay Vosburgh
    Tested-by: Yanko Kaneti
    Tested-by: Kevin Fenzi
    Tested-by: Meelis Roos

    Paul E. McKenney
     

28 Oct, 2014

11 commits

  • Andy reported that the current state of event_idx is rather confused.
    So remove all but the x86_pmu implementation and change the default to
    return 0 (the safe option).

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Cody P Schafer
    Cc: Cody P Schafer
    Cc: Heiko Carstens
    Cc: Hendrik Brueckner
    Cc: Himangi Saraogi
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: sukadev@linux.vnet.ibm.com
    Cc: Thomas Huth
    Cc: Vince Weaver
    Cc: linux390@de.ibm.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • 1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower-priority task. But we must always
    preempt a task of other classes.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl()) several times and even
    fixed this function, but only found this now... I suppose some
    performance tests may work better after this.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • preempt_schedule_context() does preempt_enable_notrace() at the end
    and this can call the same function again; exception_exit() is heavy
    and it is quite possible that need-resched is true again.

    1. Change this code to dec preempt_count() and check need_resched()
    by hand.

    2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
    the enable/disable dance around __schedule(). But in this case
    we need to move it into sched/core.c.

    3. Cosmetic, but x86 forgets to declare this function. This doesn't
    really matter because it is only called by asm helpers; still, it
    makes sense to add the declaration into asm/preempt.h to match
    preempt_schedule().
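    The do/while structure that replaces the recursion can be sketched in
    standalone C (all names here are illustrative stand-ins, not the
    kernel's):

```c
#include <stdbool.h>

static int resched_pending;   /* pretend need-resched fires this many times */
static int schedule_calls;

static bool need_resched(void) { return resched_pending > 0; }

static void fake_schedule(void) { resched_pending--; schedule_calls++; }

/* Re-check need_resched() by hand in a loop, so a need-resched set while
 * scheduling does not re-enter this function recursively. */
static int preempt_schedule_loop(int pending)
{
    resched_pending = pending;
    schedule_calls = 0;
    do {
        /* PREEMPT_ACTIVE would be set around the real __schedule() here */
        fake_schedule();
    } while (need_resched());
    return schedule_calls;
}
```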

    Reported-by: Sasha Levin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix a race in task_scan_min() (it depends on compiler behaviour).
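    The fix wires a minimum of 1 through the sysctl table, so a zero scan
    size can never reach the divide. A standalone sketch of the resulting
    behaviour (the helper name is illustrative):

```c
/* Reject zero at the sysctl boundary so task_scan_min() can never see a
 * zero numa_balancing_scan_size_mb (the kernel enforces this via the
 * sysctl table's minimum; this helper is an illustrative stand-in). */
static int set_scan_size_mb(int requested, int *scan_size_mb)
{
    if (requested < 1)
        return -1;              /* the kernel returns -EINVAL here */
    *scan_size_mb = requested;
    return 0;
}
```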

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • While offling node by hot removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When page fault occurs and page is allocated from node 1,
    task_struct->numa_faults_buffer_memory[] of node 1 is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

    o numa_faults_buffer_memory[]           o numa_faults_locality[]
             NR_NUMA_HINT_FAULT_TYPES
            |      0     |     1     |
     ----------------------------------      ----------------------
     node 0 |      0     |     0     |       remote |     0      |
     node 1 |      0     |     1     |       local  |     1      |
     ----------------------------------      ----------------------

    2. node 1 is offlined by hot removing memory.

    3. When page fault occurs, fault_types[] is calculated by using
    p->numa_faults_buffer_memory[] of all online nodes in
    task_numa_placement(). But node 1 was offline by step 2. So
    the fault_types[] is calculated by using only
    p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
    are set to 0.

    4. The values (0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1. So the following division is
    calculated.

    static void update_task_scan_period(struct task_struct *p,
                                        unsigned long shared,
                                        unsigned long private)
    {
            ...
            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are set to 0, so the divide error
    occurs here.

    The divide error is a rare case because the trigger is node offline.
    This patch always increments the denominator to avoid the divide error.
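    The fixed expression can be checked standalone, assuming the kernel's
    NUMA_PERIOD_SLOTS of 10 (the helper name is illustrative):

```c
#define NUMA_PERIOD_SLOTS 10  /* as in kernel/sched/fair.c */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Fixed form: the denominator is always at least 1, so private == 0 and
 * shared == 0 no longer divides by zero (illustrative sketch of the fix). */
static unsigned long slot_ratio(unsigned long private, unsigned long shared)
{
    return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared + 1);
}
```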

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If curr task is exiting this may be a reason of use-after-free:

    task_numa_compare()                      do_exit()
        ...                                      current->flags |= PF_EXITING;
        ...                                      release_task()
        ...                                          ~~delayed_put_task_struct()~~
        ...                                      schedule()
        rcu_read_lock()                          ...
        cur = ACCESS_ONCE(dst_rq->curr)          ...
        ...                                      rq->curr = next;
        ...                                      context_switch()
        ...                                      finish_task_switch()
        ...                                      put_task_struct()
        ...                                          __put_task_struct()
        ...                                          free_task_struct()
        task_numa_assign()                       ...
            get_task_struct()                    ...

    As noted by Oleg:

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may experience a race condition against an
    enqueue_dl_entity() called from rt_mutex_setprio(). With his own
    words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task()
    while throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.
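    The bail-out can be sketched as a standalone decision helper (the enum
    and names are illustrative; the real code checks dl_task(p) inside the
    timer handler):

```c
#include <stdbool.h>

enum dl_timer_action { DL_TIMER_BAIL, DL_TIMER_NOTHING, DL_TIMER_ENQUEUE };

/* What the replenishment timer should do when it finally fires. */
static enum dl_timer_action dl_timer_decide(bool is_dl_task, bool on_dl_rq)
{
    if (!is_dl_task)
        return DL_TIMER_BAIL;     /* raced with deboost: bail, no BUG_ON() */
    if (on_dl_rq)
        return DL_TIMER_NOTHING;  /* already enqueued: nothing to do */
    return DL_TIMER_ENQUEUE;      /* safe to replenish and enqueue */
}
```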

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • The race may happen when somebody is changing the task_group of a forking
    task. The child's cgroup is the same as the parent's after
    dup_task_struct() (there is just memory copying). Also, cfs_rq and rt_rq
    are the same as the parent's.

    But if the parent changes its task_group before cgroup_post_fork() is
    called, we do not reflect this situation on the child. The child's cfs_rq
    and rt_rq remain the same, while the child's task_group changes in
    cgroup_post_fork().

    To fix this we introduce a fork() method, which calls sched_move_task()
    directly. This function changes sched_task_group appropriately (also its
    logic has no problem with freshly created tasks, so we don't need to
    introduce anything special; we can just use it).

    Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.
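    The two-option split described above can be sketched as a Kconfig
    fragment (the symbol names CONFIG_BPF and CONFIG_BPF_SYSCALL are from
    the text; the prompt, help text, and exact dependencies here are an
    assumption, not the actual kernel Kconfig contents):

    ```kconfig
    config BPF
            bool
            # Hidden symbol: selected by classic socket filters and by
            # anything else that needs the eBPF interpreter in the kernel.

    config BPF_SYSCALL
            bool "Enable bpf() system call"
            select BPF
            default n
            help
              Enable the bpf() system call so that eBPF programs can be
              loaded from userspace.
    ```

    A hidden symbol has no prompt and is only set via select, which is what
    lets NET-less configs pull in the interpreter without exposing the
    syscall.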

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
    If a device's dev_pm_ops::freeze callback fails during the QUIESCE
    phase, we don't roll back things correctly by calling the thaw and
    complete callbacks. This could leave some devices in a suspended state
    when an error occurs during resume from hibernation.

    Signed-off-by: Imre Deak
    Cc: All applicable
    Signed-off-by: Rafael J. Wysocki

    Imre Deak
     

26 Oct, 2014

2 commits

  • free_pi_state and exit_pi_state_list both clean up futex_pi_state's.
    exit_pi_state_list takes the hb lock first, and most callers of
    free_pi_state do too. requeue_pi doesn't, which means free_pi_state
    can free the pi_state out from under exit_pi_state_list. For example:

    task A                          |  task B
    --------------------------------+-----------------------------------
    exit_pi_state_list              |
      pi_state =                    |
        curr->pi_state_list->next   |
                                    |  futex_requeue(requeue_pi=1)
                                    |    // pi_state is the same as
                                    |    // the one in task A
                                    |    free_pi_state(pi_state)
                                    |      list_del_init(&pi_state->list)
                                    |      kfree(pi_state)
    list_del_init(&pi_state->list)  |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]
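    The ordering fix can be modeled in userspace. In this sketch a pthread
    mutex stands in for the hb lock, and the simplified pi_state and list
    are stand-ins for the kernel structures, not the real ones:

    ```c
    /* Model of the fix: unlink and free the shared state *before*
     * dropping the lock that list walkers take, so a concurrent
     * exit_pi_state_list-style walk can never reach a freed node. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct pi_state {
        struct pi_state *next;
    };

    static pthread_mutex_t hb_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct pi_state *pi_state_list;

    /* Buggy ordering (before the patch) would unlock and then free,
     * leaving a window where a walker grabbing hb_lock still sees the
     * node.  Fixed ordering: free while hb_lock is still held. */
    static void requeue_pi_drop(struct pi_state *ps)
    {
        pthread_mutex_lock(&hb_lock);
        pi_state_list = ps->next;   /* unlink under the lock ...      */
        free(ps);                   /* ... and free before unlocking  */
        pthread_mutex_unlock(&hb_lock);
    }

    int main(void)
    {
        struct pi_state *ps = calloc(1, sizeof(*ps));

        pi_state_list = ps;
        requeue_pi_drop(ps);

        /* Any walker serialized on hb_lock now sees an empty list,
         * never a dangling pointer. */
        pthread_mutex_lock(&hb_lock);
        assert(pi_state_list == NULL);
        pthread_mutex_unlock(&hb_lock);
        printf("ok\n");
        return 0;
    }
    ```

    The same principle applies generally: the lifetime of lock-protected
    heap objects must end inside the critical section that other paths use
    to find them.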

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Oct, 2014

4 commits

  • Andrey reported that on a kernel with UBSan enabled he found:

    UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34

    I guess it should be 1ULL here instead of 1U:
    (!ismax || evt->mult << evt->shift)))

    That's indeed the correct solution because shift might be 32.
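    A minimal userspace illustration of why the operand must be widened
    (this is not the kernel code; the helper name is invented for the
    example):

    ```c
    /* Shifting a 32-bit value by 32 is undefined behaviour in C: the
     * shift count must be smaller than the width of the (promoted)
     * left operand.  Widening to 64 bits first, as the 1ULL fix does,
     * makes shift counts up to 63 well defined. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t widen_shift(uint32_t mult, unsigned int shift)
    {
        /* Safe even when shift == 32, because the left operand is u64. */
        return (uint64_t)mult << shift;
    }

    int main(void)
    {
        uint32_t mult = 5;

        /* Writing "mult << 32" directly on the u32 would be undefined
         * behaviour (shift count >= width of the operand's type). */
        assert(widen_shift(mult, 32) == 0x500000000ULL);
        printf("0x%llx\n", (unsigned long long)widen_shift(mult, 32));
        return 0;
    }
    ```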

    Reported-by: Andrey Ryabinin
    Cc: Peter Zijlstra
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
    If userland creates a timer without specifying sigevent info, we'll
    create one ourselves, using a stack-local variable. In particular,
    we'll use the timer ID as sival_int. But as sigev_value is a union
    containing a pointer and an int, that assignment will only partially
    initialize sigev_value on systems where a pointer is bigger than an
    int. On such systems we'll copy the uninitialized stack bytes from the
    timer_create() call to userland when the timer actually fires and
    we're going to deliver the signal.

    Initialize sigev_value with 0 to plug the stack info leak.

    Found in the PaX patch, written by the PaX Team.
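    The leak and the fix can be demonstrated in userspace with the POSIX
    struct sigevent (the helper function is invented for this sketch; the
    kernel's own stack variable is analogous):

    ```c
    /* sigev_value is a union of an int and a pointer, so assigning only
     * sival_int leaves the bytes beyond the int unwritten on 64-bit.
     * Zeroing the whole structure first, as the fix does, guarantees no
     * stack garbage can leak through the wider union member. */
    #include <assert.h>
    #include <signal.h>
    #include <stddef.h>
    #include <string.h>

    static struct sigevent make_default_sigevent(int timer_id)
    {
        struct sigevent ev;

        memset(&ev, 0, sizeof(ev));   /* the fix: zero everything first */
        ev.sigev_notify = SIGEV_SIGNAL;
        ev.sigev_signo = SIGALRM;
        ev.sigev_value.sival_int = timer_id;
        return ev;
    }

    int main(void)
    {
        struct sigevent ev = make_default_sigevent(42);
        const unsigned char *raw = (const unsigned char *)&ev.sigev_value;

        assert(ev.sigev_value.sival_int == 42);

        /* The int member lives at offset 0 of the union; with the
         * memset, every byte past it is provably zero, regardless of
         * endianness, so nothing leaks to userland. */
        for (size_t i = sizeof(int); i < sizeof(ev.sigev_value); i++)
            assert(raw[i] == 0);
        return 0;
    }
    ```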

    Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
    Signed-off-by: Mathias Krause
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Cc: PaX Team
    Cc: # v2.6.28+
    Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
    Signed-off-by: Thomas Gleixner

    Mathias Krause
     
  • When modifying code, ftrace has several checks to make sure things
    are being done correctly. One of them is to make sure any code it
    modifies is exactly what it expects it to be before it modifies it.
    In order to do so with the new trampoline logic, it must be able
    to find out what trampoline a function is hooked to in order to
    see if the code that hooks to it is what's expected.

    The logic to find the trampoline from a record (the accounting
    descriptor for a hooked function) needs to look only at the "old_hash"
    of an ops that is being modified. The old_hash is the list of functions
    an ops was hooked to before its update; a record would only be pointing
    to an ops that is being modified if it was already hooked before.

    Currently, the code can pick a modified ops based on the new functions
    it will be hooked to, which picks the wrong trampoline and causes the
    check to fail, disabling ftrace.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
    The code that checks for trampolines when modifying function hooks
    tests against a modified ops's "old_hash". But the ops old_hash pointer
    is not updated before the changes are made, making it possible to miss
    the right hash for the callback, which can break ftrace's accounting
    and cause it to disable itself.

    Have the ops set its old_hash before the modifying takes place.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

24 Oct, 2014

2 commits


23 Oct, 2014

2 commits

  • Commit dd56af42bd82 (rcu: Eliminate deadlock between CPU hotplug and
    expedited grace periods) was incomplete. Although it did eliminate
    deadlocks involving synchronize_sched_expedited()'s acquisition of
    cpu_hotplug.lock via get_online_cpus(), it did nothing about the similar
    deadlock involving acquisition of this same lock via put_online_cpus().
    This deadlock became apparent with testing involving hibernation.

    This commit therefore makes put_online_cpus() acquire this lock
    conditionally and increment a new cpu_hotplug.puts_pending field when
    the acquisition fails. cpu_hotplug_begin() then checks for this new
    field being non-zero and applies any pending decrements to
    cpu_hotplug.refcount.
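    The deferral scheme can be modeled in userspace. The names mirror the
    commit, but the types and functions below are simplified stand-ins for
    the kernel's, not its actual code:

    ```c
    /* If the hotplug lock cannot be taken without risking deadlock,
     * record the put in puts_pending and let the begin() path fold it
     * into the refcount later, under the lock. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static struct {
        pthread_mutex_t lock;
        int refcount;
        atomic_int puts_pending;
    } cpu_hotplug = { .lock = PTHREAD_MUTEX_INITIALIZER };

    static void get_online_cpus_model(void)
    {
        pthread_mutex_lock(&cpu_hotplug.lock);
        cpu_hotplug.refcount++;
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    static void put_online_cpus_model(void)
    {
        /* Conditional acquisition: on failure, defer the decrement
         * instead of blocking on the lock. */
        if (pthread_mutex_trylock(&cpu_hotplug.lock) != 0) {
            atomic_fetch_add(&cpu_hotplug.puts_pending, 1);
            return;
        }
        cpu_hotplug.refcount--;
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    static void cpu_hotplug_begin_model(void)
    {
        pthread_mutex_lock(&cpu_hotplug.lock);
        /* Apply deferred puts before judging the refcount. */
        cpu_hotplug.refcount -=
            atomic_exchange(&cpu_hotplug.puts_pending, 0);
        /* ... would now wait for refcount to drop to zero ... */
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    int main(void)
    {
        get_online_cpus_model();

        /* Simulate a put racing with a holder of the lock: trylock
         * fails, so the decrement is recorded in puts_pending. */
        pthread_mutex_lock(&cpu_hotplug.lock);
        put_online_cpus_model();
        assert(atomic_load(&cpu_hotplug.puts_pending) == 1);
        pthread_mutex_unlock(&cpu_hotplug.lock);

        cpu_hotplug_begin_model();
        assert(cpu_hotplug.refcount == 0);
        assert(atomic_load(&cpu_hotplug.puts_pending) == 0);
        return 0;
    }
    ```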

    Reported-by: Jiri Kosina
    Signed-off-by: Paul E. McKenney
    Tested-by: Jiri Kosina
    Tested-by: Borislav Petkov

    Paul E. McKenney
     
  • Clean up the code in process.c after recent changes to get rid of
    unnecessary labels and goto statements.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

22 Oct, 2014

5 commits

    While comparing verifier states for equivalency, the comparison was
    missing a check for uninitialized registers. Add the missing check
    along with a testcase.
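    A toy model of the pruning check: a register that was initialized in
    the cached ("old") state must also be initialized in the current state,
    otherwise the states are not equivalent and the branch cannot be
    pruned. The enum and logic below are simplified stand-ins for the
    kernel verifier's, not its actual code:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    enum reg_type { NOT_INIT, UNKNOWN_VALUE };

    struct reg_state { enum reg_type type; };

    #define MAX_REGS 11

    static bool states_equal(const struct reg_state *old_regs,
                             const struct reg_state *cur_regs)
    {
        for (int i = 0; i < MAX_REGS; i++) {
            if (old_regs[i].type == cur_regs[i].type)
                continue;
            /* A register the old path never wrote may differ freely... */
            if (old_regs[i].type == NOT_INIT)
                continue;
            /* ...but "initialized before, uninitialized now" is the
             * unsafe case the missing check must reject. */
            return false;
        }
        return true;
    }

    int main(void)
    {
        struct reg_state old_regs[MAX_REGS] = {0};
        struct reg_state cur_regs[MAX_REGS] = {0};

        old_regs[1].type = UNKNOWN_VALUE;  /* r1 initialized in old state */
        cur_regs[1].type = NOT_INIT;       /* current path never wrote r1 */
        assert(!states_equal(old_regs, cur_regs));  /* must not prune */

        cur_regs[1].type = UNKNOWN_VALUE;
        assert(states_equal(old_regs, cur_regs));   /* now prunable */
        return 0;
    }
    ```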

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Cc: Hannes Frederic Sowa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
    As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
    while_each_thread()), get rid of the do_each_thread { } while_each_thread()
    construct and replace it with the less error-prone for_each_thread().

    This patch doesn't introduce any user-visible change.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
     
    The PM freezer relies on having all tasks frozen by the time devices
    start getting frozen, so that no task will touch them while they are
    being frozen. But the OOM killer is allowed to kill an already frozen
    task in order to handle an OOM situation. To protect against late
    wakeups, the OOM killer is disabled after all tasks are frozen. This,
    however, still leaves a window open where a killed task didn't manage
    to die by the time freeze_processes() finishes.

    Reduce the race window by checking all tasks after the OOM killer has
    been disabled. This is unfortunately still not completely race-free,
    because oom_killer_disable cannot stop an already ongoing OOM kill, so
    a task might still wake up from the fridge and get killed without
    freeze_processes() noticing. Full synchronization of the OOM killer
    and the freezer is, however, too heavyweight for this highly unlikely
    case.

    Introduce and check an oom_kills counter, which gets incremented early
    when the allocator enters the __alloc_pages_may_oom path, and only
    check all the tasks if the counter changes during the freezing attempt.
    The counter is updated this early to reduce the race window, since the
    allocator has already checked oom_killer_disabled, which is set by the
    PM-freezing code. A false positive will push the PM freezer into a slow
    path, but that is not a big deal.
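    A single-threaded model of the counter check: snapshot oom_kills before
    freezing, and re-scan all tasks only if it changed meanwhile. The names
    follow the commit, but the functions below are simplified stand-ins
    with the actual task scan stubbed out:

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int oom_kills;
    static int rescans;

    /* Incremented early on the __alloc_pages_may_oom path. */
    static void oom_kill_event(void)
    {
        atomic_fetch_add(&oom_kills, 1);
    }

    /* Stub for the full re-check of all tasks. */
    static bool check_frozen_processes(void)
    {
        rescans++;
        return true;
    }

    static bool freeze_processes_model(bool oom_fires)
    {
        int saved = atomic_load(&oom_kills);

        /* ... freeze all tasks, then disable the OOM killer ... */
        if (oom_fires)
            oom_kill_event();   /* a kill racing with the freeze */

        /* Only pay for the full re-check when a kill may have raced. */
        if (atomic_load(&oom_kills) != saved)
            return check_frozen_processes();
        return true;
    }

    int main(void)
    {
        freeze_processes_model(false);
        assert(rescans == 0);   /* common case: no extra scan */

        freeze_processes_model(true);
        assert(rescans == 1);   /* raced kill: tasks re-checked */
        return 0;
    }
    ```

    A false positive (the counter moved but every task died in time) only
    costs one extra scan, matching the "slow path but not a big deal"
    trade-off above.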

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Cc: 3.2+ # 3.2+
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
     
  • __thaw_task() no longer clears frozen flag since commit a3201227f803
    (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE).

    Reviewed-by: Michal Hocko
    Signed-off-by: Cong Wang
    Signed-off-by: Rafael J. Wysocki

    Cong Wang
     
  • Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
    before deferring) OOM killer relies on being able to thaw a frozen task
    to handle OOM situation but a3201227f803 (freezer: make freezing() test
    freeze conditions in effect instead of TIF_FREEZE) has reorganized the
    code and stopped clearing freeze flag in __thaw_task. This means that
    the target task only wakes up and goes into the fridge again because the
    freezing condition hasn't changed for it. This reintroduces the bug
    fixed by f660daac474c6f.

    Fix the issue by checking for the TIF_MEMDIE thread flag in
    freezing_slow_path and excluding the task from freezing completely. If
    a task was already frozen, it will be woken by __thaw_task from the OOM
    killer and get out of the freezer after rechecking freezing().

    Changes since v1
    - put the TIF_MEMDIE check into freezing_slow_path rather than into
    __refrigerator, as per Oleg
    - restore the __thaw_task call in oom_scan_process_thread, because
    oom_kill_process will not wake a task in the fridge, as it is sleeping
    uninterruptibly

    [mhocko@suse.cz: rewrote the changelog]
    Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
    Cc: 3.3+ # 3.3+
    Signed-off-by: Cong Wang
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Signed-off-by: Rafael J. Wysocki

    Cong Wang