02 Jun, 2014

1 commit

  • Pull scheduler fixes from Ingo Molnar:
    "Various fixlets, mostly related to the (root-only) SCHED_DEADLINE
    policy, but also a hotplug bug fix and a fix for a NR_CPUS related
    overallocation bug causing a suspend/resume regression"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix hotplug vs. set_cpus_allowed_ptr()
    sched/cpupri: Replace NR_CPUS arrays
    sched/deadline: Replace NR_CPUS arrays
    sched/deadline: Restrict user params max value to 2^63 ns
    sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE
    sched: Disallow sched_attr::sched_policy < 0
    sched: Make sched_setattr() correctly return -EFBIG

    Linus Torvalds
     

01 Jun, 2014

1 commit

  • Pull core futex/rtmutex fixes from Thomas Gleixner:
    "Three fixlets for long-standing issues in the futex/rtmutex code
    unearthed by Dave Jones' syscall fuzzer:

    - Add missing early deadlock detection checks in the futex code
    - Prevent user space from attaching a futex to kernel threads
    - Make the deadlock detector of rtmutex work again

    Looks large, but is more comments than code change"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rtmutex: Fix deadlock detector for real
    futex: Prevent attaching to kernel threads
    futex: Add another early deadlock detection check

    Linus Torvalds
     

28 May, 2014

3 commits

  • The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
                       top_waiter != task_top_pi_waiter(task)))
        goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top-priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). This is an optimization, but it also prevents a wrong result:
    without it, another waiter arriving after the detection but before the
    task has undone the damage could observe the situation, detect the
    deadlock and return -EDEADLOCK, even though that other task is not in
    a deadlock situation.
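    The exit decision above can be modeled in a few lines. This is a
    hedged toy sketch, not the kernel code: the names walk_continues(),
    has_waiters, is_top_waiter and detect are illustrative stand-ins for
    the conditions in the chain walk; the point is only that the walk
    must keep going past a non-top waiter when deadlock detection is
    requested.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /*
     * Toy model of the chain-walk exit decision (not kernel code).
     * The old logic dropped out whenever the current waiter was not
     * the task's top pi waiter; the fixed logic keeps walking in that
     * case when full deadlock detection is requested.
     */
    static bool walk_continues(bool has_waiters, bool is_top_waiter, bool detect)
    {
        if (!has_waiters)
            return false;        /* nothing to propagate: always stop */
        if (!is_top_waiter && !detect)
            return false;        /* old early exit, kept for !detect */
        return true;             /* keep walking the lock chain */
    }

    int main(void)
    {
        /* not the top waiter: the old code always stopped here ... */
        assert(!walk_continues(true, false, false));
        /* ... the fix continues when deadlock detection is enabled */
        assert(walk_continues(true, false, true));
        /* no waiters at all still terminates the walk */
        assert(!walk_continues(false, false, true));
        return 0;
    }
    ```
    
    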

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Pull two powerpc fixes from Ben Herrenschmidt:
    "Here's a pair of powerpc fixes for 3.15 which are also going to
    stable.

    One's a fix for building with newer binutils (the problem currently
    only affects the BookE kernels, but the affected macro might come back
    into use on BookS platforms at any time). Unfortunately, the binutils
    maintainer made a backward-incompatible change to a construct that we
    use, so we have to add a Makefile check.

    The other one is a fix for CPUs getting stuck in kexec when running
    single threaded. Since we routinely use kexec on power (including in
    our newer bootloaders), I deemed that important enough"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode
    powerpc: Fix 64 bit builds with binutils 2.24

    Linus Torvalds
     
  • If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
    (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
    get the following messages during boot:

    [ 0.089866] POWER8 performance monitor hardware support registered
    [ 0.089985] power8-pmu: PMAO restore workaround active.
    [ 5.095419] Processor 1 is stuck.
    [ 10.097933] Processor 2 is stuck.
    [ 15.100480] Processor 3 is stuck.
    [ 20.102982] Processor 4 is stuck.
    [ 25.105489] Processor 5 is stuck.
    [ 30.108005] Processor 6 is stuck.
    [ 35.110518] Processor 7 is stuck.
    [ 40.113369] Processor 9 is stuck.
    [ 45.115879] Processor 10 is stuck.
    [ 50.118389] Processor 11 is stuck.
    [ 55.120904] Processor 12 is stuck.
    [ 60.123425] Processor 13 is stuck.
    [ 65.125970] Processor 14 is stuck.
    [ 70.128495] Processor 15 is stuck.
    [ 75.131316] Processor 17 is stuck.

    Note that only the sibling threads are stuck, while the primary threads (0, 8,
    16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
    that kexec tries to wake up (bring online) the sibling threads of all the cores,
    before performing kexec:

    [ 9464.131231] Starting new kernel
    [ 9464.148507] kexec: Waking offline cpu 1.
    [ 9464.148552] kexec: Waking offline cpu 2.
    [ 9464.148600] kexec: Waking offline cpu 3.
    [ 9464.148636] kexec: Waking offline cpu 4.
    [ 9464.148671] kexec: Waking offline cpu 5.
    [ 9464.148708] kexec: Waking offline cpu 6.
    [ 9464.148743] kexec: Waking offline cpu 7.
    [ 9464.148779] kexec: Waking offline cpu 9.
    [ 9464.148815] kexec: Waking offline cpu 10.
    [ 9464.148851] kexec: Waking offline cpu 11.
    [ 9464.148887] kexec: Waking offline cpu 12.
    [ 9464.148922] kexec: Waking offline cpu 13.
    [ 9464.148958] kexec: Waking offline cpu 14.
    [ 9464.148994] kexec: Waking offline cpu 15.
    [ 9464.149030] kexec: Waking offline cpu 17.

    Instrumenting this piece of code revealed that the cpu_up() operation actually
    fails with -EBUSY. Thus, only the primary threads of all the cores are online
    during kexec, and hence this is a sure-shot recipe for disaster, as explained
    in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec),
    as well as in the comment above wake_offline_cpus().

    It turns out that cpu_up() was returning -EBUSY because the variable
    'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
    by migrate_to_reboot_cpu() inside kernel_kexec().

    Now, migrate_to_reboot_cpu() was originally written with the assumption that
    any further code will not need to perform CPU hotplug, since we are anyway in
    the reboot path. However, kexec is clearly not such a case, since we depend on
    onlining CPUs, at least on powerpc.

    So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
    kexec path, to fix this regression in kexec on powerpc.

    Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
    can catch such issues more easily in the future.
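    The failure and the fix boil down to a simple ordering problem, which
    the following hedged toy model illustrates. None of this is kernel
    code: toy_cpu_up() and toy_migrate_to_reboot_cpu() are stand-ins for
    cpu_up() and migrate_to_reboot_cpu(), and -16 stands in for -EBUSY.

    ```c
    #include <assert.h>

    /* Toy model (not kernel code): a flag gates CPU onlining, the way
     * cpu_hotplug_disabled gates cpu_up() in the kernel. */
    static int cpu_hotplug_disabled;

    static int toy_cpu_up(void)
    {
        return cpu_hotplug_disabled ? -16 /* -EBUSY */ : 0;
    }

    static void toy_migrate_to_reboot_cpu(void)
    {
        cpu_hotplug_disabled = 1;    /* the side effect that broke kexec */
    }

    int main(void)
    {
        /* Before the fix: the wake-offline-cpus path in kexec hits -EBUSY. */
        toy_migrate_to_reboot_cpu();
        assert(toy_cpu_up() == -16);

        /* The fix: re-enable hotplug right after migrating, then online CPUs. */
        cpu_hotplug_disabled = 0;
        assert(toy_cpu_up() == 0);
        return 0;
    }
    ```
    
    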

    Fixes: c97102ba963 (kexec: migrate to reboot cpu)
    Cc: stable@vger.kernel.org
    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Srivatsa S. Bhat
     

24 May, 2014

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "The biggest commit is an irqtime accounting loop latency fix, the rest
    are misc fixes all over the place: deadline scheduling, docs, numa,
    balancer and a bad to-idle latency fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Initialize newidle balance stats in sd_numa_init()
    sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
    sched: Skip double execution of pick_next_task_fair()
    sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
    sched/deadline: Fix memory leak
    sched/deadline: Fix sched_yield() behavior
    sched: Sanitize irq accounting madness
    sched/docbook: Fix 'make htmldocs' warnings caused by missing description

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "The biggest changes are fixes for races that kept triggering Trinity
    crashes, plus liblockdep build fixes and smaller misc fixes.

    The liblockdep bits in perf/urgent are a pull mistake - they should
    have been in locking/urgent - but by the time I noticed other commits
    were added and testing was done :-/ Sorry about that"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
    perf: Prevent false warning in perf_swevent_add
    perf: Limit perf_event_attr::sample_period to 63 bits
    tools/liblockdep: Remove all build files when doing make clean
    tools/liblockdep: Build liblockdep from tools/Makefile
    perf/x86/intel: Fix Silvermont's event constraints
    perf: Fix perf_event_init_context()
    perf: Fix race in removing an event

    Linus Torvalds
     

22 May, 2014

7 commits

  • Lai found that:

    WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
    ...
    migration_cpu_stop+0x1d/0x22

    was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
    always a sub-set of cpu_online_mask.

    This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

    So set active and online at the same time to avoid this particular
    problem.

    Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Gautham R. Shenoy
    Cc: Linus Torvalds
    Cc: Michael wang
    Cc: Paul Gortmaker
    Cc: Rafael J. Wysocki
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.
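    The shape of that change can be sketched as follows. This is a
    hedged user-space stand-in (calloc/free in place of kcalloc/kfree,
    and toy_cpupri is an invented name): embedding an NR_CPUS-sized array
    makes the containing structure's size scale with the compile-time CPU
    limit and forces high-order allocations, while a pointer sized for
    nr_cpu_ids keeps the structure small.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Illustrative sketch only: instead of "int cpu_to_pri[NR_CPUS];"
     * embedded in the structure, keep a pointer and allocate for the
     * CPUs that can actually exist. */
    struct toy_cpupri {
        int *cpu_to_pri;   /* was: int cpu_to_pri[NR_CPUS]; */
    };

    static int toy_cpupri_init(struct toy_cpupri *cp, int nr_cpu_ids)
    {
        cp->cpu_to_pri = calloc(nr_cpu_ids, sizeof(*cp->cpu_to_pri));
        return cp->cpu_to_pri ? 0 : -12; /* -ENOMEM */
    }

    int main(void)
    {
        struct toy_cpupri cp;

        /* The structure itself stays pointer-sized no matter how large
         * the compile-time CPU limit is. */
        assert(sizeof(cp) == sizeof(int *));
        assert(toy_cpupri_init(&cp, 8) == 0);
        cp.cpu_to_pri[7] = 42;
        assert(cp.cpu_to_pri[7] == 42);
        free(cp.cpu_to_pri);
        return 0;
    }
    ```
    
    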

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Acked-by: Juri Lelli
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
    with certain parameters (e.g, a runtime of something near 2^64 ns)
    can cause a system freeze for some amount of time.

    The problem is that in the interface we have

    u64 sched_runtime;

    while internally we need to have a signed runtime (to cope with
    budget overruns)

    s64 runtime;

    When we set up a new dl_entity we copy the first value into the
    second. The cast yields a negative value when sched_runtime is too
    big, and this causes the scheduler to go crazy right from the start.

    Moreover, considering how we deal with deadline wraparound

    (s64)(a - b) < 0

    we also have to restrict acceptable values for sched_{deadline,period}.

    This patch fixes the problem by checking that user parameters are
    always below 2^63 ns (still large enough for everyone).

    It also rewrites the other conditions that we check, since in
    __checkparam_dl we don't have to deal with deadline wraparound,
    and the current checks erroneously fail when the difference between
    values is too big.
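    The cast problem is easy to demonstrate in isolation. This is a
    hedged stand-alone illustration, not the kernel's actual
    __checkparam_dl(): dl_param_valid() is an invented name for the
    "below 2^63 ns" bound the commit describes.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative check only: a u64 runtime with bit 63 set turns
     * negative when copied into the scheduler's signed s64 runtime,
     * so reject such values up front. */
    static bool dl_param_valid(uint64_t val)
    {
        return val < (1ULL << 63);   /* must survive the cast to s64 */
    }

    int main(void)
    {
        uint64_t huge = UINT64_MAX - 42;     /* a runtime "near 2^64 ns" */
        int64_t  as_signed = (int64_t)huge;

        assert(as_signed < 0);               /* the u64 -> s64 cast goes negative */
        assert(!dl_param_valid(huge));       /* so the parameter is rejected */
        assert(dl_param_valid(1000000ULL));  /* a sane 1ms runtime is fine */
        return 0;
    }
    ```
    
    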

    Reported-by: Michael Kerrisk
    Suggested-by: Peter Zijlstra
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Dario Faggioli
    Cc: Dave Jones
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
    The way we read POSIX, one should only call sched_getparam() when
    sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

    Given that we currently return sched_param::sched_priority=0 for all
    others, extend the same behaviour to SCHED_DEADLINE.

    Requested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Dario Faggioli
    Cc: linux-man
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The scheduler uses policy=-1 to preserve the current policy state in
    order to implement sys_sched_setparam(). This got exposed to user
    space by accident through sys_sched_setattr(); cure this.

    Reported-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The documented[1] behavior of sched_attr() in the proposed man page text is:

    sched_attr::size must be set to the size of the structure, as in
    sizeof(struct sched_attr). If the provided structure is smaller
    than the kernel structure, any additional fields are assumed to be
    '0'. If the provided structure is larger than the kernel structure,
    the kernel verifies that all additional fields are '0'; if not, the
    syscall fails with -E2BIG.

    As currently implemented, sched_copy_attr() returns -EFBIG for
    this case, but the logic in sys_sched_setattr() converts that
    error to -EFAULT. This patch fixes the behavior.

    [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
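    The "larger structure must have an all-zero tail" rule can be
    sketched as follows. This is a hedged toy version, not the kernel's
    sched_copy_attr(): toy_copy_attr() is an invented name, and the key
    point is that the -E2BIG result must reach the caller unconverted.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    #define E2BIG 7   /* Linux errno value for "argument list too long" */

    /* Illustrative sketch only: if user space passes a structure larger
     * than the kernel's, every byte beyond the kernel size must be zero,
     * otherwise fail with -E2BIG (returned as-is, not as -EFAULT). */
    static int toy_copy_attr(const unsigned char *uattr, size_t usize, size_t ksize)
    {
        size_t i;

        for (i = ksize; i < usize; i++)
            if (uattr[i] != 0)
                return -E2BIG;
        return 0;
    }

    int main(void)
    {
        unsigned char buf[16];

        memset(buf, 0, sizeof(buf));
        assert(toy_copy_attr(buf, 16, 8) == 0);      /* all-zero tail: ok */
        buf[12] = 1;
        assert(toy_copy_attr(buf, 16, 8) == -E2BIG); /* non-zero tail: reject */
        assert(toy_copy_attr(buf, 8, 8) == 0);       /* exact size: always fine */
        return 0;
    }
    ```
    
    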

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     

21 May, 2014

1 commit

  • Pull more cgroup fixes from Tejun Heo:
    "Three more patches to fix cgroup_freezer breakage due to the recent
    cgroup internal locking changes - an operation cgroup_freezer was
    using now requires a sleepable context, and cgroup_freezer was
    invoking it while holding a spin lock. cgroup_freezer was using an
    overly elaborate hierarchical locking scheme.

    While it's possible to convert the hierarchical spinlocks directly to
    mutexes, this patch simplifies the overall locking so that it uses a
    global mutex. This has the added benefit of avoiding iterating
    potentially huge number of tasks under a spinlock. While the patch is
    on the larger side in the devel cycle, the changes made are mostly
    straight-forward and the locking logic is a lot simpler afterwards"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix rcu_read_lock() leak in update_if_frozen()
    cgroup_freezer: replace freezer->lock with freezer_mutex
    cgroup: introduce task_css_is_root()

    Linus Torvalds
     

20 May, 2014

1 commit


19 May, 2014

5 commits

    Alexander noticed that we use RCU iteration on rb->event_list but do
    not use list_{add,del}_rcu() to add/remove entries to that list, nor
    do we observe proper grace periods when re-using the entries.

    Merge ring_buffer_detach() into ring_buffer_attach() such that
    attaching to the NULL buffer is detaching.

    Furthermore, ensure that between any 'detach' and 'attach' of the same
    event we observe the required grace period, but only when strictly
    required. In effect this means that only ioctl(.request =
    PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
    normal initial attach and final detach will not be delayed.

    This patch should, I think, do the right thing under all
    circumstances: the 'normal' cases should never see the extra grace
    period, but the following two cases will:

    1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output(); one could think an unconditional
    synchronize_rcu() would be sufficient, however...

    2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:

        ring_buffer_attach(.rb = NULL);

    followed by the ioctl() doing:

        ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

    In order to make case 2 work we use Paul's latest
    cond_synchronize_rcu() call.
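    The semantics of the conditional wait can be captured with a small
    hedged model. This is not the RCU implementation: a plain counter
    stands in for the grace-period state cookie that
    get_state_synchronize_rcu() returns, and toy_cond_synchronize()
    only blocks when no grace period has completed since the cookie was
    taken, which is exactly why only the SET_OUTPUT path pays the wait.

    ```c
    #include <assert.h>

    /* Toy model (not the RCU implementation). */
    static unsigned long gp_counter;  /* completed grace periods */
    static int waits;                 /* how often we actually blocked */

    static unsigned long toy_get_state(void)  { return gp_counter; }
    static void toy_synchronize(void)         { gp_counter++; waits++; }

    static void toy_cond_synchronize(unsigned long cookie)
    {
        if (cookie == gp_counter)   /* nothing elapsed since cookie: wait */
            toy_synchronize();
    }

    int main(void)
    {
        /* detach takes a cookie; re-attach right away must pay the wait */
        unsigned long cookie = toy_get_state();
        toy_cond_synchronize(cookie);
        assert(waits == 1);

        /* detach, then a grace period elapses for unrelated reasons
         * before the attach: the conditional wait costs nothing extra */
        cookie = toy_get_state();
        toy_synchronize();            /* waits becomes 2 */
        toy_cond_synchronize(cookie); /* no additional wait */
        assert(waits == 2);
        return 0;
    }
    ```
    
    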

    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: Andi Kleen
    Cc: "Paul E. McKenney"
    Cc: Ingo Molnar
    Cc: Frederic Weisbecker
    Cc: Mike Galbraith
    Reported-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The perf cpu offline callback takes down all cpu context
    events and releases swhash->swevent_hlist.

    This could race with a task context software event being
    scheduled on this cpu via perf_swevent_add, while the cpu hotplug
    code has already cleaned up the event's data.

    The race happens in the gap between the cpu notifier code
    and the cpu being actually taken down. Note that only cpu
    ctx events are terminated in the perf cpu hotplug code.

    It's easily reproduced with:
    $ perf record -e faults perf bench sched pipe

    while putting one of the cpus offline:
    # echo 0 > /sys/devices/system/cpu/cpu1/online

    Console emits following warning:
    WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
    Modules linked in:
    CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256
    Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
    0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
    0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
    ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] warn_slowpath_common+0x8c/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] perf_swevent_add+0x18d/0x1a0
    [] event_sched_in.isra.75+0x9e/0x1f0
    [] group_sched_in+0x6a/0x1f0
    [] ? sched_clock_local+0x25/0xa0
    [] ctx_sched_in+0x1f6/0x450
    [] perf_event_sched_in+0x6b/0xa0
    [] perf_event_context_sched_in+0x7b/0xc0
    [] __perf_event_task_sched_in+0x43e/0x460
    [] ? put_lock_stats.isra.18+0xe/0x30
    [] finish_task_switch+0xb8/0x100
    [] __schedule+0x30e/0xad0
    [] ? pipe_read+0x3e2/0x560
    [] ? preempt_schedule_irq+0x3e/0x70
    [] ? preempt_schedule_irq+0x3e/0x70
    [] preempt_schedule_irq+0x44/0x70
    [] retint_kernel+0x20/0x30
    [] ? lockdep_sys_exit+0x1a/0x90
    [] lockdep_sys_exit_thunk+0x35/0x67
    [] ? sysret_check+0x5/0x56

    Fix this by tracking the cpu hotplug state and emitting the WARN
    only if the current cpu is initialized properly.

    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@vger.kernel.org
    Reported-by: Fengguang Wu
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Thomas Gleixner

    Jiri Olsa
     
    Vince reported that using a large sample_period (one with bit 63 set)
    results in wreckage since, while the sample_period is fundamentally
    unsigned (negative periods don't make sense), the way we implement
    things very much relies on signed logic.

    So limit sample_period to 63 bits to avoid tripping over this.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
    Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue by
    looking at the pi_state owner. If that is the current task, then we
    can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.
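    That condition can be expressed compactly. The sketch below is a
    hedged user-space model, not the kernel's futex code:
    toy_early_deadlock_check(), toy_task and toy_pi_state are invented
    names, while the FUTEX_WAITERS/FUTEX_TID_MASK values and the
    0x80000000 word match the commit text.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define FUTEX_WAITERS  0x80000000u
    #define FUTEX_TID_MASK 0x3fffffffu
    #define EDEADLK 35   /* Linux errno value */

    struct toy_task     { int pid; };
    struct toy_pi_state { struct toy_task *owner; };

    /* Illustrative sketch only: no owner TID in the futex word, waiters
     * bit set, and the attached pi_state is owned by the current task:
     * we would be locking against ourselves, so fail early. */
    static int toy_early_deadlock_check(uint32_t uval,
                                        struct toy_pi_state *pi_state,
                                        struct toy_task *curr)
    {
        if (!(uval & FUTEX_TID_MASK) && (uval & FUTEX_WAITERS) &&
            pi_state && pi_state->owner == curr)
            return -EDEADLK;
        return 0;
    }

    int main(void)
    {
        struct toy_task me = { .pid = 1000 };
        struct toy_task other = { .pid = 2000 };
        struct toy_pi_state ps = { .owner = &me };

        /* word says "no owner, waiters set" yet we own the pi_state */
        assert(toy_early_deadlock_check(0x80000000u, &ps, &me) == -EDEADLK);
        /* someone else owns it: no recursive lock for us */
        assert(toy_early_deadlock_check(0x80000000u, &ps, &other) == 0);
        /* a real owner TID in the word: the normal paths handle it */
        assert(toy_early_deadlock_check(1000u | FUTEX_WAITERS, &ps, &me) == 0);
        return 0;
    }
    ```
    
    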

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

13 May, 2014

5 commits

  • While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
    replace freezer->lock with freezer_mutex") introduced a bug in
    update_if_frozen() where it returns with rcu_read_lock() held. Fix it
    by adding rcu_read_unlock() before returning.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     
    After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
    to css_set_rwsem"), css task iterators require a sleepable context as
    they may block on css_set_rwsem. I missed that cgroup_freezer was
    iterating tasks under the IRQ-safe spinlock freezer->lock. This leads
    to errors like the following on freezer state reads and transitions.

    BUG: sleeping function called from invalid context at /work
    /os/work/kernel/locking/rwsem.c:20
    in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
    5 locks held by bash/462:
    #0: (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
    #2: (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
    #3: (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
    #4: (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
    Preemption disabled at:[] console_unlock+0x1e4/0x460

    CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
    ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
    0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] __might_sleep+0x162/0x260
    [] down_read+0x24/0x60
    [] css_task_iter_start+0x27/0x70
    [] freezer_apply_state+0x5d/0x130
    [] freezer_write+0xf6/0x1e0
    [] cgroup_file_write+0xd8/0x230
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xb6/0x1c0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    freezer->lock used to be used in hot paths but that time is long gone
    and there's no reason for the lock to be an IRQ-safe spinlock or even
    per-cgroup. In fact, given that a cgroup may contain a large number
    of tasks, it's not a good idea to iterate over them while holding an
    IRQ-safe spinlock.

    Let's simplify locking by replacing the per-cgroup freezer->lock with
    a global freezer_mutex. This also simplifies the comments explaining
    the intricacies of policy inheritance and the locking around it, as
    the states are now protected by a common mutex.

    The conversion is mostly straight-forward. The following points are
    worth mentioning.

    * freezer_css_online() no longer needs double locking.

    * freezer_attach() now performs propagation simply while holding
    freezer_mutex. update_if_frozen() race no longer exists and the
    comment is removed.

    * freezer_fork() now tests whether the task is in root cgroup using
    the new task_css_is_root() without doing rcu_read_lock/unlock(). If
    not, it grabs freezer_mutex and performs the operation.

    * freezer_read() and freezer_change_state() grab freezer_mutex across
    the whole operation and pin the css while iterating so that each
    descendant processing happens in sleepable context.

    Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
    Determining the css of a task usually requires the RCU read lock, as
    that's the only thing which keeps the returned css accessible until
    its reference is acquired; however, testing whether a task belongs to
    the root can be performed without dereferencing the returned css, by
    comparing the returned pointer against the root one in init_css_set[],
    which never changes.

    Implement task_css_is_root() which can be invoked in any context.
    This will be used by the scheduled cgroup_freezer change.

    v2: cgroup no longer supports modular controllers. No need to export
    init_css_set. Pointed out by Li.
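    The pointer-comparison idea is simple enough to show directly. This
    is a hedged toy sketch, not the cgroup implementation: toy_css_set
    and toy_task are invented stand-ins, and the point is that root
    membership is decided by comparing against the never-changing root
    set without dereferencing anything.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Illustrative sketch only: the root css_set never changes, so the
     * root test is a bare pointer comparison and never dereferences the
     * (possibly about-to-be-freed) css_set, hence no RCU read lock. */
    struct toy_css_set { int id; };

    static struct toy_css_set init_css_set = { .id = 0 };

    struct toy_task { struct toy_css_set *cset; };

    static bool toy_task_css_is_root(struct toy_task *t)
    {
        return t->cset == &init_css_set;   /* compare, don't dereference */
    }

    int main(void)
    {
        struct toy_css_set child = { .id = 1 };
        struct toy_task a = { .cset = &init_css_set };
        struct toy_task b = { .cset = &child };

        assert(toy_task_css_is_root(&a));
        assert(!toy_task_css_is_root(&b));
        return 0;
    }
    ```
    
    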

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Pull workqueue fixes from Tejun Heo:
    "Fixes for two bugs in workqueue.

    One is exiting with internal mutex held in a failure path of
    wq_update_unbound_numa(). The other is a subtle and unlikely
    use-after-possible-last-put in the rescuer logic. Both have been
    around for quite some time now and are unlikely to have triggered
    noticeably often. All patches are marked for -stable backport"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix a possible race condition between rescuer and pwq-release
    workqueue: make rescuer_thread() empty wq->maydays list before exiting
    workqueue: fix bugs in wq_update_unbound_numa() failure path

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "During recent restructuring, device_cgroup unified config input check
    and enforcement logic; unfortunately, it turned out to share too much.
    Aristeu's patches fix the breakage and are marked for -stable backport.

    The other two patches are fallouts from kernfs conversion. The blkcg
    change is temporary and will go away once kernfs internal locking gets
    simplified (patches pending)"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
    device_cgroup: check if exception removal is allowed
    device_cgroup: fix the comment format for recently added functions
    device_cgroup: rework device access check and exception checking
    cgroup: fix the retry path of cgroup_mount()

    Linus Torvalds
     

12 May, 2014

1 commit

  • switch_hrtimer_base() calls hrtimer_check_target() which ensures that
    we do not migrate a timer to a remote cpu if the timer expires before
    the current programmed expiry time on that remote cpu.

    But __hrtimer_start_range_ns() calls switch_hrtimer_base() before the
    new expiry time is set. So the sanity check in hrtimer_check_target()
    is operating on stale or even uninitialized data.

    Update expiry time before calling switch_hrtimer_base().

    [ tglx: Rewrote changelog once again ]
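    The ordering bug can be modeled in miniature. This is a hedged toy
    sketch, not the hrtimer code: toy_timer and toy_check_target() are
    invented stand-ins for the timer and for hrtimer_check_target(),
    which compares the timer's expiry against the remote cpu's next
    programmed event, so the expiry field must be written before the
    migration check reads it.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Toy model only (not kernel code). */
    struct toy_timer { int64_t expires; };

    /* Refuse migration if the timer would fire before the remote cpu's
     * next programmed event, mirroring the sanity check's intent. */
    static int toy_check_target(struct toy_timer *t, int64_t remote_next_event)
    {
        return t->expires < remote_next_event ? -1 : 0;
    }

    int main(void)
    {
        struct toy_timer t = { .expires = 0 };  /* stale/uninitialized */
        int64_t remote_next_event = 100;

        /* old order: the check runs on stale data and gives the wrong answer */
        assert(toy_check_target(&t, remote_next_event) == -1);

        /* fixed order: set the new expiry first, then run the check */
        t.expires = 500;
        assert(toy_check_target(&t, remote_next_event) == 0);
        return 0;
    }
    ```
    
    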

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: linaro-networking@linaro.org
    Cc: fweisbec@gmail.com
    Cc: arvind.chauhan@arm.com
    Link: http://lkml.kernel.org/r/81999e148745fc51bbcd0615823fbab9b2e87e23.1399882253.git.viresh.kumar@linaro.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

10 May, 2014

1 commit

  • Pull x86 fixes from Peter Anvin:
    "A somewhat unpleasantly large collection of small fixes. The big ones
    are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'. It
    was using __init functions with predictably suboptimal results.

    Another key fix is a build fix which would produce output that simply
    would not decompress correctly in some configuration, due to the
    existing Makefiles picking up an unfortunate local label and mistaking
    it for the global symbol _end.

    Additional fixes include the handling of 64-bit numbers when setting
    the vdso data page (a latent bug which became manifest when i386
    started exporting a vdso with time functions), a fix to the new MSR
    manipulation accessors which would cause features to not get properly
    unblocked, a build fix for 32-bit userland, and a few new platform
    quirks"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall()
    x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro
    x86: Fix typo preventing msr_set/clear_bit from having an effect
    x86/intel: Add quirk to disable HPET for the Baytrail platform
    x86/hpet: Make boot_hpet_disable extern
    x86-64, build: Fix stack protector Makefile breakage with 32-bit userland
    x86/reboot: Add reboot quirk for Certec BPC600
    asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
    asmlinkage, x86: Add explicit __visible to arch/x86/*
    asmlinkage: Revert "lto: Make asmlinkage __visible"
    x86, build: Don't get confused by local symbols
    x86/efi: earlyprintk=efi,keep fix

    Linus Torvalds
     

09 May, 2014

1 commit

  • …l/git/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "This contains two fixes.

    The first is a long standing bug that causes bogus data to show up in
    the refcnt field of the module_refcnt tracepoint. It was introduced
    by a merge conflict resolution back in 2.6.35-rc days.

    The result should be 'refcnt = incs - decs', but instead it did
    'refcnt = incs + decs'.

    The second fix is to a bug that was introduced in this merge window
    that allowed for a tracepoint funcs pointer to be used after it was
    freed. Moving the location of where the probes are released solved
    the problem"

    * tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracepoint: Fix use of tracepoint funcs after rcu free
    trace: module: Maintain a valid user count

    Linus Torvalds
     

08 May, 2014

1 commit

  • Commit de7b2973903c "tracepoint: Use struct pointer instead of name hash
    for reg/unreg tracepoints" introduces a use after free by calling
    release_probes on the old struct tracepoint array before the newly
    allocated array is published with rcu_assign_pointer. There is a race
    window where tracepoints (RCU readers) can perform a
    "use-after-grace-period-after-free", which shows up as a GPF in
    stress-tests.

    Link: http://lkml.kernel.org/r/53698021.5020108@oracle.com
    Link: http://lkml.kernel.org/p/1399549669-25465-1-git-send-email-mathieu.desnoyers@efficios.com

    Reported-by: Sasha Levin
    CC: Oleg Nesterov
    CC: Dave Jones
    Fixes: de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints"
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Steven Rostedt

    Mathieu Desnoyers
     

07 May, 2014

9 commits

  • Also initialize the per-sd variables for newidle load balancing
    in sd_numa_init().

    Signed-off-by: Jason Low
    Acked-by: morten.rasmussen@arm.com
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: preeti@linux.vnet.ibm.com
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Cc: aswin@hp.com
    Cc: chegu_vinod@hp.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1398303035-18255-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • The following commit:

    e5fc66119ec9 ("sched: Fix race in idle_balance()")

    can potentially cause rq->max_idle_balance_cost to not be updated,
    even when load_balance(NEWLY_IDLE) is attempted and the per-sd
    max cost value is updated.

    Preeti noticed a similar issue with updating rq->next_balance.

    In this patch, we fix this by making sure we still check/update those values
    even if a task gets enqueued while browsing the domains.

    Signed-off-by: Jason Low
    Reviewed-by: Preeti U Murthy
    Signed-off-by: Peter Zijlstra
    Cc: morten.rasmussen@arm.com
    Cc: aswin@hp.com
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Link: http://lkml.kernel.org/r/1398725155-7591-2-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Tim wrote:

    "The current code will call pick_next_task_fair a second time in the
    slow path if we did not pull any task in our first try. This is
    really unnecessary as we already know no task can be pulled and it
    doubles the delay for the cpu to enter idle.

    We instrumented some network workloads and saw that
    pick_next_task_fair is frequently called twice before a cpu enters
    idle. The call to pick_next_task_fair can add non-trivial latency as
    it calls load_balance, which runs find_busiest_group on a hierarchy of
    sched domains spanning the cpus for a large system. For some 4-socket
    systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
    before a cpu can be idled."

    Optimize the second call away for the common case and document the
    dependency.

    Reported-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Len Brown
    Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The check at the beginning of cpupri_find() makes sure that the task_pri
    variable does not exceed the cp->pri_to_cpu array length. But that length
    is CPUPRI_NR_PRIORITIES, not MAX_RT_PRIO, so the check misses the last
    two priorities in that array.

    As task_pri is computed from convert_prio(), which should never return a
    value bigger than CPUPRI_NR_PRIORITIES, the check should cause a panic
    if it is hit.

    Reported-by: Mike Galbraith
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Steven Rostedt (Red Hat)
     
  • Free cpudl->free_cpus allocated in cpudl_init().

    Signed-off-by: Li Zefan
    Acked-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: # 3.14+
    Link: http://lkml.kernel.org/r/534F36CE.2000409@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • yield_task_dl() is broken:

    o it forces current to be throttled setting its runtime to zero;
    o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
    will queue it back with proper parameters at replenish time.

    Unfortunately, dl_task_timer() has this check at the very beginning:

    if (!dl_task(p) || dl_se->dl_new)
            goto unlock;

    So, it just bails out and the task is never replenished. It actually
    yielded forever.

    To fix this, introduce a new flag indicating that the task properly yielded
    the CPU before its current runtime expired. While this is a little overkill
    at the moment, the flag would be useful in the future to discriminate between
    "good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
    and "bad" jobs (for which dl_throttled task has been set) that needed to be
    stopped.

    Reported-by: yjay.kim
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Russell reported, that irqtime_account_idle_ticks() takes ages due to:

    for (i = 0; i < ticks; i++)
            irqtime_account_process_tick(current, 0, rq);

    It's sad that this code was written way _AFTER_ the NOHZ idle
    functionality was available. I charge myself guilty for not paying
    attention when that crap got merged with commit abb74cefa ("sched:
    Export ns irqtimes through /proc/stat").

    So instead of looping nr_ticks times just apply the whole thing at
    once.

    As a side note: The whole cputime_t vs. u64 business in that context
    wants to be cleaned up as well. There is no point in having all these
    back and forth conversions. Lets standardise on u64 nsec for all
    kernel internal accounting and be done with it. Everything else does
    not make sense at all for fine grained accounting. Frederic, can you
    please take care of that?

    Reported-by: Russell King
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Shaun Ruffell
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • perf_pin_task_context() can return NULL but perf_event_init_context()
    assumes it will not, correct this.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20140505171428.GU26782@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When removing a (sibling) event we do:

    raw_spin_lock_irq(&ctx->lock);
    perf_group_detach(event);
    raw_spin_unlock_irq(&ctx->lock);

    perf_remove_from_context(event);
    raw_spin_lock_irq(&ctx->lock);
    ...
    raw_spin_unlock_irq(&ctx->lock);

    Now, assuming the event is a sibling, it will be 'unreachable' for
    things like ctx_sched_out() because that iterates the
    groups->siblings, and we just unhooked the sibling.

    So, if ctx_sched_out() runs during that window, it will miss the
    event and not call event_sched_out() on it, leaving it programmed
    on the PMU.

    The subsequent perf_remove_from_context() call will find the ctx is
    inactive and only call list_del_event() to remove the event from all
    other lists.

    Hereafter we can proceed to free the event; while still programmed!

    Close this hole by moving perf_group_detach() inside the same
    ctx->lock region(s) perf_remove_from_context() has.

    The condition on inherited events only in __perf_event_exit_task() is
    likely complete crap because non-inherited events are part of groups
    too and we're tearing down just the same. But leave that for another
    patch.

    Most-likely-Fixes: e03a9a55b4e ("perf: Change close() semantics for group events")
    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Much-staring-at-traces-by: Vince Weaver
    Much-staring-at-traces-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 May, 2014

1 commit