05 Sep, 2018

1 commit

  • commit cb9d7fd51d9fbb329d182423bd7b92d0f8cb0e01 upstream.

    Some architectures need to use stop_machine() to patch functions for
    ftrace, and the assumption is that the stopped CPUs do not make function
    calls to traceable functions when they are in the stopped state.

    Commit ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after
    MULTI_STOP_PREPARE") added calls to the watchdog touch functions from
    the stopped CPUs and those functions lack notrace annotations. This
    leads to crashes when enabling/disabling ftrace on ARM kernels built
    with the Thumb-2 instruction set.

    Fix it by adding the necessary notrace annotations.
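
    As a rough sketch of the kind of change described (the exact set of
    annotated helpers lives in kernel/watchdog.c and include/linux/nmi.h),
    the watchdog touch functions gain a notrace attribute so the CPUs parked
    in stop_machine() never enter ftrace-patchable code:

        /* notrace keeps ftrace from patching this function, so CPUs parked
         * in multi_cpu_stop() can call it safely while ftrace is rewriting
         * instructions elsewhere. */
        notrace void touch_softlockup_watchdog(void)
        {
                touch_softlockup_watchdog_sched();
                wq_watchdog_touch(raw_smp_processor_id());
        }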

    Fixes: ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE")
    Signed-off-by: Vincent Whitchurch
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: oleg@redhat.com
    Cc: tj@kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180821152507.18313-1-vincent.whitchurch@axis.com
    Signed-off-by: Greg Kroah-Hartman

    Vincent Whitchurch
     

30 May, 2018

1 commit

  • [ Upstream commit 537f4146c53c95aac977852b371bafb9c6755ee1 ]

    Never directly free @dev after calling device_register(), even
    if it returned an error! Always use put_device() to give up the
    reference initialized in this function instead.
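
    A minimal sketch of the pattern being enforced (hypothetical caller and
    release callback, not the exact driver code touched by this patch):

        dev = kzalloc(sizeof(*dev), GFP_KERNEL);
        if (!dev)
                return -ENOMEM;

        dev->release = my_release;      /* hypothetical release that kfree()s */
        ret = device_register(dev);
        if (ret) {
                /* device_register() already initialized the reference;
                 * put_device() drops it and invokes ->release() to free
                 * @dev.  A direct kfree(dev) here would be the bug. */
                put_device(dev);
                return ret;
        }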

    Signed-off-by: Arvind Yadav
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Arvind Yadav
     

15 Mar, 2018

1 commit

  • commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream.

    Introduce a helper to retrieve the current task's work struct if it is
    a workqueue worker.

    This allows us to fix a long-standing deadlock in several DRM drivers
    wherein the ->runtime_suspend callback waits for a specific worker to
    finish and that worker in turn calls a function which waits for runtime
    suspend to finish. That function is invoked from multiple call sites
    and waiting for runtime suspend to finish is the correct thing to do
    except if it's executing in the context of the worker.
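
    A hedged sketch of how a driver can use the new helper (the device and
    work item names here are hypothetical):

        #include <linux/workqueue.h>

        static bool in_hpd_worker(struct my_device *md)
        {
                /* current_work() returns the work item this task is
                 * executing if it is a workqueue worker, otherwise NULL. */
                return current_work() == &md->hpd_work;
        }

    The deadlock described above is then avoided by skipping the "wait for
    runtime suspend" step when this returns true.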

    Cc: Lai Jiangshan
    Cc: Dave Airlie
    Cc: Ben Skeggs
    Cc: Alex Deucher
    Acked-by: Tejun Heo
    Reviewed-by: Lyude Paul
    Signed-off-by: Lukas Wunner
    Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     

24 Jan, 2018

1 commit

  • commit 62635ea8c18f0f62df4cc58379e4f1d33afd5801 upstream.

    show_workqueue_state() can print out a lot of messages while in atomic
    context, e.g. sysrq-t -> show_workqueue_state(). If the console device
    is slow, it may end up triggering the NMI hard-lockup watchdog.
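
    The shape of the fix, roughly (simplified sketch of the dump loop):

        for_each_pool(pool, pi) {
                /* ... print the pool's workers and pending work items ... */

                /*
                 * Printing from atomic context to a slow console can take
                 * long enough to trip the NMI watchdog; keep feeding it
                 * between iterations.
                 */
                touch_nmi_watchdog();
        }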

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     

10 Oct, 2017

1 commit

  • Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
    lockdep:

    [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
    [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
    [ 1270.473240] -----------------------------------------------------
    [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [] __mutex_unlock_slowpath+0xa2/0x280
    [ 1270.474994]
    [ 1270.474994] and this task is already holding:
    [ 1270.475440] (&pool->lock/1){-.-.}, at: [] worker_thread+0x366/0x3c0
    [ 1270.476046] which would create a new lock dependency:
    [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
    [ 1270.476949]
    [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
    [ 1270.477553] (&pool->lock/1){-.-.}
    ...
    [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
    [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
    ...
    [ 1270.494735] Possible interrupt unsafe locking scenario:
    [ 1270.494735]
    [ 1270.495250]        CPU0                    CPU1
    [ 1270.495600]        ----                    ----
    [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
    [ 1270.496295]                                local_irq_disable();
    [ 1270.496753]                                lock(&pool->lock/1);
    [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
    [ 1270.497744]   <Interrupt>
    [ 1270.497948]     lock(&pool->lock/1);

    This will cause an irq inversion deadlock if the above locking scenario
    happens.

    The root cause of this safe -> unsafe lock order is the
    mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
    held.

    Unlocking a mutex while holding an irq spinlock was never safe and this
    problem has been around forever, but it never got noticed because the
    mutex is usually only trylocked while holding the irq spinlock, making
    actual failures very unlikely, and the lockdep annotation missed the
    condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
    lockdep_assert_held() fail").

    Using a mutex for pool->manager_arb has always been a bit of a stretch.
    It is primarily a mechanism to arbitrate managership between workers,
    which can easily be done with a pool flag. The only reason it became a
    mutex is that the pool destruction path wants to exclude parallel
    managing operations.

    This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
    and makes the destruction path wait for the current manager on a wait
    queue.
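
    A simplified sketch of the arbitration described above (both sides run
    under pool->lock; details and error handling omitted):

        /* manage_workers(): only one worker may manage a pool at a time. */
        if (pool->flags & POOL_MANAGER_ACTIVE)
                return false;                   /* someone else is managing */
        pool->flags |= POOL_MANAGER_ACTIVE;
        /* ... create/adjust workers ... */
        pool->flags &= ~POOL_MANAGER_ACTIVE;
        wake_up(&wq_manager_wait);

        /* put_unbound_pool(): wait out the manager instead of mutex_lock(). */
        wait_event_lock_irq(wq_manager_wait,
                            !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
        pool->flags |= POOL_MANAGER_ACTIVE;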

    v2: Drop unnecessary flag clearing before pool destruction as
    suggested by Boqun.

    Signed-off-by: Tejun Heo
    Reported-by: Josef Bacik
    Reviewed-by: Lai Jiangshan
    Cc: Peter Zijlstra
    Cc: Boqun Feng
    Cc: stable@vger.kernel.org

    Tejun Heo
     

07 Sep, 2017

1 commit

  • Pull workqueue updates from Tejun Heo:
    "Nothing major. I introduced a flag collsion bug during v4.13 cycle
    which is fixed in this pull request. Fortunately, the flag is for
    debugging / verification and the bug isn't critical"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Fix flag collision
    workqueue: Use TASK_IDLE
    workqueue: fix path to documentation
    workqueue: doc change for ST behavior on NUMA systems

    Linus Torvalds
     

29 Aug, 2017

1 commit

    Where XHLOCK_{SOFT,HARD} are save/restore points in xhlocks[] to ensure
    the temporal IRQ events don't interact with task state, XHLOCK_PROC is a
    fundamentally different beast that just happens to share the interface.

    The purpose of XHLOCK_PROC is to annotate independent execution inside
    one task. For example, with workqueues each work item should appear to
    run in its own 'pristine' 'task'.

    Remove XHLOCK_PROC in favour of its own interface to avoid confusion.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: david@fromorbit.com
    Cc: johannes@sipsolutions.net
    Cc: kernel-team@lge.com
    Cc: oleg@redhat.com
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Aug, 2017

2 commits

    The new completion/crossrelease annotations interact unfavourably with
    the extant flush_work()/flush_workqueue() annotations.

    The problem is that when a single work class does:

    wait_for_completion(&C)

    and

    complete(&C)

    in different executions, we'll build dependencies like:

    lock_map_acquire(W)
    complete_acquire(C)

    and

    lock_map_acquire(W)
    complete_release(C)

    which results in the dependency chain W->C->W, which lockdep thinks
    spells deadlock, even though there is no deadlock potential since the
    works are run concurrently.

    One possibility would be to change the work 'lock' to recursive-read,
    but that would mean hitting a lockdep limitation on recursive locks.
    Also, unconditionally switching to recursive-read here would fail to
    detect the actual deadlock on single-threaded workqueues, which do
    have a problem with this.

    For now, forcefully disregard these locks for crossrelease.
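
    For illustration, the pattern that produces the W->C->W chain looks
    roughly like this (names and the should_wait() predicate are made up;
    the point is that both executions belong to the same work class):

        static DECLARE_COMPLETION(done);

        /* Different executions of one work class either wait on or
         * complete the same completion. */
        static void my_work_fn(struct work_struct *work)
        {
                if (should_wait(work))            /* hypothetical predicate */
                        wait_for_completion(&done);   /* A(W); C acquired   */
                else
                        complete(&done);              /* A(W); C released   */
        }
        static DECLARE_WORK(my_work, my_work_fn);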

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: byungchul.park@lge.com
    Cc: david@fromorbit.com
    Cc: johannes@sipsolutions.net
    Cc: oleg@redhat.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The flush_work() annotation as introduced by commit:

    e159489baa71 ("workqueue: relax lockdep annotation on flush_work()")

    hits on the lockdep problem with recursive read locks.

    The situation as described is:

    Work W1:               Work W2:        Task:

      ARR(Q)                 ARR(Q)          flush_workqueue(Q)
      A(W1)                  A(W2)             A(Q)
        flush_work(W2)                         R(Q)
          A(W2)
          R(W2)
          if (special)
            A(Q)
          else
            ARR(Q)
          R(Q)

    where: A - acquire, ARR - acquire-read-recursive, R - release.

    Where under 'special' conditions we want to trigger a lock recursion
    deadlock, but otherwise allow the flush_work(). The allowing is done
    by using recursive read locks (ARR), but lockdep is broken for
    recursive stuff.

    However, there appears to be no need to acquire the lock if we're not
    'special', so if we remove the 'else' clause things become much
    simpler and no longer need the recursion thing at all.
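
    Roughly, the resulting annotation in start_flush_work() keeps only the
    'special' case (a sketch, not the verbatim diff):

        if (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer) {
                /* Single-threaded or rescuer-equipped workqueue: flushing a
                 * work from inside the same queue can really deadlock, so
                 * take the full lockdep dependency. */
                lock_map_acquire(&pwq->wq->lockdep_map);
                lock_map_release(&pwq->wq->lockdep_map);
        }
        /* The old 'else' recursive-read acquire is simply dropped. */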

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: byungchul.park@lge.com
    Cc: david@fromorbit.com
    Cc: johannes@sipsolutions.net
    Cc: oleg@redhat.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Aug, 2017

1 commit

    Workqueues don't use signals; they (ab)use TASK_INTERRUPTIBLE to avoid
    increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE
    for this case:

    80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")

    Use it.
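
    The substitution itself is a one-liner in the worker sleep path (sketch):

        /* Idle workers sleep without contributing to loadavg and without
         * caring about signals; TASK_IDLE is
         * TASK_UNINTERRUPTIBLE | TASK_NOLOAD. */
        __set_current_state(TASK_IDLE);         /* was TASK_INTERRUPTIBLE */
        schedule();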

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Peter Zijlstra
     

17 Aug, 2017

1 commit

  • With the new lockdep crossrelease feature, which checks completions usage,
    a false positive is reported in the workqueue code:

    > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
    > Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
    > Task C : acquired of lock#3 -> wait for completion of barr->done
    > (Task C is in lru_add_drain_all_cpuslocked())
    > Worker D : wait for wfc.work to be released -> will complete barr->done

    Such a deadlock cannot happen because Task C's barr->done and Worker D's
    barr->done cannot be the same instance.

    The reason for this false positive is that we initialize all
    wq_barrier::done at insert_wq_barrier() via init_completion(), which
    makes them all belong to the same lock class; therefore, impossible
    circles are reported.

    To fix this, explicitly initialize the lockdep map for wq_barrier::done
    in insert_wq_barrier(), so that the lock class key of wq_barrier::done is
    a subkey of the corresponding work_struct. As a result, we won't build a
    dependency between a wq_barrier and an unrelated work, and we can
    differentiate wq barriers based on their related works, so the false
    positive above is avoided.
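
    The fix in insert_wq_barrier() is roughly of this shape (a sketch; the
    exact symbol names in the upstream patch may differ slightly):

        /* Key the barrier's completion to the work it flushes, so each
         * wq_barrier::done gets a per-work subclass instead of one class
         * shared by all barriers. */
        lockdep_init_map_crosslock((struct lockdep_map *)&barr->done.map,
                                   "(complete)wq_barr::done",
                                   target->lockdep_map.key, 1);
        __init_completion(&barr->done);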

    Also define an empty lockdep_init_map_crosslock() for !CROSSRELEASE to
    keep the code simple and free of unnecessary #ifdefs.

    Reported-by: Ingo Molnar
    Signed-off-by: Boqun Feng
    Cc: Byungchul Park
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
    Signed-off-by: Ingo Molnar

    Boqun Feng
     

10 Aug, 2017

1 commit

  • Lockdep is a runtime locking correctness validator that detects and
    reports a deadlock or its possibility by checking dependencies between
    locks. It's useful since it does not report just an actual deadlock but
    also the possibility of a deadlock that has not actually happened yet.
    That enables problems to be fixed before they affect real systems.

    However, this facility is only applicable to typical locks, such as
    spinlocks and mutexes, which are normally released within the context in
    which they were acquired. Synchronization primitives like page locks or
    completions, which may be released in any context, also create
    dependencies and can cause deadlocks.

    So lockdep should track these locks as well to do a better job. The
    'crossrelease' implementation makes these primitives tracked too.
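
    A minimal illustration of the class of bug this lets lockdep see
    (generic example, not taken from the patch):

        static DEFINE_MUTEX(m);
        static DECLARE_COMPLETION(c);

        /* Task A */
        mutex_lock(&m);
        wait_for_completion(&c);    /* blocks forever: B cannot take &m */

        /* Task B */
        mutex_lock(&m);             /* blocks behind A */
        complete(&c);               /* never reached */

    Classic lockdep cannot connect &c to &m because complete() may legally
    run in any context; crossrelease records that dependency anyway.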

    Signed-off-by: Byungchul Park
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Cc: willy@infradead.org
    Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com
    Signed-off-by: Ingo Molnar

    Byungchul Park
     

28 Jul, 2017

1 commit

  • There is an underlying assumption/trade-off in many layers of the Linux
    system that CPU node mapping is static. This is despite the presence
    of features like NUMA and 'hotplug' that support the dynamic addition/
    removal of fundamental system resources like CPUs and memory. PowerPC
    systems, however, do provide extensive features for the dynamic change
    of resources available to a system.

    Currently, there is little or no synchronization protection around the
    updating of the CPU node mapping, and the export/update of this
    information for other layers / modules. In systems which can change
    this mapping during 'hotplug', like PowerPC, the information is changing
    underneath all layers that might reference it.

    This patch attempts to ensure that a valid, usable cpumask attribute
    is used by the workqueue infrastructure when setting up new resource
    pools. It prevents a crash that has been observed when an 'empty'
    cpumask is passed along to the worker/task scheduling code. It is
    intended as a temporary workaround until a more fundamental review and
    correction of the issue can be done.

    [With additions to the patch provided by Tejun Heo]

    Signed-off-by: Michael Bringmann
    Signed-off-by: Tejun Heo

    Michael Bringmann
     

26 Jul, 2017

1 commit

    5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
    ordered") automatically enabled the ordered attribute for unbound
    workqueues w/ max_active == 1. Because ordered workqueues reject
    max_active and some attribute changes, this implicit ordered mode broke
    cases where the user creates an unbound workqueue w/ max_active == 1 and
    later explicitly changes the related attributes.

    This patch distinguishes between explicit and implicit ordered settings
    and lets attribute changes override the ordered mode when it was set
    implicitly.
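
    Roughly, the distinction looks like this (sketch of the flag usage, not
    the full diff):

        /* alloc_ordered_workqueue() marks its ordering as explicit ... */
        #define alloc_ordered_workqueue(fmt, flags, args...)                 \
                alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED |             \
                                __WQ_ORDERED_EXPLICIT | (flags), 1, ##args)

        /* ... while attribute/max_active changes may drop an implicit one. */
        if (!(wq->flags & __WQ_ORDERED_EXPLICIT))
                wq->flags &= ~__WQ_ORDERED;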

    Signed-off-by: Tejun Heo
    Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")

    Tejun Heo
     

19 Jul, 2017

1 commit

    The combination of WQ_UNBOUND and max_active == 1 used to imply ordered
    execution. After the NUMA affinity work in 4c16bd327c74 ("workqueue:
    implement NUMA affinity for unbound workqueues"), this is no longer true
    due to per-node worker pools.

    While the right way to create an ordered workqueue is
    alloc_ordered_workqueue(), the documentation has been misleading for a
    long time and people do use WQ_UNBOUND and max_active == 1 for ordered
    workqueues which can lead to subtle bugs which are very difficult to
    trigger.

    It's unlikely that we'd see noticeable performance impact by enforcing
    ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's
    automatically set __WQ_ORDERED for those workqueues.
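
    The change boils down to one check during workqueue allocation (sketch):

        /*
         * Unbound + max_active == 1 used to imply ordered execution; keep
         * that behavior by marking such workqueues ordered automatically.
         */
        if ((flags & WQ_UNBOUND) && max_active == 1)
                flags |= __WQ_ORDERED;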

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Reported-by: Alexei Potashnik
    Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues")
    Cc: stable@vger.kernel.org # v3.10+

    Tejun Heo
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

02 May, 2017

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - another round of rq-clock handling debugging, robustization and
    fixes

    - PELT accounting improvements

    - CPU hotplug related ->cpus_allowed affinity handling fixes all
    around the tree

    - ... plus misc fixes, cleanups and updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
    sched/x86: Update reschedule warning text
    crypto: N2 - Replace racy task affinity logic
    cpufreq/sparc-us2e: Replace racy task affinity logic
    cpufreq/sparc-us3: Replace racy task affinity logic
    cpufreq/sh: Replace racy task affinity logic
    cpufreq/ia64: Replace racy task affinity logic
    ACPI/processor: Replace racy task affinity logic
    ACPI/processor: Fix error handling in __acpi_processor_start()
    sparc/sysfs: Replace racy task affinity logic
    powerpc/smp: Replace open coded task affinity logic
    ia64/sn/hwperf: Replace racy task affinity logic
    ia64/salinfo: Replace racy task affinity logic
    workqueue: Provide work_on_cpu_safe()
    ia64/topology: Remove cpus_allowed manipulation
    sched/fair: Move the PELT constants into a generated header
    sched/fair: Increase PELT accuracy for small tasks
    sched/fair: Fix comments
    sched/Documentation: Add 'sched-pelt' tool
    sched/fair: Fix corner case in __accumulate_sum()
    sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
    ...

    Linus Torvalds
     
  • Pull workqueue update from Tejun Heo:
    "One trivial patch to use setup_deferrable_timer() instead of
    open-coding the initialization"

    * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: use setup_deferrable_timer

    Linus Torvalds
     

15 Apr, 2017

1 commit

    work_on_cpu() is not protected against CPU hotplug. For code which must
    either run on an online CPU or fail if the CPU is not available, the
    call site would have to protect against CPU hotplug itself.

    Provide a function which does get/put_online_cpus() around the call to
    work_on_cpu() and fails the call with -ENODEV if the target CPU is not
    online.

    Preparatory patch to convert several racy task affinity manipulations.
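
    The helper is essentially a hotplug-protected wrapper, roughly of this
    shape (sketch of the behavior described above):

        long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
        {
                long ret = -ENODEV;

                get_online_cpus();
                if (cpu_online(cpu))
                        ret = work_on_cpu(cpu, fn, arg);
                put_online_cpus();
                return ret;
        }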

    Signed-off-by: Thomas Gleixner
    Acked-by: Tejun Heo
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Herbert Xu
    Cc: "Rafael J. Wysocki"
    Cc: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Sebastian Siewior
    Cc: Lai Jiangshan
    Cc: Viresh Kumar
    Cc: Michael Ellerman
    Cc: "David S. Miller"
    Cc: Len Brown
    Link: http://lkml.kernel.org/r/20170412201042.262610721@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

10 Feb, 2017

1 commit

  • Currently CONFIG_TIMER_STATS exposes process information across namespaces:

    kernel/time/timer_list.c print_timer():

    SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

    /proc/timer_list:

    #11: , hrtimer_wakeup, S:01, do_nanosleep, cron/2570

    Given that the tracer can give the same information, this patch entirely
    removes CONFIG_TIMER_STATS.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Acked-by: John Stultz
    Cc: Nicolas Pitre
    Cc: linux-doc@vger.kernel.org
    Cc: Lai Jiangshan
    Cc: Shuah Khan
    Cc: Xing Gao
    Cc: Jonathan Corbet
    Cc: Jessica Frazelle
    Cc: kernel-hardening@lists.openwall.com
    Cc: Nicolas Iooss
    Cc: "Paul E. McKenney"
    Cc: Petr Mladek
    Cc: Richard Cochran
    Cc: Tejun Heo
    Cc: Michal Marek
    Cc: Josh Poimboeuf
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Olof Johansson
    Cc: Andrew Morton
    Cc: linux-api@vger.kernel.org
    Cc: Arjan van de Ven
    Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
    Signed-off-by: Thomas Gleixner

    Kees Cook
     

20 Oct, 2016

2 commits

  • Tejun Heo
     
  • While splitting up workqueue initialization into two parts,
    ac8f73400782 ("workqueue: make workqueue available early during boot")
    put wq_numa_init() into workqueue_init_early(). Unfortunately, on
    some archs including power and arm64, cpu to node mapping isn't yet
    established by the time the early init is called leading to incorrect
    NUMA initialization and subsequently the following oops due to zero
    cpumask on node-specific unbound pools.

    Unable to handle kernel paging request for data at address 0x00000038
    Faulting instruction address: 0xc0000000000fc0cc
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
    task: c0000007f5400000 task.stack: c000001ffc084000
    NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
    REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
    MSR: 9000000002009033 CR: 48000424 XER: 00000000
    CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
    GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
    GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
    GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
    GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
    GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
    GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
    NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
    LR [c0000000000ed928] activate_task+0x78/0xe0
    Call Trace:
    [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
    [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
    [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
    [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
    [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
    [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
    [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
    [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
    [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
    Instruction dump:
    62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
    60420000 72490021 ebfe0150 2f890001 419e0de0 7fbee840 419e0e58
    ---[ end trace 0000000000000000 ]---

    Fix it by moving wq_numa_init() to workqueue_init(). As this means that
    the early initialization may not have full NUMA info for per-cpu pools
    and ignores NUMA affinity for unbound pools, fix them up from
    workqueue_init() after wq_numa_init().
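
    A sketch of the reordering, inside workqueue_init() (simplified):

        wq_numa_init();         /* cpu-to-node mapping is established by now */

        /* Fix up per-cpu pools created early with incomplete NUMA info. */
        for_each_possible_cpu(cpu)
                for_each_cpu_worker_pool(pool, cpu)
                        pool->node = cpu_to_node(cpu);

        /* Re-evaluate NUMA affinity for unbound workqueues created early. */
        list_for_each_entry(wq, &workqueues, list)
                wq_update_unbound_numa(wq, smp_processor_id(), true);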

    Signed-off-by: Tejun Heo
    Reported-by: Michael Ellerman
    Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
    Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
    Signed-off-by: Tejun Heo

    Tejun Heo
     

12 Oct, 2016

1 commit

  • Patch series "kthread: Kthread worker API improvements"

    The intention of this patchset is to make it easier to manipulate and
    maintain kthreads. In particular, I want to replace all the custom main
    loops with a generic one. Also I want to make the kthreads sleep in a
    consistent state in a common place when there is no work.

    This patch (of 11):

    A good practice is to prefix the names of functions by the name of the
    subsystem.

    This patch fixes the name of probe_kthread_data(). The other wrongly
    named functions are part of the kthread worker API and will be fixed
    separately.

    Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Suggested-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

18 Sep, 2016

2 commits

  • keventd_up() no longer has in-kernel users. Remove it and make
    wq_online static.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
    Workqueue is currently initialized in an early init call; however, there
    are cases where early boot code has to be split and reordered to come
    after workqueue initialization, or the same code path which makes use of
    workqueues is used both before and after workqueue initialization. The
    latter cases have to gate workqueue usage with keventd_up() tests, which
    is nasty and easy to get wrong.

    Workqueue usage has become widespread and it'd be a lot more convenient
    if workqueues could be used very early during boot. This patch splits
    workqueue initialization into two steps: workqueue_init_early(), which
    sets up the basic data structures so that workqueues can be created and
    work items queued, and workqueue_init(), which actually brings workqueues
    online and starts executing queued work items. The former step can be
    done very early during boot, once memory allocation, cpumasks and idr
    are initialized. The latter can be done right after kthreads become
    available.

    This allows work item queueing and canceling from very early boot
    which is what most of these use cases want.

    * As system_wq being initialized no longer indicates that the workqueue
    subsystem is fully online, update keventd_up() to test wq_online
    instead. The follow-up patches will get rid of all its usages and the
    function itself.

    * Flushing doesn't make sense before workqueue is fully initialized.
    The flush functions trigger WARN and return immediately before fully
    online.

    * Work items are never in-flight before fully online. Canceling can
    always succeed by skipping the flush step.

    * Some code paths can no longer assume to be called with irq enabled
    as irq is disabled during early boot. Use irqsave/restore
    operations instead.

    v2: Watchdog init, which requires timer to be running, moved from
    workqueue_init_early() to workqueue_init().
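
    Call sites of the two stages, roughly (bodies elided; the second call
    site also appears in the backtrace of a later fix in this log):

        /* init/main.c */
        asmlinkage __visible void __init start_kernel(void)
        {
                /* ... memory allocation, cpumasks and idr are up ... */
                workqueue_init_early();   /* works can be queued/canceled */
                /* ... */
        }

        static noinline void __init kernel_init_freeable(void)
        {
                /* ... kthreads are available ... */
                workqueue_init();         /* workers start, queued items run */
                /* ... */
        }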

    Signed-off-by: Tejun Heo
    Suggested-by: Linus Torvalds
    Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com

    Tejun Heo
     

16 Sep, 2016

1 commit

  • destroy_workqueue() performs a number of sanity checks to ensure that
    the workqueue is empty before proceeding with destruction. However,
    it's not always easy to tell what's going on just from the warning
    message. Let's dump workqueue state after sanity check failures to
    help debugging.
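
    The sanity-check failure paths now dump the state before bailing out,
    roughly like this (sketch; field names follow the workqueue code of that
    era and may not match the patch exactly):

        if (WARN_ON(pwq->refcnt > 1) ||
            WARN_ON(pwq->nr_active) ||
            WARN_ON(!list_empty(&pwq->delayed_works))) {
                mutex_unlock(&wq->mutex);
                show_workqueue_state();   /* full dump to aid debugging */
                return;
        }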

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com
    Cc: Dmitry Vyukov

    Tejun Heo
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 hundred line of unpenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • * pm-sleep:
    PM / hibernate: Introduce test_resume mode for hibernation
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    PM / hibernate: Recycle safe pages after image restoration
    PM / hibernate: Simplify mark_unsafe_pages()
    PM / hibernate: Do not free preallocated safe pages during image restore
    PM / suspend: show workqueue state in suspend flow
    PM / sleep: make PM notifiers called symmetrically
    PM / sleep: Make pm_prepare_console() return void
    PM / Hibernate: Don't let kasan instrument snapshot.c

    * pm-tools:
    PM / tools: scripts: AnalyzeSuspend v4.2
    tools/turbostat: allow user to alter DESTDIR and PREFIX

    Rafael J. Wysocki
     

14 Jul, 2016

1 commit

  • Get rid of the prio ordering of the separate notifiers and use a proper state
    callback pair.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Acked-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Nicolas Iooss
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Rusty Russell
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

17 Jun, 2016

1 commit

  • With commit e9d867a67fd03ccc ("sched: Allow per-cpu kernel threads to
    run on online && !active"), __set_cpus_allowed_ptr() expects that only
    strict per-cpu kernel threads can have affinity to an online CPU which
    is not yet active.

    This assumption is currently broken in the CPU_ONLINE notification
    handler for the workqueues, where restore_unbound_workers_cpumask()
    calls set_cpus_allowed_ptr() when the first CPU in the unbound worker's
    pool->attr->cpumask comes online. Since set_cpus_allowed_ptr() is called
    with pool->attr->cpumask, in which only one CPU is online and that CPU
    is not yet active, we get the following WARN_ON during a CPU online
    operation.

    ------------[ cut here ]------------
    WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
    __set_cpus_allowed_ptr+0x228/0x2e0
    Modules linked in:
    CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4

    Call Trace:
    [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
    [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
    [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
    [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
    [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
    [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
    [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
    [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
    [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
    [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
    ---[ end trace 00f1456578b2a3b2 ]---

    This patch fixes this by limiting the mask to the intersection of
    the pool affinity and online CPUs.
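
    The fix in restore_unbound_workers_cpumask() is roughly (sketch):

        static cpumask_t cpumask;

        /* Restrict the pool affinity to CPUs that are actually online
         * before handing it to the scheduler. */
        cpumask_and(&cpumask, pool->attrs->cpumask, cpu_online_mask);

        for_each_pool_worker(worker, pool)
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, &cpumask) < 0);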

    Changelog-cribbed-from: Gautham R. Shenoy
    Reported-by: Abdul Haleem
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Peter Zijlstra
     

20 May, 2016

2 commits

    When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object, the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of this
    in its fixup callbacks. Actually we can put it into the debugobjects
    core. Thus we can avoid duplicated code and have *pure* fixup callbacks.

    To achieve this, a new callback "is_static_object" is introduced to let
    the type-specific code decide whether an object is static or not. If it
    is, we take it into the object tracker, otherwise we warn and invoke the
    fixup callback.
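
    For example, the workqueue hook can decide staticness roughly like this
    (sketch based on the existing WORK_STRUCT_STATIC debugobjects support):

        static bool work_is_static_object(void *addr)
        {
                struct work_struct *work = addr;

                /* Statically initialized works carry the STATIC bit in
                 * their data word. */
                return test_bit(WORK_STRUCT_STATIC_BIT, work_data_bits(work));
        }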

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can they change an object
    which is in an incorrect state on fixup? Because 'addr' may not point to
    any valid object if a non-static object is not tracked, changing such an
    object can overwrite someone else's memory and cause unexpected
    behaviour. For example, timer_fixup_activate binds the timer to the
    function stub_timer.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
    Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     

14 May, 2016

1 commit

  • Pull workqueue fix from Tejun Heo:
    "CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
    DOWN_PREPARE which can trigger a WARN_ON() in workqueue.

    The bug has been there for a very long time. It only triggers if CPU
    down fails at a specific point and I don't think it has adverse
    effects other than the warning messages. The fix is very low impact"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix rebind bound workers warning

    Linus Torvalds
     

13 May, 2016

1 commit

  • ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
    Modules linked in:
    CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
    Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
    0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
    0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
    ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
    Call Trace:
    dump_stack+0x89/0xd4
    __warn+0xfd/0x120
    warn_slowpath_null+0x1d/0x20
    rebind_workers+0x1c0/0x1d0
    workqueue_cpu_up_callback+0xf5/0x1d0
    notifier_call_chain+0x64/0x90
    ? trace_hardirqs_on_caller+0xf2/0x220
    ? notify_prepare+0x80/0x80
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x35/0x50
    notify_down_prepare+0x5e/0x80
    ? notify_prepare+0x80/0x80
    cpuhp_invoke_callback+0x73/0x330
    ? __schedule+0x33e/0x8a0
    cpuhp_down_callbacks+0x51/0xc0
    cpuhp_thread_fun+0xc1/0xf0
    smpboot_thread_fn+0x159/0x2a0
    ? smpboot_create_threads+0x80/0x80
    kthread+0xef/0x110
    ? wait_for_completion+0xf0/0x120
    ? schedule_tail+0x35/0xf0
    ret_from_fork+0x22/0x50
    ? __init_kthread_worker+0x70/0x70
    ---[ end trace eb12ae47d2382d8f ]---
    notify_down_prepare: attempt to take down CPU 0 failed

    This bug can be reproduced with the config below, with nohz_full=
    covering all CPUs:

    CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
    CONFIG_DEBUG_HOTPLUG_CPU0=y
    CONFIG_NO_HZ_FULL=y

    As Thomas pointed out:

    | If a down prepare callback fails, then DOWN_FAILED is invoked for all
    | callbacks which have successfully executed DOWN_PREPARE.
    |
    | But, workqueue has actually two notifiers. One which handles
    | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
    |
    | Now look at the priorities of those callbacks:
    |
    | CPU_PRI_WORKQUEUE_UP = 5
    | CPU_PRI_WORKQUEUE_DOWN = -5
    |
    | So the call order on DOWN_PREPARE is:
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Ignores DOWN_PREPARE
    | CB ...
    | CB X ---> Fails
    |
    | So we call up to CB X with DOWN_FAILED
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Handles DOWN_FAILED
    | CB ...
    | CB X-1
    |
    | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
    | callback, while it should do it in the down callback. Which is not a good idea
    | either because it wants to be called early on rollback...
    |
    | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
    | the callbacks become symetric, but for the existing mess, we need some
    | workaround in the workqueue code.

    The boot CPU handles housekeeping duty (unbound timers, workqueues,
    timekeeping, ...) on behalf of full dynticks CPUs. It must remain online
    when nohz full is enabled. There is a priority assigned to every
    notifier_block:

    workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

    So the tick_nohz_cpu_down callback failed when preparing to take down
    CPU 0, and the notifier_blocks behind tick_nohz_cpu_down were not called
    any more, which means the workers were not actually unbound. The hotplug
    state machine then fell back to undo and brought CPU 0 online again.
    Workers were rebound unconditionally even though they had never been
    unbound, triggering the warning in the process.

    This patch fixes it by catching !DISASSOCIATED to avoid rebinding
    workers that were never unbound.
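
    The workaround in rebind_workers() is roughly (sketch):

        spin_lock_irq(&pool->lock);

        /*
         * DOWN_FAILED can arrive without a preceding DOWN_PREPARE, in which
         * case the pool was never disassociated and its workers were never
         * unbound -- nothing to rebind, so bail out.
         */
        if (!(pool->flags & POOL_DISASSOCIATED)) {
                spin_unlock_irq(&pool->lock);
                return;
        }

        pool->flags &= ~POOL_DISASSOCIATED;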

    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frédéric Weisbecker
    Cc: stable@vger.kernel.org
    Suggested-by: Lai Jiangshan
    Signed-off-by: Wanpeng Li

    Wanpeng Li