20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
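
    The rename amounts to the typedef change sketched below (declarations
    only; the struct members are not touched by this patch, so they are
    elided here):

        /* before: the name suggests a queue, but this is an entry */
        struct __wait_queue;
        typedef struct __wait_queue wait_queue_t;

        /* after: the entry is named as what it is, and the struct can
         * drop its double underscore */
        struct wait_queue_entry;
        typedef struct wait_queue_entry wait_queue_entry_t;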
     

02 May, 2017

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - another round of rq-clock handling debugging, robustization and
    fixes

    - PELT accounting improvements

    - CPU hotplug related ->cpus_allowed affinity handling fixes all
    around the tree

    - ... plus misc fixes, cleanups and updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
    sched/x86: Update reschedule warning text
    crypto: N2 - Replace racy task affinity logic
    cpufreq/sparc-us2e: Replace racy task affinity logic
    cpufreq/sparc-us3: Replace racy task affinity logic
    cpufreq/sh: Replace racy task affinity logic
    cpufreq/ia64: Replace racy task affinity logic
    ACPI/processor: Replace racy task affinity logic
    ACPI/processor: Fix error handling in __acpi_processor_start()
    sparc/sysfs: Replace racy task affinity logic
    powerpc/smp: Replace open coded task affinity logic
    ia64/sn/hwperf: Replace racy task affinity logic
    ia64/salinfo: Replace racy task affinity logic
    workqueue: Provide work_on_cpu_safe()
    ia64/topology: Remove cpus_allowed manipulation
    sched/fair: Move the PELT constants into a generated header
    sched/fair: Increase PELT accuracy for small tasks
    sched/fair: Fix comments
    sched/Documentation: Add 'sched-pelt' tool
    sched/fair: Fix corner case in __accumulate_sum()
    sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
    ...

    Linus Torvalds
     
  • Pull workqueue update from Tejun Heo:
    "One trivial patch to use setup_deferrable_timer() instead of
    open-coding the initialization"

    * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: use setup_deferrable_timer

    Linus Torvalds
     

15 Apr, 2017

1 commit

  • work_on_cpu() is not protected against CPU hotplug. Code which must either
    run on an online CPU or fail when the CPU is not available would otherwise
    have to protect against CPU hotplug at every call site.

    Provide a function which does get/put_online_cpus() around the call to
    work_on_cpu() and fails the call with -ENODEV if the target CPU is not
    online.

    Preparatory patch to convert several racy task affinity manipulations.

    Signed-off-by: Thomas Gleixner
    Acked-by: Tejun Heo
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Herbert Xu
    Cc: "Rafael J. Wysocki"
    Cc: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Sebastian Siewior
    Cc: Lai Jiangshan
    Cc: Viresh Kumar
    Cc: Michael Ellerman
    Cc: "David S. Miller"
    Cc: Len Brown
    Link: http://lkml.kernel.org/r/20170412201042.262610721@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
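
    A minimal sketch of such a wrapper, reconstructed from the description
    above (mainline names the helper work_on_cpu_safe(); the body below is
    illustrative rather than the exact patch):

        #include <linux/cpu.h>
        #include <linux/workqueue.h>

        /* Run @fn on @cpu, or fail with -ENODEV if @cpu is not online.
         * get/put_online_cpus() keep @cpu from being unplugged while
         * the work runs. */
        long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
        {
                long ret = -ENODEV;

                get_online_cpus();
                if (cpu_online(cpu))
                        ret = work_on_cpu(cpu, fn, arg);
                put_online_cpus();
                return ret;
        }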
     

10 Feb, 2017

1 commit

  • Currently CONFIG_TIMER_STATS exposes process information across namespaces:

    kernel/time/timer_list.c print_timer():

    SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

    /proc/timer_list:

    #11: , hrtimer_wakeup, S:01, do_nanosleep, cron/2570

    Given that the tracer can give the same information, this patch entirely
    removes CONFIG_TIMER_STATS.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Acked-by: John Stultz
    Cc: Nicolas Pitre
    Cc: linux-doc@vger.kernel.org
    Cc: Lai Jiangshan
    Cc: Shuah Khan
    Cc: Xing Gao
    Cc: Jonathan Corbet
    Cc: Jessica Frazelle
    Cc: kernel-hardening@lists.openwall.com
    Cc: Nicolas Iooss
    Cc: "Paul E. McKenney"
    Cc: Petr Mladek
    Cc: Richard Cochran
    Cc: Tejun Heo
    Cc: Michal Marek
    Cc: Josh Poimboeuf
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Olof Johansson
    Cc: Andrew Morton
    Cc: linux-api@vger.kernel.org
    Cc: Arjan van de Ven
    Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
    Signed-off-by: Thomas Gleixner

    Kees Cook
     

20 Oct, 2016

1 commit

  • While splitting up workqueue initialization into two parts,
    ac8f73400782 ("workqueue: make workqueue available early during boot")
    put wq_numa_init() into workqueue_init_early(). Unfortunately, on
    some archs, including power and arm64, the cpu-to-node mapping isn't
    yet established by the time the early init is called, leading to
    incorrect NUMA initialization and, subsequently, the following oops
    due to a zero cpumask on node-specific unbound pools.

    Unable to handle kernel paging request for data at address 0x00000038
    Faulting instruction address: 0xc0000000000fc0cc
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
    task: c0000007f5400000 task.stack: c000001ffc084000
    NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
    REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
    MSR: 9000000002009033 CR: 48000424 XER: 00000000
    CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
    GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
    GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
    GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
    GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
    GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
    GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
    NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
    LR [c0000000000ed928] activate_task+0x78/0xe0
    Call Trace:
    [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
    [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
    [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
    [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
    [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
    [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
    [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
    [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
    [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
    Instruction dump:
    62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
    60420000 72490021 ebfe0150 2f890001 419e0de0 7fbee840 419e0e58
    ---[ end trace 0000000000000000 ]---

    Fix it by moving wq_numa_init() to workqueue_init(). As this means
    that the early initialization may not have full NUMA info for per-cpu
    pools and ignores NUMA affinity for unbound pools, fix them up from
    workqueue_init() after wq_numa_init().

    Signed-off-by: Tejun Heo
    Reported-by: Michael Ellerman
    Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
    Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
    Signed-off-by: Tejun Heo

    Tejun Heo
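
    The shape of the fix, per the description above, is roughly the
    following (simplified sketch; the locking and the unbound-pool
    fix-up of the real patch are elided):

        void __init workqueue_init(void)
        {
                struct worker_pool *pool;
                int cpu;

                /* cpu-to-node mappings are reliable by now, so do the
                 * NUMA setup here rather than in workqueue_init_early() */
                wq_numa_init();

                /* fix up per-cpu pools created early with incomplete
                 * NUMA info */
                for_each_possible_cpu(cpu)
                        for_each_cpu_worker_pool(pool, cpu)
                                pool->node = cpu_to_node(cpu);

                /* ... then create initial workers and bring workqueues
                 * online as before ... */
        }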
     

12 Oct, 2016

1 commit

  • Patch series "kthread: Kthread worker API improvements"

    The intention of this patchset is to make it easier to manipulate and
    maintain kthreads. Especially, I want to replace all the custom main
    cycles with a generic one. Also I want to make the kthreads sleep in a
    consistent state in a common place when there is no work.

    This patch (of 11):

    It is good practice to prefix function names with the name of the
    subsystem.

    This patch fixes the name of probe_kthread_data(). The other wrongly
    named functions are part of the kthread worker API and will be fixed
    separately.

    Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Suggested-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
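
    For illustration, the convention turns the name around like this
    (the post-rename name shown, kthread_probe_data(), is the one used
    in mainline):

        /* before: the subsystem name is buried in the middle */
        void *probe_kthread_data(struct task_struct *task);

        /* after: prefixed with the subsystem, like other kthread_*() APIs */
        void *kthread_probe_data(struct task_struct *task);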
     

18 Sep, 2016

2 commits

  • keventd_up() no longer has in-kernel users. Remove it and make
    wq_online static.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Workqueue is currently initialized in an early init call; however,
    there are cases where early boot code has to be split and reordered
    to come after workqueue initialization, or where the same code path
    is used both before and after workqueue initialization. The latter
    cases have to gate workqueue usage with keventd_up() tests, which is
    nasty and easy to get wrong.

    Workqueue usage has become widespread and it'd be a lot more
    convenient if it could be used very early during boot. This patch
    splits workqueue initialization into two steps: workqueue_init_early(),
    which sets up the basic data structures so that workqueues can be
    created and work items queued, and workqueue_init(), which actually
    brings workqueues online and starts executing queued work items. The
    former step can be done very early during boot, once memory
    allocation, cpumasks and idr are initialized; the latter, right after
    kthreads become available.

    This allows work item queueing and canceling from very early boot
    which is what most of these use cases want.

    * As system_wq being initialized no longer indicates that workqueue
    is fully online, update keventd_up() to test wq_online instead.
    The follow-up patches will get rid of all its usages and the
    function itself.

    * Flushing doesn't make sense before workqueue is fully initialized.
    The flush functions trigger a WARN and return immediately when
    called before fully online.

    * Work items are never in-flight before fully online. Canceling can
    always succeed by skipping the flush step.

    * Some code paths can no longer assume they are called with irqs
    enabled, as irqs are disabled during early boot. Use irqsave/restore
    operations instead.

    v2: Watchdog init, which requires timer to be running, moved from
    workqueue_init_early() to workqueue_init().

    Signed-off-by: Tejun Heo
    Suggested-by: Linus Torvalds
    Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com

    Tejun Heo
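
    The resulting boot ordering, sketched from the description above
    (call sites simplified; the oops trace in the 20 Oct entry shows
    workqueue_init() being reached from kernel_init_freeable()):

        /* init/main.c, simplified sketch */
        asmlinkage __visible void __init start_kernel(void)
        {
                /* ... memory allocation, cpumasks and idr are up ... */
                workqueue_init_early(); /* wqs can be created, work queued */
                /* ... */
        }

        static noinline void __init kernel_init_freeable(void)
        {
                /* ... kthreads are available here ... */
                workqueue_init();       /* workers start executing work */
                /* ... */
        }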
     

16 Sep, 2016

1 commit

  • destroy_workqueue() performs a number of sanity checks to ensure that
    the workqueue is empty before proceeding with destruction. However,
    it's not always easy to tell what's going on just from the warning
    message. Let's dump workqueue state after sanity check failures to
    help debugging.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com
    Cc: Dmitry Vyukov

    Tejun Heo
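
    Conceptually the change is a one-liner in the failure path, along
    these lines (sketch; show_workqueue_state() is the dumper workqueue
    already uses elsewhere):

        /* in destroy_workqueue()'s sanity checks, sketch */
        if (WARN_ON(pwq->refcnt > 1) ||
            WARN_ON(pwq->nr_active) ||
            WARN_ON(!list_empty(&pwq->delayed_works))) {
                show_workqueue_state(); /* dump state to aid debugging */
                return;
        }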
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 lines of impenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • * pm-sleep:
    PM / hibernate: Introduce test_resume mode for hibernation
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    PM / hibernate: Recycle safe pages after image restoration
    PM / hibernate: Simplify mark_unsafe_pages()
    PM / hibernate: Do not free preallocated safe pages during image restore
    PM / suspend: show workqueue state in suspend flow
    PM / sleep: make PM notifiers called symmetrically
    PM / sleep: Make pm_prepare_console() return void
    PM / Hibernate: Don't let kasan instrument snapshot.c

    * pm-tools:
    PM / tools: scripts: AnalyzeSuspend v4.2
    tools/turbostat: allow user to alter DESTDIR and PREFIX

    Rafael J. Wysocki
     

14 Jul, 2016

1 commit

  • Get rid of the prio ordering of the separate notifiers and use a proper state
    callback pair.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Acked-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Nicolas Iooss
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Rusty Russell
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
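
    With the state machine, the two prio-ordered notifiers become one
    explicit, symmetric registration pair, roughly as below (sketch
    using the cpuhp API; the callback names follow the pattern of the
    conversion series and are illustrative):

        ret = cpuhp_setup_state_nocalls(CPUHP_WORKQUEUE_PREP,
                                        "workqueue:prepare",
                                        workqueue_prepare_cpu, NULL);
        /* online/offline form the symmetric callback pair */
        ret = cpuhp_setup_state_nocalls(CPUHP_AP_WORKQUEUE_ONLINE,
                                        "workqueue:online",
                                        workqueue_online_cpu,
                                        workqueue_offline_cpu);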
     

17 Jun, 2016

1 commit

  • With commit e9d867a67fd03ccc ("sched: Allow per-cpu kernel threads to
    run on online && !active"), __set_cpus_allowed_ptr() expects that only
    strict per-cpu kernel threads can have affinity to an online CPU which
    is not yet active.

    This assumption is currently broken in the CPU_ONLINE notification
    handler for the workqueues, where restore_unbound_workers_cpumask()
    calls set_cpus_allowed_ptr() when the first cpu in the unbound
    worker's pool->attrs->cpumask comes online. Since
    set_cpus_allowed_ptr() is called with a pool->attrs->cpumask in which
    only one CPU is online, and that CPU is not yet active, we get the
    following WARN_ON during a CPU online operation.

    ------------[ cut here ]------------
    WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
    __set_cpus_allowed_ptr+0x228/0x2e0
    Modules linked in:
    CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4

    Call Trace:
    [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
    [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
    [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
    [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
    [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
    [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
    [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
    [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
    [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
    [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
    ---[ end trace 00f1456578b2a3b2 ]---

    This patch fixes this by limiting the mask to the intersection of
    the pool affinity and online CPUs.

    Changelog-cribbed-from: Gautham R. Shenoy
    Reported-by: Abdul Haleem
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Peter Zijlstra
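
    The fix described in the last paragraph boils down to the following
    (sketch of restore_unbound_workers_cpumask(); locking and the
    first-online-CPU check are elided):

        static void restore_unbound_workers_cpumask(struct worker_pool *pool,
                                                    int cpu)
        {
                static cpumask_t cpumask;
                struct worker *worker;

                if (!cpumask_test_cpu(cpu, pool->attrs->cpumask))
                        return;

                /* use only the intersection with online CPUs, so the
                 * mask can never name an online-but-not-yet-active CPU
                 * alone */
                cpumask_and(&cpumask, pool->attrs->cpumask, cpu_online_mask);

                for_each_pool_worker(worker, pool)
                        WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                          &cpumask) < 0);
        }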
     

20 May, 2016

2 commits

  • When activating a static object we need to make sure that the object
    is tracked in the object tracker. If it is a non-static object, the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of
    this in its fixup callbacks. We can instead put it into the
    debugobjects core. That way we save duplicated code and get *pure*
    fixup callbacks.

    To achieve this, a new callback "is_static_object" is introduced to
    let the type-specific code decide whether an object is static or not.
    If it is, we take it into the object tracker; otherwise we warn and
    invoke the fixup callback.

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can a fixup change an
    object which is in an incorrect state? Since 'addr' may not point to
    any valid object if a non-static object is not tracked, changing such
    an object can overwrite someone else's memory and cause unexpected
    behaviour. For example, timer_fixup_activate binds a timer to the
    function stub_timer.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
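
    What such a callback looks like for one subsystem, sketched here for
    workqueue's work_struct debugobjects support (the STATIC bit test
    follows workqueue's existing debug code; treat the details as
    illustrative):

        static bool work_is_static_object(void *addr)
        {
                struct work_struct *work = addr;

                /* statically initialized work items carry a static bit
                 * in their data word */
                return test_bit(WORK_STRUCT_STATIC_BIT,
                                work_data_bits(work));
        }

        static struct debug_obj_descr work_debug_descr = {
                .name             = "work_struct",
                .is_static_object = work_is_static_object,
                /* ... the fixup callbacks stay "pure" ... */
        };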
     
  • Update the return type to use bool instead of int, corresponding to
    the change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     

14 May, 2016

1 commit

  • Pull workqueue fix from Tejun Heo:
    "CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
    DOWN_PREPARE which can trigger a WARN_ON() in workqueue.

    The bug has been there for a very long time. It only triggers if CPU
    down fails at a specific point and I don't think it has adverse
    effects other than the warning messages. The fix is very low impact"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix rebind bound workers warning

    Linus Torvalds
     

13 May, 2016

1 commit

  • ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
    Modules linked in:
    CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
    Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
    0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
    0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
    ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
    Call Trace:
    dump_stack+0x89/0xd4
    __warn+0xfd/0x120
    warn_slowpath_null+0x1d/0x20
    rebind_workers+0x1c0/0x1d0
    workqueue_cpu_up_callback+0xf5/0x1d0
    notifier_call_chain+0x64/0x90
    ? trace_hardirqs_on_caller+0xf2/0x220
    ? notify_prepare+0x80/0x80
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x35/0x50
    notify_down_prepare+0x5e/0x80
    ? notify_prepare+0x80/0x80
    cpuhp_invoke_callback+0x73/0x330
    ? __schedule+0x33e/0x8a0
    cpuhp_down_callbacks+0x51/0xc0
    cpuhp_thread_fun+0xc1/0xf0
    smpboot_thread_fn+0x159/0x2a0
    ? smpboot_create_threads+0x80/0x80
    kthread+0xef/0x110
    ? wait_for_completion+0xf0/0x120
    ? schedule_tail+0x35/0xf0
    ret_from_fork+0x22/0x50
    ? __init_kthread_worker+0x70/0x70
    ---[ end trace eb12ae47d2382d8f ]---
    notify_down_prepare: attempt to take down CPU 0 failed

    This bug can be reproduced with the config below, with nohz_full=
    covering all CPUs:

    CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
    CONFIG_DEBUG_HOTPLUG_CPU0=y
    CONFIG_NO_HZ_FULL=y

    As Thomas pointed out:

    | If a down prepare callback fails, then DOWN_FAILED is invoked for all
    | callbacks which have successfully executed DOWN_PREPARE.
    |
    | But, workqueue has actually two notifiers. One which handles
    | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
    |
    | Now look at the priorities of those callbacks:
    |
    | CPU_PRI_WORKQUEUE_UP = 5
    | CPU_PRI_WORKQUEUE_DOWN = -5
    |
    | So the call order on DOWN_PREPARE is:
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Ignores DOWN_PREPARE
    | CB ...
    | CB X ---> Fails
    |
    | So we call up to CB X with DOWN_FAILED
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Handles DOWN_FAILED
    | CB ...
    | CB X-1
    |
    | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
    | callback, while it should do it in the down callback. Which is not a good idea
    | either because it wants to be called early on rollback...
    |
    | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
    | the callbacks become symmetric, but for the existing mess, we need some
    | workaround in the workqueue code.

    The boot CPU handles housekeeping duty (unbound timers, workqueues,
    timekeeping, ...) on behalf of full dynticks CPUs. It must remain
    online when nohz full is enabled. Each notifier_block has a priority,
    giving the order:

    workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

    So the tick_nohz_cpu_down callback fails during down-prepare of CPU 0,
    and the notifier_blocks behind tick_nohz_cpu_down are not called any
    more, which means the workers are never actually unbound. The hotplug
    state machine then falls back to undo and onlines CPU 0 again. The
    workers are rebound unconditionally even though they were never
    unbound, triggering the warning in the process.

    This patch fixes it by catching !DISASSOCIATED to avoid rebinding
    already-bound workers.

    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frédéric Weisbecker
    Cc: stable@vger.kernel.org
    Suggested-by: Lai Jiangshan
    Signed-off-by: Wanpeng Li

    Wanpeng Li
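
    The workaround amounts to bailing out of rebind_workers() when the
    pool was never disassociated, roughly (sketch; the check sits under
    pool->lock as in the mainline fix):

        static void rebind_workers(struct worker_pool *pool)
        {
                /* ... restore each worker's cpumask ... */

                spin_lock_irq(&pool->lock);

                /* DOWN_FAILED can arrive without a preceding
                 * DOWN_PREPARE, in which case the workers were never
                 * unbound and there is nothing to rebind */
                if (!(pool->flags & POOL_DISASSOCIATED)) {
                        spin_unlock_irq(&pool->lock);
                        return;
                }

                pool->flags &= ~POOL_DISASSOCIATED;
                /* ... rebind and restore concurrency management ... */
        }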
     

28 Apr, 2016

1 commit

  • Pull workqueue fix from Tejun Heo:
    "So, it turns out we had a silly bug in the most fundamental part of
    workqueue for a very long time. AFAICS, this dates back to pre-git
    era and has quite likely been there from the time workqueue was first
    introduced.

    A work item uses its PENDING bit to synchronize multiple queuers.
    Anyone who wins the PENDING bit owns the pending state of the work
    item. Whether a queuer wins or loses the race, one thing should be
    guaranteed - there will soon be at least one execution of the work
    item "after" the queueing attempt - where "after" means that the
    execution instance would be able to see all the changes that the
    queuer has made prior to the queueing attempt.

    Unfortunately, we were missing a smp_mb() after clearing PENDING for
    execution, so nothing guaranteed visibility of the changes that a
    queueing loser has made, which manifested as a reproducible blk-mq
    stall.

    Lots of kudos to Roman for debugging the problem. The patch for
    -stable is the minimal one. For v4.7, Peter is working on a patch to
    make the code path slightly more efficient and less fragile"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix ghost PENDING flag while doing MQ IO

    Linus Torvalds
     

26 Apr, 2016

1 commit

  • The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
    with the following backtrace:

    [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
    [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
    [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
    [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
    [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
    [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
    [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
    [ 601.350965] Call Trace:
    [ 601.351203] [] ? bit_wait+0x60/0x60
    [ 601.351444] [] schedule+0x35/0x80
    [ 601.351709] [] schedule_timeout+0x192/0x230
    [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220
    [ 601.352208] [] ? ktime_get+0x37/0xa0
    [ 601.352446] [] ? bit_wait+0x60/0x60
    [ 601.352688] [] io_schedule_timeout+0xa4/0x110
    [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10
    [ 601.353196] [] bit_wait_io+0x1b/0x70
    [ 601.353440] [] __wait_on_bit+0x5d/0x90
    [ 601.353689] [] wait_on_page_bit+0xc0/0xd0
    [ 601.353958] [] ? autoremove_wake_function+0x40/0x40
    [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140
    [ 601.354441] [] filemap_fdatawait_range+0x14/0x30
    [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70
    [ 601.354932] [] blkdev_fsync+0x1b/0x50
    [ 601.355193] [] vfs_fsync_range+0x49/0xa0
    [ 601.355432] [] blkdev_write_iter+0xca/0x100
    [ 601.355679] [] __vfs_write+0xaa/0xe0
    [ 601.355925] [] vfs_write+0xa9/0x1a0
    [ 601.356164] [] kernel_write+0x38/0x50

    The underlying device is a null_blk, with default parameters:

    queue_mode = MQ
    submit_queues = 1

    Verification that nullb0 has something inflight:

    root@pserver8:~# cat /sys/block/nullb0/inflight
    0 1
    root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
    ...
    /sys/block/nullb0/mq/0/cpu2/rq_list
    CTX pending:
    ffff8838038e2400
    ...

    During debugging it became clear that the stalled request is always
    inserted into the rq_list from the following path:

    save_stack_trace_tsk + 34
    blk_mq_insert_requests + 231
    blk_mq_flush_plug_list + 281
    blk_flush_plug_list + 199
    wait_on_page_bit + 192
    __filemap_fdatawait_range + 228
    filemap_fdatawait_range + 20
    filemap_write_and_wait_range + 63
    blkdev_fsync + 27
    vfs_fsync_range + 73
    blkdev_write_iter + 202
    __vfs_write + 170
    vfs_write + 169
    kernel_write + 56

    So blk_flush_plug_list() was called with from_schedule == true.

    If from_schedule is true, that means blk_mq_insert_requests()
    offloads execution of __blk_mq_run_hw_queue() to the kblockd
    workqueue, i.e. it calls kblockd_schedule_delayed_work_on().

    That means we race with another CPU, which is about to execute
    __blk_mq_run_hw_queue() work.

    Further debugging shows the following traces from different CPUs:

    CPU#0                                  CPU#1
    ----------------------------------     -------------------------------
    request A inserted
    STORE hctx->ctx_map[0] bit marked
    kblockd_schedule...() returns 1
                                           request B inserted
                                           STORE hctx->ctx_map[1] bit marked
                                           kblockd_schedule...() returns 0
    *** WORK PENDING bit is cleared ***
    flush_busy_ctxs() is executed, but
    bit 1, set by CPU#1, is not observed

    As a result, request B remained pending forever.

    This behaviour can be explained by a speculative LOAD of
    hctx->ctx_map on CPU#0, which is reordered with the clear of the
    PENDING bit and executed _before_ the actual STORE of bit 1 on CPU#1.

    The proper fix is an explicit full barrier, smp_mb(), which
    guarantees that the clear of the PENDING bit is executed before any
    speculative LOADs or STOREs inside the actual work function.

    Signed-off-by: Roman Pen
    Cc: Gioh Kim
    Cc: Michael Wang
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Tejun Heo

    Roman Pen
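
    The barrier lands where PENDING is cleared before execution; a
    sketch of the fixed helper, reconstructed from the description
    (compare set_work_pool_and_clear_pending() in mainline):

        static void set_work_pool_and_clear_pending(struct work_struct *work,
                                                    int pool_id)
        {
                /* updates made by the queuer must be visible before
                 * PENDING is handed to the next owner */
                smp_wmb();
                set_work_data(work,
                              (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT,
                              0);
                /* full barrier: the clear of the PENDING bit must not be
                 * reordered with speculative LOADs or STOREs inside the
                 * work function that runs next */
                smp_mb();
        }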
     

16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    Which are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

11 Feb, 2016

1 commit

  • When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node. However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.

    This has always been broken but hadn't triggered often enough before
    874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
    After that commit, workqueue forcefully assigns the local CPU to
    delayed work items without an explicit target CPU, to fix a different
    issue. This widens the window in which a CPU can go offline while a
    delayed work item is pending, causing delayed work items to be
    dispatched with their target CPU set to an already-offlined CPU. The
    resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
    item on a NULL pool_workqueue and thus crash.

    While 874bbfe600a6 has been reverted for a different reason making the
    bug less visible again, it can still happen. Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
    This is a temporary workaround. The long term solution is keeping CPU
    -> NODE mapping stable across CPU off/online cycles which is being
    worked on.

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Cc: Tang Chen
    Cc: Rafael J. Wysocki
    Cc: Len Brown
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

    Tejun Heo
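
    The workaround is a guard in the lookup helper, along these lines
    (sketch of unbound_pwq_by_node()):

        static struct pool_workqueue *
        unbound_pwq_by_node(struct workqueue_struct *wq, int node)
        {
                /* an offlined CPU may map to NUMA_NO_NODE; fall back to
                 * the default pwq instead of dereferencing a NULL table
                 * slot */
                if (unlikely(node == NUMA_NO_NODE))
                        return wq->dfl_pwq;

                return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }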
     

10 Feb, 2016

3 commits

  • Workqueue used to guarantee local execution for work items queued
    without an explicit target CPU. That guarantee is gone now, which can
    break some usages in subtle ways. To flush out those cases, this
    patch implements a debug feature which forces round-robin CPU
    selection for all such work items.

    The debug feature defaults to off and can be enabled with a kernel
    parameter. The default can be flipped with a debug config option.

    If you hit this commit during bisection, please refer to 041bd12e272c
    ("Revert "workqueue: make sure delayed work run in local cpu"") for
    more information and ping me.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • WORK_CPU_UNBOUND work items queued to a bound workqueue always run
    locally. This is a good thing normally, but not when the user has
    asked us to keep unbound work away from certain CPUs. Round robin
    these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
    trumps performance.

    tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty
    (wq_unbound_cpumask AND cpu_online_mask). If we want that, it
    should be done when config changes.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo

    Mike Galbraith
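
    This entry and the debug feature above funnel into one selection
    helper; a sketch of the round-robin logic (names follow mainline's
    wq_select_unbound_cpu(); details are simplified):

        static DEFINE_PER_CPU(int, wq_rr_cpu_last);

        static int wq_select_unbound_cpu(int cpu)
        {
                int new_cpu;

                /* without the debug knob, local queueing is fine as
                 * long as the local CPU isn't excluded by the mask */
                if (likely(!wq_debug_force_rr_cpu) &&
                    cpumask_test_cpu(cpu, wq_unbound_cpumask))
                        return cpu;

                /* otherwise round-robin over allowed online CPUs */
                new_cpu = __this_cpu_read(wq_rr_cpu_last);
                new_cpu = cpumask_next_and(new_cpu, wq_unbound_cpumask,
                                           cpu_online_mask);
                if (unlikely(new_cpu >= nr_cpu_ids)) {
                        new_cpu = cpumask_first_and(wq_unbound_cpumask,
                                                    cpu_online_mask);
                        if (unlikely(new_cpu >= nr_cpu_ids))
                                return cpu; /* nothing suitable online */
                }
                __this_cpu_write(wq_rr_cpu_last, new_cpu);
                return new_cpu;
        }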
     
  • This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

    Workqueue used to implicitly guarantee that work items queued without
    an explicit CPU specified are put on the local CPU. Recent changes in
    the timer code broke the guarantee and led to vmstat breakage, which
    was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work
    on the CPU we need it to run on").

    vmstat is the most likely to expose the issue and it's quite possible
    that there are other similar problems which are a lot more difficult
    to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
    sure delayed work run in local cpu") was applied to restore the local
    CPU guarantee. Unfortunately, the change exposed a bug in timer code
    which got fixed by 22b886dd1018 ("timers: Use proper base migration in
    add_timer_on()"). Due to code restructuring, the commit couldn't be
    backported beyond a certain point, and stable kernels which only had
    874bbfe600a6 started crashing.

    The local CPU guarantee was accidental more than anything else and we
    want to get rid of it anyway. As, with the vmstat case fixed,
    874bbfe600a6 is causing more problems than it's fixing, it has been
    decided to take the chance and officially break the guarantee by
    reverting the commit. A debug feature will be added to force foreign
    CPU assignment to expose cases relying on the guarantee and fixes for
    the individual cases will be backported to stable as necessary.

    Signed-off-by: Tejun Heo
    Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
    Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
    Cc: stable@vger.kernel.org
    Cc: Mike Galbraith
    Cc: Henrique de Moraes Holschuh
    Cc: Daniel Bilik
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Sasha Levin
    Cc: Ben Hutchings
    Cc: Thomas Gleixner
    Cc: Daniel Bilik
    Cc: Jiri Slaby
    Cc: Michal Hocko

    Tejun Heo
     

30 Jan, 2016

1 commit

  • fca839c00a12 ("workqueue: warn if memory reclaim tries to flush
    !WQ_MEM_RECLAIM workqueue") implemented a flush dependency warning
    which triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue
    tries to flush a !WQ_MEM_RECLAIM workqueue.

    The assumption is that workqueues marked with WQ_MEM_RECLAIM sit in
    the memory reclaim path, so making them depend on something which may
    itself need more memory to make forward progress can lead to
    deadlocks. Unfortunately, workqueues created with the legacy
    create*_workqueue() interface always have WQ_MEM_RECLAIM, regardless
    of whether memory reclaim actually depends on them or not. These
    spurious WQ_MEM_RECLAIM markings cause spurious triggering of the
    flush dependency checks.

    WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
    workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ...
    Workqueue: deferwq deferred_probe_work_func
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x94/0xd4)
    [] (dump_stack) from [] (warn_slowpath_common+0x80/0xb0)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x30/0x40)
    [] (warn_slowpath_fmt) from [] (check_flush_dependency+0x138/0x144)
    [] (check_flush_dependency) from [] (flush_work+0x50/0x15c)
    [] (flush_work) from [] (lru_add_drain_all+0x130/0x180)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd8/0x338)
    [] (alloc_contig_range) from [] (cma_alloc+0xe0/0x1ac)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (__dma_alloc+0x240/0x278)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x54/0x5c)
    [] (arm_dma_alloc) from [] (dmam_alloc_coherent+0xc0/0xec)
    [] (dmam_alloc_coherent) from [] (ahci_port_start+0x150/0x1dc)
    [] (ahci_port_start) from [] (ata_host_start.part.3+0xc8/0x1c8)
    [] (ata_host_start.part.3) from [] (ata_host_activate+0x50/0x148)
    [] (ata_host_activate) from [] (ahci_host_activate+0x44/0x114)
    [] (ahci_host_activate) from [] (ahci_platform_init_host+0x1d8/0x3c8)
    [] (ahci_platform_init_host) from [] (tegra_ahci_probe+0x448/0x4e8)
    [] (tegra_ahci_probe) from [] (platform_drv_probe+0x50/0xac)
    [] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
    [] (driver_probe_device) from [] (bus_for_each_drv+0x60/0x94)
    [] (bus_for_each_drv) from [] (__device_attach+0xb0/0x114)
    [] (__device_attach) from [] (bus_probe_device+0x84/0x8c)
    [] (bus_probe_device) from [] (deferred_probe_work_func+0x68/0x98)
    [] (deferred_probe_work_func) from [] (process_one_work+0x120/0x3f8)
    [] (process_one_work) from [] (worker_thread+0x38/0x55c)
    [] (worker_thread) from [] (kthread+0xdc/0xf4)
    [] (kthread) from [] (ret_from_fork+0x14/0x3c)

    Fix it by marking workqueues created via create*_workqueue() with
    __WQ_LEGACY and disabling flush dependency checks on them.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Thierry Reding
    Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
    Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")

    Tejun Heo
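
    Conceptually the fix is two small pieces, sketched from the commit
    text (flag and macro names as described above):

        /* legacy interface: always got WQ_MEM_RECLAIM, needed or not,
         * so tag such workqueues and let the checker ignore the tag */
        #define create_workqueue(name)                                  \
                alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

        /* ... and check_flush_dependency() only warns when the flushing
         * workqueue has WQ_MEM_RECLAIM but not __WQ_LEGACY */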
     

09 Dec, 2015

2 commits

  • Workqueue stalls can happen from a variety of usage bugs such as
    missing WQ_MEM_RECLAIM flag or concurrency managed work item
    indefinitely staying RUNNING. These stalls can be extremely difficult
    to hunt down because the usual warning mechanisms can't detect
    workqueue stalls and the internal state is pretty opaque.

    To alleviate the situation, this patch implements a workqueue lockup
    detector. It periodically monitors all worker_pools and, if any pool
    fails to make forward progress for longer than the threshold
    duration, triggers a warning and dumps workqueue state as follows.

    BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
    pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
    workqueue events_power_efficient: flags=0x80
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    pending: check_lifetime, neigh_periodic_work
    workqueue cgroup_pidlist_destroy: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    pending: cgroup_pidlist_destroy_work_fn
    ...

    The detection mechanism is controlled through the kernel parameter
    workqueue.watchdog_thresh and can be updated at runtime through the
    sysfs module parameter file.

    v2: Decoupled from softlockup control knobs.

    Signed-off-by: Tejun Heo
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Michal Hocko
    Cc: Chris Mason
    Cc: Andrew Morton

    Tejun Heo
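
    A compressed sketch of the detection loop described above (mainline
    implements it as a self-rearming timer; the per-pool timestamp below
    is a simplification of the real bookkeeping):

        static void wq_watchdog_timer_fn(unsigned long data)
        {
                unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
                struct worker_pool *pool;
                bool lockup_detected = false;
                int pi;

                rcu_read_lock();
                for_each_pool(pool, pi) {
                        /* pool->watchdog_ts advances whenever the pool
                         * makes forward progress */
                        if (time_after(jiffies,
                                       pool->watchdog_ts + thresh)) {
                                lockup_detected = true;
                                pr_emerg("BUG: workqueue lockup - pool stuck for %us!\n",
                                         jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000);
                        }
                }
                rcu_read_unlock();

                if (lockup_detected)
                        show_workqueue_state();

                mod_timer(&wq_watchdog_timer, jiffies + thresh / 2);
        }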
     
  • Task or work item involved in memory reclaim trying to flush a
    non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
    deadlock. Trigger WARN_ONCE() if such conditions are detected.

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra

    Tejun Heo
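
    The check boils down to comparing the flusher's reclaim status with
    the target workqueue, roughly (sketch of check_flush_dependency()):

        static void check_flush_dependency(struct workqueue_struct *target_wq,
                                           struct work_struct *target_work)
        {
                struct worker *worker;

                /* flushing into a reclaim-safe workqueue is always fine */
                if (target_wq->flags & WQ_MEM_RECLAIM)
                        return;

                worker = current_wq_worker();

                /* a task in memory reclaim must not wait on a
                 * !WQ_MEM_RECLAIM workqueue ... */
                WARN_ONCE(current->flags & PF_MEMALLOC,
                          "workqueue: PF_MEMALLOC task is flushing !WQ_MEM_RECLAIM %s\n",
                          target_wq->name);
                /* ... and neither may a WQ_MEM_RECLAIM work item */
                WARN_ONCE(worker &&
                          (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
                          "workqueue: WQ_MEM_RECLAIM %s is flushing !WQ_MEM_RECLAIM %s\n",
                          worker->current_pwq->wq->name, target_wq->name);
        }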
     

06 Nov, 2015

1 commit

  • Pull workqueue update from Tejun Heo:
    "This pull request contains one patch to make an unbound worker pool
    allocated from the NUMA node containing it if such node exists. As
    unbound worker pools are node-affine by default, this makes most pools
    allocated on the right node"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Allocate the unbound pool using local node memory

    Linus Torvalds
     

13 Oct, 2015

1 commit