12 Oct, 2016

1 commit

  • Patch series "kthread: Kthread worker API improvements"

    The intention of this patchset is to make it easier to manipulate and
    maintain kthreads. In particular, I want to replace all the custom main
    loops with a generic one. I also want to make kthreads sleep in a
    consistent state, in a common place, when there is no work.

    This patch (of 11):

    A good practice is to prefix the names of functions by the name of the
    subsystem.

    This patch fixes the name of probe_kthread_data(). The other
    incorrectly named functions are part of the kthread worker API and
    will be fixed separately.

    Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Suggested-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

29 Aug, 2016

1 commit


30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 lines of impenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • * pm-sleep:
    PM / hibernate: Introduce test_resume mode for hibernation
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    PM / hibernate: Recycle safe pages after image restoration
    PM / hibernate: Simplify mark_unsafe_pages()
    PM / hibernate: Do not free preallocated safe pages during image restore
    PM / suspend: show workqueue state in suspend flow
    PM / sleep: make PM notifiers called symmetrically
    PM / sleep: Make pm_prepare_console() return void
    PM / Hibernate: Don't let kasan instrument snapshot.c

    * pm-tools:
    PM / tools: scripts: AnalyzeSuspend v4.2
    tools/turbostat: allow user to alter DESTDIR and PREFIX

    Rafael J. Wysocki
     

14 Jul, 2016

1 commit

  • Get rid of the prio ordering of the separate notifiers and use a proper state
    callback pair.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Acked-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Nicolas Iooss
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Rusty Russell
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

02 Jul, 2016

1 commit


17 Jun, 2016

1 commit

  • With commit e9d867a67fd03ccc ("sched: Allow per-cpu kernel threads to
    run on online && !active"), __set_cpus_allowed_ptr() expects that only
    strict per-cpu kernel threads can have affinity to an online CPU which
    is not yet active.

    This assumption is currently broken in the CPU_ONLINE notification
    handler for the workqueues, where restore_unbound_workers_cpumask()
    calls set_cpus_allowed_ptr() when the first CPU in the unbound
    worker's pool->attr->cpumask comes online. Since
    set_cpus_allowed_ptr() is called with pool->attr->cpumask, in which
    the only online CPU is not yet active, we get the following
    WARN_ON during a CPU online operation.

    ------------[ cut here ]------------
    WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
    __set_cpus_allowed_ptr+0x228/0x2e0
    Modules linked in:
    CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4

    Call Trace:
    [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
    [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
    [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
    [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
    [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
    [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
    [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
    [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
    [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
    [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
    ---[ end trace 00f1456578b2a3b2 ]---

    This patch fixes this by limiting the mask to the intersection of
    the pool affinity and online CPUs.

    Changelog-cribbed-from: Gautham R. Shenoy
    Reported-by: Abdul Haleem
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Peter Zijlstra
     

20 May, 2016

2 commits

  • When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object, the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of this
    in its fixup callbacks. We can instead put it into the debugobjects
    core, which saves duplicated code and leaves the fixup callbacks *pure*.

    To achieve this, a new callback "is_static_object" is introduced to let
    the type-specific code decide whether an object is static or not. If it
    is, we take it into the object tracker; otherwise we give a warning and
    invoke the fixup callback.

    This change has passed the debugobjects selftest, and I have also done
    some testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can a fixup change an
    object that is in an incorrect state? Because 'addr' may not point
    to any valid object if a non-static object is not tracked, changing
    such an object can overwrite someone else's memory and cause unexpected
    behaviour. For example, timer_fixup_activate binds the timer to the
    function stub_timer.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to
    the change in (debugobjects: make fixup functions return bool instead
    of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     

14 May, 2016

1 commit

  • Pull workqueue fix from Tejun Heo:
    "CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
    DOWN_PREPARE which can trigger a WARN_ON() in workqueue.

    The bug has been there for a very long time. It only triggers if CPU
    down fails at a specific point and I don't think it has adverse
    effects other than the warning messages. The fix is very low impact"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix rebind bound workers warning

    Linus Torvalds
     

13 May, 2016

1 commit

  • ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
    Modules linked in:
    CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
    Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
    0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
    0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
    ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
    Call Trace:
    dump_stack+0x89/0xd4
    __warn+0xfd/0x120
    warn_slowpath_null+0x1d/0x20
    rebind_workers+0x1c0/0x1d0
    workqueue_cpu_up_callback+0xf5/0x1d0
    notifier_call_chain+0x64/0x90
    ? trace_hardirqs_on_caller+0xf2/0x220
    ? notify_prepare+0x80/0x80
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x35/0x50
    notify_down_prepare+0x5e/0x80
    ? notify_prepare+0x80/0x80
    cpuhp_invoke_callback+0x73/0x330
    ? __schedule+0x33e/0x8a0
    cpuhp_down_callbacks+0x51/0xc0
    cpuhp_thread_fun+0xc1/0xf0
    smpboot_thread_fn+0x159/0x2a0
    ? smpboot_create_threads+0x80/0x80
    kthread+0xef/0x110
    ? wait_for_completion+0xf0/0x120
    ? schedule_tail+0x35/0xf0
    ret_from_fork+0x22/0x50
    ? __init_kthread_worker+0x70/0x70
    ---[ end trace eb12ae47d2382d8f ]---
    notify_down_prepare: attempt to take down CPU 0 failed

    This bug can be reproduced with the config below, with nohz_full= covering all CPUs:

    CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
    CONFIG_DEBUG_HOTPLUG_CPU0=y
    CONFIG_NO_HZ_FULL=y

    As Thomas pointed out:

    | If a down prepare callback fails, then DOWN_FAILED is invoked for all
    | callbacks which have successfully executed DOWN_PREPARE.
    |
    | But, workqueue has actually two notifiers. One which handles
    | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
    |
    | Now look at the priorities of those callbacks:
    |
    | CPU_PRI_WORKQUEUE_UP = 5
    | CPU_PRI_WORKQUEUE_DOWN = -5
    |
    | So the call order on DOWN_PREPARE is:
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Ignores DOWN_PREPARE
    | CB ...
    | CB X ---> Fails
    |
    | So we call up to CB X with DOWN_FAILED
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Handles DOWN_FAILED
    | CB ...
    | CB X-1
    |
    | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
    | callback, while it should do it in the down callback. Which is not a good idea
    | either because it wants to be called early on rollback...
    |
    | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
    | the callbacks become symmetric, but for the existing mess, we need some
    | workaround in the workqueue code.

    The boot CPU handles housekeeping duty (unbound timers, workqueues,
    timekeeping, ...) on behalf of full dynticks CPUs. It must remain
    online when nohz full is enabled. Each notifier_block has a
    priority:

    workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

    So when the tick_nohz_cpu_down callback fails during DOWN_PREPARE of
    CPU 0, the notifier_blocks behind tick_nohz_cpu_down are not called
    any more, which means the workers are never actually unbound. The
    hotplug state machine then falls back to undo and brings CPU 0 online
    again. Workers are rebound unconditionally even though they were never
    unbound, triggering the warning in the process.

    This patch fixes it by checking for !DISASSOCIATED to avoid rebinding
    already-bound workers.

    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frédéric Weisbecker
    Cc: stable@vger.kernel.org
    Suggested-by: Lai Jiangshan
    Signed-off-by: Wanpeng Li

    Wanpeng Li
     

28 Apr, 2016

1 commit

  • Pull workqueue fix from Tejun Heo:
    "So, it turns out we had a silly bug in the most fundamental part of
    workqueue for a very long time. AFAICS, this dates back to pre-git
    era and has quite likely been there from the time workqueue was first
    introduced.

    A work item uses its PENDING bit to synchronize multiple queuers.
    Anyone who wins the PENDING bit owns the pending state of the work
    item. Whether a queuer wins or loses the race, one thing should be
    guaranteed - there will soon be at least one execution of the work
    item after the queueing attempt - where "after" means that the
    execution instance would be able to see all the changes that the
    queuer has made prior to the queueing attempt.

    Unfortunately, we were missing a smp_mb() after clearing PENDING for
    execution, so nothing guaranteed visibility of the changes that a
    queueing loser has made, which manifested as a reproducible blk-mq
    stall.

    Lots of kudos to Roman for debugging the problem. The patch for
    -stable is the minimal one. For v4.7, Peter is working on a patch to
    make the code path slightly more efficient and less fragile"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix ghost PENDING flag while doing MQ IO

    Linus Torvalds
     

26 Apr, 2016

1 commit

  • A bug in the workqueue code leads to a stalled IO request in MQ ctx->rq_list
    with the following backtrace:

    [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
    [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
    [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
    [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
    [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
    [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
    [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
    [ 601.350965] Call Trace:
    [ 601.351203] [] ? bit_wait+0x60/0x60
    [ 601.351444] [] schedule+0x35/0x80
    [ 601.351709] [] schedule_timeout+0x192/0x230
    [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220
    [ 601.352208] [] ? ktime_get+0x37/0xa0
    [ 601.352446] [] ? bit_wait+0x60/0x60
    [ 601.352688] [] io_schedule_timeout+0xa4/0x110
    [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10
    [ 601.353196] [] bit_wait_io+0x1b/0x70
    [ 601.353440] [] __wait_on_bit+0x5d/0x90
    [ 601.353689] [] wait_on_page_bit+0xc0/0xd0
    [ 601.353958] [] ? autoremove_wake_function+0x40/0x40
    [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140
    [ 601.354441] [] filemap_fdatawait_range+0x14/0x30
    [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70
    [ 601.354932] [] blkdev_fsync+0x1b/0x50
    [ 601.355193] [] vfs_fsync_range+0x49/0xa0
    [ 601.355432] [] blkdev_write_iter+0xca/0x100
    [ 601.355679] [] __vfs_write+0xaa/0xe0
    [ 601.355925] [] vfs_write+0xa9/0x1a0
    [ 601.356164] [] kernel_write+0x38/0x50

    The underlying device is a null_blk, with default parameters:

    queue_mode = MQ
    submit_queues = 1

    Verification that nullb0 has something inflight:

    root@pserver8:~# cat /sys/block/nullb0/inflight
    0 1
    root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
    ...
    /sys/block/nullb0/mq/0/cpu2/rq_list
    CTX pending:
    ffff8838038e2400
    ...

    During debugging it became clear that the stalled request is always
    inserted into the rq_list from the following path:

    save_stack_trace_tsk + 34
    blk_mq_insert_requests + 231
    blk_mq_flush_plug_list + 281
    blk_flush_plug_list + 199
    wait_on_page_bit + 192
    __filemap_fdatawait_range + 228
    filemap_fdatawait_range + 20
    filemap_write_and_wait_range + 63
    blkdev_fsync + 27
    vfs_fsync_range + 73
    blkdev_write_iter + 202
    __vfs_write + 170
    vfs_write + 169
    kernel_write + 56

    So blk_flush_plug_list() was called with from_schedule == true.

    If from_schedule is true, that means that finally blk_mq_insert_requests()
    offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
    i.e. it calls kblockd_schedule_delayed_work_on().

    That means that we race with another CPU, which is about to execute
    the __blk_mq_run_hw_queue() work.

    Further debugging shows the following traces from different CPUs:

    CPU#0                                  CPU#1
    ----------------------------------     -------------------------------
    request A inserted
    STORE hctx->ctx_map[0] bit marked
    kblockd_schedule...() returns 1
                                           request B inserted
                                           STORE hctx->ctx_map[1] bit marked
                                           kblockd_schedule...() returns 0
    *** WORK PENDING bit is cleared ***
    flush_busy_ctxs() is executed, but
    bit 1, set by CPU#1, is not observed

    As a result, request B was pending forever.

    This behaviour can be explained by speculative LOAD of hctx->ctx_map on
    CPU#0, which is reordered with clear of PENDING bit and executed _before_
    actual STORE of bit 1 on CPU#1.

    The proper fix is an explicit full memory barrier, which guarantees
    that the clear of the PENDING bit is executed before all possible
    speculative LOADs or STOREs inside the actual work function.

    Signed-off-by: Roman Pen
    Cc: Gioh Kim
    Cc: Michael Wang
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Tejun Heo

    Roman Pen
     

19 Mar, 2016

1 commit


16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    These are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

12 Mar, 2016

1 commit


02 Mar, 2016

1 commit


18 Feb, 2016

1 commit


11 Feb, 2016

1 commit

  • When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node. However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.

    This has always been broken but hadn't triggered often enough before
    874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
    After that commit, workqueue forcefully assigns the local CPU to
    delayed work items that have no explicit target CPU, to fix a
    different issue. This widens the window in which a CPU can go offline
    while a delayed work item is pending, causing delayed work items to be
    dispatched with their target CPU set to an already-offlined CPU. The
    resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
    item on a NULL pool_workqueue and thus crash.

    While 874bbfe600a6 has been reverted for a different reason making the
    bug less visible again, it can still happen. Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
    This is a temporary workaround. The long term solution is keeping CPU
    -> NODE mapping stable across CPU off/online cycles which is being
    worked on.

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Cc: Tang Chen
    Cc: Rafael J. Wysocki
    Cc: Len Brown
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

    Tejun Heo
     

10 Feb, 2016

3 commits

  • Workqueue used to guarantee local execution for work items queued
    without explicit target CPU. The guarantee is gone now which can
    break some usages in subtle ways. To flush out those cases, this
    patch implements a debug feature which forces round-robin CPU
    selection for all such work items.

    The debug feature defaults to off and can be enabled with a kernel
    parameter. The default can be flipped with a debug config option.

    If you hit this commit during bisection, please refer to 041bd12e272c
    ("Revert "workqueue: make sure delayed work run in local cpu"") for
    more information and ping me.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • WORK_CPU_UNBOUND work items queued to a bound workqueue always run
    locally. This is a good thing normally, but not when the user has
    asked us to keep unbound work away from certain CPUs. Round robin
    these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
    trumps performance.

    tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty
    (wq_unbound_cpumask AND cpu_online_mask). If we want that, it
    should be done when config changes.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo

    Mike Galbraith
     
  • This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

    Workqueue used to implicitly guarantee that work items queued without
    an explicit CPU specified are put on the local CPU. Recent changes in
    the timer code broke the guarantee and led to vmstat breakage, which
    was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work
    on the CPU we need it to run on").

    vmstat is the most likely to expose the issue and it's quite possible
    that there are other similar problems which are a lot more difficult
    to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
    sure delayed work run in local cpu") was applied to restore the local
    CPU guarantee. Unfortunately, the change exposed a bug in the timer
    code which got fixed by 22b886dd1018 ("timers: Use proper base
    migration in add_timer_on()"). Due to code restructuring, that commit
    couldn't be backported beyond a certain point, and stable kernels
    which only had 874bbfe600a6 started crashing.

    The local CPU guarantee was accidental more than anything else and we
    want to get rid of it anyway. As, with the vmstat case fixed,
    874bbfe600a6 is causing more problems than it's fixing, it has been
    decided to take the chance and officially break the guarantee by
    reverting the commit. A debug feature will be added to force foreign
    CPU assignment to expose cases relying on the guarantee and fixes for
    the individual cases will be backported to stable as necessary.

    Signed-off-by: Tejun Heo
    Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
    Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
    Cc: stable@vger.kernel.org
    Cc: Mike Galbraith
    Cc: Henrique de Moraes Holschuh
    Cc: Daniel Bilik
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Sasha Levin
    Cc: Ben Hutchings
    Cc: Thomas Gleixner
    Cc: Daniel Bilik
    Cc: Jiri Slaby
    Cc: Michal Hocko

    Tejun Heo
     

30 Jan, 2016

1 commit

  • fca839c00a12 ("workqueue: warn if memory reclaim tries to flush
    !WQ_MEM_RECLAIM workqueue") implemented a flush dependency warning
    which triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries
    to flush a !WQ_MEM_RECLAIM workqueue.

    This assumes that workqueues marked with WQ_MEM_RECLAIM sit in the
    memory reclaim path, where making them depend on something which may
    need more memory to make forward progress can lead to deadlocks.
    Unfortunately, workqueues created with the legacy create*_workqueue()
    interface always have WQ_MEM_RECLAIM regardless of whether memory
    reclaim actually depends on them. These spurious WQ_MEM_RECLAIM
    markings cause spurious triggering of the flush dependency checks.

    WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
    workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ...
    Workqueue: deferwq deferred_probe_work_func
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x94/0xd4)
    [] (dump_stack) from [] (warn_slowpath_common+0x80/0xb0)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x30/0x40)
    [] (warn_slowpath_fmt) from [] (check_flush_dependency+0x138/0x144)
    [] (check_flush_dependency) from [] (flush_work+0x50/0x15c)
    [] (flush_work) from [] (lru_add_drain_all+0x130/0x180)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd8/0x338)
    [] (alloc_contig_range) from [] (cma_alloc+0xe0/0x1ac)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (__dma_alloc+0x240/0x278)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x54/0x5c)
    [] (arm_dma_alloc) from [] (dmam_alloc_coherent+0xc0/0xec)
    [] (dmam_alloc_coherent) from [] (ahci_port_start+0x150/0x1dc)
    [] (ahci_port_start) from [] (ata_host_start.part.3+0xc8/0x1c8)
    [] (ata_host_start.part.3) from [] (ata_host_activate+0x50/0x148)
    [] (ata_host_activate) from [] (ahci_host_activate+0x44/0x114)
    [] (ahci_host_activate) from [] (ahci_platform_init_host+0x1d8/0x3c8)
    [] (ahci_platform_init_host) from [] (tegra_ahci_probe+0x448/0x4e8)
    [] (tegra_ahci_probe) from [] (platform_drv_probe+0x50/0xac)
    [] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
    [] (driver_probe_device) from [] (bus_for_each_drv+0x60/0x94)
    [] (bus_for_each_drv) from [] (__device_attach+0xb0/0x114)
    [] (__device_attach) from [] (bus_probe_device+0x84/0x8c)
    [] (bus_probe_device) from [] (deferred_probe_work_func+0x68/0x98)
    [] (deferred_probe_work_func) from [] (process_one_work+0x120/0x3f8)
    [] (process_one_work) from [] (worker_thread+0x38/0x55c)
    [] (worker_thread) from [] (kthread+0xdc/0xf4)
    [] (kthread) from [] (ret_from_fork+0x14/0x3c)

    Fix it by marking workqueues created via create*_workqueue() with
    __WQ_LEGACY and disabling flush dependency checks on them.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Thierry Reding
    Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
    Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")

    Tejun Heo
     

08 Jan, 2016

1 commit


09 Dec, 2015

2 commits

  • Workqueue stalls can happen from a variety of usage bugs such as
    missing WQ_MEM_RECLAIM flag or concurrency managed work item
    indefinitely staying RUNNING. These stalls can be extremely difficult
    to hunt down because the usual warning mechanisms can't detect
    workqueue stalls and the internal state is pretty opaque.

    To alleviate the situation, this patch implements a workqueue lockup
    detector. It periodically monitors all worker_pools and, if any pool
    fails to make forward progress for longer than the threshold duration,
    triggers a warning and dumps the workqueue state as follows.

    BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
    pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
    workqueue events_power_efficient: flags=0x80
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    pending: check_lifetime, neigh_periodic_work
    workqueue cgroup_pidlist_destroy: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    pending: cgroup_pidlist_destroy_work_fn
    ...

    The detection mechanism is controlled through the kernel parameter
    workqueue.watchdog_thresh and can be updated at runtime through the
    sysfs module parameter file.

    v2: Decoupled from softlockup control knobs.

    Signed-off-by: Tejun Heo
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Michal Hocko
    Cc: Chris Mason
    Cc: Andrew Morton

    Tejun Heo
     
  • Task or work item involved in memory reclaim trying to flush a
    non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
    deadlock. Trigger WARN_ONCE() if such conditions are detected.

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra

    Tejun Heo
     

06 Nov, 2015

1 commit

  • Pull workqueue update from Tejun Heo:
    "This pull request contains one patch to make an unbound worker pool
    be allocated from the NUMA node containing it, if such a node exists.
    As unbound worker pools are node-affine by default, this makes most
    pools allocated on the right node"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Allocate the unbound pool using local node memory

    Linus Torvalds
     

13 Oct, 2015

1 commit


01 Oct, 2015

1 commit

  • My system keeps crashing with the message below. vmstat_update()
    schedules a delayed work on the current CPU and expects the work to
    run on that CPU. schedule_delayed_work() is expected to make delayed
    work run on the local CPU. The problem is that the timer can be
    migrated with NO_HZ. __queue_work() queues the work in the timer
    handler, which could run on a CPU other than the one where the delayed
    work was scheduled. The end result is that the delayed work runs on a
    different CPU. This patch makes __queue_delayed_work() record the
    local CPU earlier; where the timer runs no longer changes where the
    work runs.

    [ 28.010131] ------------[ cut here ]------------
    [ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
    [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [ 28.011860] Modules linked in:
    [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W 4.3.0-rc3+ #634
    [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
    [ 28.014160] Workqueue: events vmstat_update
    [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
    [ 28.015445] RIP: 0010:[] []vmstat_update+0x31/0x80
    [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
    [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
    [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
    [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
    [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
    [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
    [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
    [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
    [ 28.020071] Stack:
    [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
    [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
    [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
    [ 28.020071] Call Trace:
    [ 28.020071] [] process_one_work+0x1c8/0x540
    [ 28.020071] [] ? process_one_work+0x14b/0x540
    [ 28.020071] [] worker_thread+0x114/0x460
    [ 28.020071] [] ? process_one_work+0x540/0x540
    [ 28.020071] [] kthread+0xf8/0x110
    [ 28.020071] [] ?kthread_create_on_node+0x200/0x200
    [ 28.020071] [] ret_from_fork+0x3f/0x70
    [ 28.020071] [] ?kthread_create_on_node+0x200/0x200

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v2.6.31+

    Shaohua Li
     

02 Sep, 2015

1 commit


01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

12 Aug, 2015

1 commit

  • Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
    without locks, a caller might observe an old value and race with the
    set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
    it:

    __kthread_bind()
      do_set_cpus_allowed()
                                  sched_setaffinity()
                                    if (p->flags & PF_NO_SETAFFINITY) /* false */
                                    set_cpus_allowed_ptr()  /* undoes the bind */
      p->flags |= PF_NO_SETAFFINITY

    Fix the bug by putting everything under the regular scheduler locks.

    This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dedekind1@gmail.com
    Cc: juri.lelli@arm.com
    Cc: mgorman@suse.de
    Cc: riel@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Aug, 2015

1 commit

  • Commit 37b1ef31a568fc02e53587620226e5f3c66454c8 ("workqueue: move
    flush_scheduled_work() to workqueue.h") turned the exported non-GPL
    flush_scheduled_work() from a function into an inline wrapper.
    Unfortunately, the wrapper calls flush_workqueue() directly, which is a
    GPL-only function. This effectively changes the licensing requirement
    for flush_scheduled_work() and makes it unavailable to non-GPL modules.

    See commit ad7b1f841f8a54c6d61ff181451f55b68175e15a ("workqueue: Make
    schedule_work() available again to non GPL modules") for precedent.

    Signed-off-by: Tim Gardner
    Signed-off-by: Tejun Heo

    Tim Gardner
     

23 Jul, 2015

1 commit


02 Jul, 2015

1 commit

  • Pull module updates from Rusty Russell:
    "Main excitement here is Peter Zijlstra's lockless rbtree optimization
    to speed module address lookup. He found some abusers of the module
    lock doing that too.

    A little bit of parameter work here too; including Dan Streetman's
    breaking up the big param mutex so writing a parameter can load
    another module (yeah, really). Unfortunately that broke the usual
    suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
    appended too"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
    modules: only use mod->param_lock if CONFIG_MODULES
    param: fix module param locks when !CONFIG_SYSFS.
    rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
    module: add per-module param_lock
    module: make perm const
    params: suppress unused variable error, warn once just in case code changes.
    modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
    kernel/module.c: avoid ifdefs for sig_enforce declaration
    kernel/workqueue.c: remove ifdefs over wq_power_efficient
    kernel/params.c: export param_ops_bool_enable_only
    kernel/params.c: generalize bool_enable_only
    kernel/module.c: use generic module param operaters for sig_enforce
    kernel/params: constify struct kernel_param_ops uses
    sysfs: tightened sysfs permission checks
    module: Rework module_addr_{min,max}
    module: Use __module_address() for module_address_lookup()
    module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
    module: Optimize __module_address() using a latched RB-tree
    rbtree: Implement generic latch_tree
    seqlock: Introduce raw_read_seqcount_latch()
    ...

    Linus Torvalds
     

29 May, 2015

1 commit


28 May, 2015

1 commit


22 May, 2015

3 commits