19 Mar, 2016

1 commit


16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    These are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
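
    The ctags name patterns in scripts/tags.sh try to pull the per-CPU
    variable name out of DEFINE_PER_CPU*() declarations; when the capture
    group "\1" ends up empty for a given line, ctags prints the "null
    expansion" warning shown above. For illustration only, a couple of
    hypothetical declarations of the kind those patterns index (my_counter,
    my_stats and my_cpu_stats are made-up names, not from the affected
    files):

        /* Per-CPU definitions of the kind the tags.sh name patterns index. */
        DEFINE_PER_CPU(int, my_counter);
        DEFINE_PER_CPU_SHARED_ALIGNED(struct my_stats, my_cpu_stats);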
     

12 Mar, 2016

1 commit


02 Mar, 2016

1 commit


18 Feb, 2016

1 commit


11 Feb, 2016

1 commit

  • When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node. However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.

    This has always been broken but hasn't triggered often enough before
    874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
    After that commit, workqueue forcefully assigns the local CPU to
    delayed work items queued without an explicit target CPU in order to
    fix a different issue. This widens the window in which a CPU can go
    offline while a delayed work item is pending, causing delayed work
    items to be dispatched with their target CPU set to an already
    offlined CPU. The resulting
    NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
    NULL pool_workqueue and thus crash.

    While 874bbfe600a6 has been reverted for a different reason, making the
    bug less visible again, it can still happen. Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node(),
    as sketched below.
    This is a temporary workaround. The long term solution is keeping CPU
    -> NODE mapping stable across CPU off/online cycles which is being
    worked on.

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Cc: Tang Chen
    Cc: Rafael J. Wysocki
    Cc: Len Brown
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

    Tejun Heo
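
    A minimal sketch of the workaround described above, assuming the
    unbound_pwq_by_node() helper and the wq->dfl_pwq / wq->numa_pwq_tbl[]
    fields referenced elsewhere in this log; not the verbatim patch:

        static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                           int node)
        {
                /*
                 * An offlined CPU may map to NUMA_NO_NODE; fall back to the
                 * default pwq instead of indexing numa_pwq_tbl[] with it.
                 */
                if (unlikely(node == NUMA_NO_NODE))
                        return wq->dfl_pwq;

                return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }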
     

10 Feb, 2016

3 commits

  • Workqueue used to guarantee local execution for work items queued
    without an explicit target CPU. The guarantee is now gone, which can
    break some usages in subtle ways. To flush out those cases, this
    patch implements a debug feature which forces round-robin CPU
    selection for all such work items.

    The debug feature defaults to off and can be enabled with a kernel
    parameter. The default can be flipped with a debug config option.

    If you hit this commit during bisection, please refer to 041bd12e272c
    ("Revert "workqueue: make sure delayed work run in local cpu"") for
    more information and ping me.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • WORK_CPU_UNBOUND work items queued to a bound workqueue always run
    locally. This is a good thing normally, but not when the user has
    asked us to keep unbound work away from certain CPUs. Round robin
    these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
    trumps performance.

    tj: Cosmetic and comment changes. The WARN_ON_ONCE() for an empty
    (wq_unbound_cpumask AND cpu_online_mask) was dropped; if we want that
    warning, it should be issued when the configuration changes.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo

    Mike Galbraith
     
  • This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

    Workqueue used to implicitly guarantee that work items queued without
    explicit CPU specified are put on the local CPU. Recent changes in
    timer broke the guarantee and led to vmstat breakage which was fixed
    by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU
    we need it to run on").

    vmstat is the most likely to expose the issue and it's quite possible
    that there are other similar problems which are a lot more difficult
    to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
    sure delayed work run in local cpu") was applied to restore the local
    CPU guarantee. Unfortunately, the change exposed a bug in timer code
    which got fixed by 22b886dd1018 ("timers: Use proper base migration in
    add_timer_on()"). Due to code restructuring, the commit couldn't be
    backported beyond a certain point, and stable kernels which only had
    874bbfe600a6 started crashing.

    The local CPU guarantee was accidental more than anything else and we
    want to get rid of it anyway. As, with the vmstat case fixed,
    874bbfe600a6 is causing more problems than it's fixing, it has been
    decided to take the chance and officially break the guarantee by
    reverting the commit. A debug feature will be added to force foreign
    CPU assignment to expose cases relying on the guarantee, and fixes for
    the individual cases will be backported to stable as necessary. (A
    queueing sketch follows this group of entries.)

    Signed-off-by: Tejun Heo
    Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
    Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
    Cc: stable@vger.kernel.org
    Cc: Mike Galbraith
    Cc: Henrique de Moraes Holschuh
    Cc: Daniel Bilik
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Sasha Levin
    Cc: Ben Hutchings
    Cc: Thomas Gleixner
    Cc: Daniel Bilik
    Cc: Jiri Slaby
    Cc: Michal Hocko

    Tejun Heo
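
    The common thread in the three entries above is that queueing without an
    explicit CPU no longer implies local execution. A rough illustration of
    the distinction at the queueing-API level (my_wq, my_work and my_work_fn
    are made-up names; the calls are the regular workqueue interfaces):

        #include <linux/workqueue.h>
        #include <linux/smp.h>

        static struct workqueue_struct *my_wq;      /* made-up example workqueue */
        static struct work_struct my_work;          /* made-up example work item */

        static void my_work_fn(struct work_struct *work)
        {
                /* ... */
        }

        static void queueing_examples(void)
        {
                INIT_WORK(&my_work, my_work_fn);

                /*
                 * No explicit CPU: the workqueue may pick any eligible CPU
                 * (round-robin under the new debug option, wq_unbound_cpumask
                 * for unbound work), so don't rely on local execution here.
                 */
                queue_work(my_wq, &my_work);

                /*
                 * Work that genuinely must run on a particular CPU has to
                 * say so explicitly.
                 */
                queue_work_on(raw_smp_processor_id(), my_wq, &my_work);
        }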
     

30 Jan, 2016

1 commit

  • fca839c00a12 ("workqueue: warn if memory reclaim tries to flush
    !WQ_MEM_RECLAIM workqueue") implemented a flush dependency warning which
    triggers if a PF_MEMALLOC task or a WQ_MEM_RECLAIM workqueue tries to
    flush a !WQ_MEM_RECLAIM workqueue.

    The check assumes that workqueues marked WQ_MEM_RECLAIM sit in the
    memory reclaim path, so making them depend on something which may need
    more memory to make forward progress can lead to deadlocks.
    Unfortunately, workqueues created with the legacy create*_workqueue()
    interface always have WQ_MEM_RECLAIM regardless of whether memory
    reclaim actually depends on them. These spurious WQ_MEM_RECLAIM
    markings cause spurious triggering of the flush dependency checks.

    WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
    workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ...
    Workqueue: deferwq deferred_probe_work_func
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x94/0xd4)
    [] (dump_stack) from [] (warn_slowpath_common+0x80/0xb0)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x30/0x40)
    [] (warn_slowpath_fmt) from [] (check_flush_dependency+0x138/0x144)
    [] (check_flush_dependency) from [] (flush_work+0x50/0x15c)
    [] (flush_work) from [] (lru_add_drain_all+0x130/0x180)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd8/0x338)
    [] (alloc_contig_range) from [] (cma_alloc+0xe0/0x1ac)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (__dma_alloc+0x240/0x278)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x54/0x5c)
    [] (arm_dma_alloc) from [] (dmam_alloc_coherent+0xc0/0xec)
    [] (dmam_alloc_coherent) from [] (ahci_port_start+0x150/0x1dc)
    [] (ahci_port_start) from [] (ata_host_start.part.3+0xc8/0x1c8)
    [] (ata_host_start.part.3) from [] (ata_host_activate+0x50/0x148)
    [] (ata_host_activate) from [] (ahci_host_activate+0x44/0x114)
    [] (ahci_host_activate) from [] (ahci_platform_init_host+0x1d8/0x3c8)
    [] (ahci_platform_init_host) from [] (tegra_ahci_probe+0x448/0x4e8)
    [] (tegra_ahci_probe) from [] (platform_drv_probe+0x50/0xac)
    [] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
    [] (driver_probe_device) from [] (bus_for_each_drv+0x60/0x94)
    [] (bus_for_each_drv) from [] (__device_attach+0xb0/0x114)
    [] (__device_attach) from [] (bus_probe_device+0x84/0x8c)
    [] (bus_probe_device) from [] (deferred_probe_work_func+0x68/0x98)
    [] (deferred_probe_work_func) from [] (process_one_work+0x120/0x3f8)
    [] (process_one_work) from [] (worker_thread+0x38/0x55c)
    [] (worker_thread) from [] (kthread+0xdc/0xf4)
    [] (kthread) from [] (ret_from_fork+0x14/0x3c)

    Fix it by marking workqueues created via create*_workqueue() with
    __WQ_LEGACY and disabling flush dependency checks on them.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Thierry Reding
    Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
    Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")

    Tejun Heo
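
    A short sketch of the distinction the fix relies on, using the real
    alloc_workqueue()/create_workqueue() interfaces but a made-up driver
    workqueue (example_wq and example_setup are illustrative names):

        #include <linux/workqueue.h>

        static struct workqueue_struct *example_wq; /* made-up name */

        static int example_setup(void)
        {
                /*
                 * Modern interface: only pass WQ_MEM_RECLAIM when the
                 * workqueue really sits in the memory reclaim path, so the
                 * flush dependency check stays meaningful.
                 */
                example_wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);
                if (!example_wq)
                        return -ENOMEM;
                return 0;
        }

        /*
         * Legacy interface: create_workqueue("name") always implies
         * WQ_MEM_RECLAIM, which is what made the check fire spuriously;
         * such workqueues are now additionally marked __WQ_LEGACY so the
         * flush dependency check skips them.
         */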
     

08 Jan, 2016

1 commit


09 Dec, 2015

2 commits

  • Workqueue stalls can happen from a variety of usage bugs, such as a
    missing WQ_MEM_RECLAIM flag or a concurrency-managed work item staying
    RUNNING indefinitely. These stalls can be extremely difficult
    to hunt down because the usual warning mechanisms can't detect
    workqueue stalls and the internal state is pretty opaque.

    To alleviate the situation, this patch implements a workqueue lockup
    detector. It monitors all worker_pools periodically and, if any pool
    fails to make forward progress for longer than the threshold duration,
    triggers a warning and dumps workqueue state as follows.

    BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
    pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
    workqueue events_power_efficient: flags=0x80
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    pending: check_lifetime, neigh_periodic_work
    workqueue cgroup_pidlist_destroy: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    pending: cgroup_pidlist_destroy_work_fn
    ...

    The detection mechanism is controlled through the kernel parameter
    workqueue.watchdog_thresh and can be updated at runtime through the
    sysfs module parameter file.

    v2: Decoupled from softlockup control knobs.

    Signed-off-by: Tejun Heo
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Michal Hocko
    Cc: Chris Mason
    Cc: Andrew Morton

    Tejun Heo
     
  • Task or work item involved in memory reclaim trying to flush a
    non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
    deadlock. Trigger WARN_ONCE() if such conditions are detected.

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra

    Tejun Heo
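
    A simplified sketch of the condition the second entry above describes
    (the real check in kernel/workqueue.c also knows whether the flusher is
    a WQ_MEM_RECLAIM worker, not just a PF_MEMALLOC task; the function name
    here is illustrative):

        static void check_flush_dependency_sketch(struct workqueue_struct *target_wq)
        {
                /* Flushing a reclaim-safe workqueue is always fine. */
                if (target_wq->flags & WQ_MEM_RECLAIM)
                        return;

                /*
                 * Someone in the reclaim path is waiting on a workqueue that
                 * may itself need memory to make progress: possible deadlock.
                 */
                WARN_ONCE(current->flags & PF_MEMALLOC,
                          "workqueue: PF_MEMALLOC task %d (%s) is flushing !WQ_MEM_RECLAIM %s\n",
                          task_pid_nr(current), current->comm, target_wq->name);
        }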
     

06 Nov, 2015

1 commit

  • Pull workqueue update from Tejun Heo:
    "This pull request contains one patch to make an unbound worker pool
    allocated from the NUMA node containing it if such node exists. As
    unbound worker pools are node-affine by default, this makes most pools
    allocated on the right node"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Allocate the unbound pool using local node memory

    Linus Torvalds
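
    A one-line sketch of what "allocated from the NUMA node containing it"
    plausibly looks like (pool and target_node are illustrative; the real
    change lives inside the worker-pool setup code):

            /* Allocate the pool's bookkeeping on the node it will serve,
             * not on whatever node the allocating CPU happens to be on. */
            pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_node);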
     

13 Oct, 2015

1 commit


01 Oct, 2015

1 commit

  • My system keeps crashing with the message below. vmstat_update() schedules
    delayed work on the current CPU and expects the work to run on that CPU.
    schedule_delayed_work() is expected to make delayed work run on the local
    CPU. The problem is that the timer can be migrated with NO_HZ:
    __queue_work() queues the work from the timer handler, which may run on a
    CPU other than the one where the delayed work was scheduled. The end
    result is that the delayed work runs on a different CPU. The patch makes
    __queue_delayed_work() record the local CPU earlier, so where the timer
    runs no longer changes where the work runs.

    [ 28.010131] ------------[ cut here ]------------
    [ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
    [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [ 28.011860] Modules linked in:
    [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W 4.3.0-rc3+ #634
    [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
    [ 28.014160] Workqueue: events vmstat_update
    [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
    [ 28.015445] RIP: 0010:[] [] vmstat_update+0x31/0x80
    [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
    [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
    [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
    [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
    [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
    [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
    [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
    [ 28.020071] Stack:
    [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
    [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
    [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
    [ 28.020071] Call Trace:
    [ 28.020071] [] process_one_work+0x1c8/0x540
    [ 28.020071] [] ? process_one_work+0x14b/0x540
    [ 28.020071] [] worker_thread+0x114/0x460
    [ 28.020071] [] ? process_one_work+0x540/0x540
    [ 28.020071] [] kthread+0xf8/0x110
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200
    [ 28.020071] [] ret_from_fork+0x3f/0x70
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v2.6.31+

    Shaohua Li
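
    A simplified sketch of the shape of the fix described above (dwork->cpu
    is the field the timer handler later passes to __queue_work(); context
    around these lines is elided):

            /*
             * Resolve WORK_CPU_UNBOUND to the current CPU before arming the
             * timer, so even if the timer fires on another CPU the work is
             * still queued where schedule_delayed_work() was called.
             */
            if (cpu == WORK_CPU_UNBOUND)
                    cpu = raw_smp_processor_id();
            dwork->cpu = cpu;
            timer->expires = jiffies + delay;
            add_timer_on(timer, cpu);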
     

02 Sep, 2015

1 commit


01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

12 Aug, 2015

1 commit

  • Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
    without locks, a caller might observe an old value and race with the
    set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
    it:

    __kthread_bind()                      |
      do_set_cpus_allowed()               |
                                          | sched_setaffinity()
                                          |   if (p->flags & PF_NO_SETAFFINITY)
                                          |   set_cpus_allowed_ptr()
      p->flags |= PF_NO_SETAFFINITY       |

    Fix the bug by putting everything under the regular scheduler locks.

    This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dedekind1@gmail.com
    Cc: juri.lelli@arm.com
    Cc: mgorman@suse.de
    Cc: riel@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
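
    A simplified sketch of the direction of the fix (the exact locking in
    __kthread_bind() is more involved, but the point is that the affinity
    update and the PF_NO_SETAFFINITY flag become visible together under a
    scheduler lock; p, mask and flags are context from the caller):

            raw_spin_lock_irqsave(&p->pi_lock, flags);
            do_set_cpus_allowed(p, mask);
            p->flags |= PF_NO_SETAFFINITY;
            raw_spin_unlock_irqrestore(&p->pi_lock, flags);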
     

05 Aug, 2015

1 commit

  • Commit 37b1ef31a568fc02e53587620226e5f3c66454c8 ("workqueue: move
    flush_scheduled_work() to workqueue.h") moved the exported non-GPL
    flush_scheduled_work() from a function to an inline wrapper.
    Unfortunately, it directly calls flush_workqueue(), which is a GPL-only
    function. This has the effect of changing the licensing requirement for
    this function and makes it unavailable to non-GPL modules.

    See commit ad7b1f841f8a54c6d61ff181451f55b68175e15a ("workqueue: Make
    schedule_work() available again to non GPL modules") for precedent.

    Signed-off-by: Tim Gardner
    Signed-off-by: Tejun Heo

    Tim Gardner
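
    A sketch of the situation the commit describes: the non-GPL
    flush_scheduled_work() became an inline wrapper around a GPL-exported
    function (simplified):

        /* include/linux/workqueue.h, after 37b1ef31a568 (sketch) */
        static inline void flush_scheduled_work(void)
        {
                flush_workqueue(system_wq);     /* flush_workqueue() is GPL-exported */
        }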
     

23 Jul, 2015

1 commit


02 Jul, 2015

1 commit

  • Pull module updates from Rusty Russell:
    "Main excitement here is Peter Zijlstra's lockless rbtree optimization
    to speed module address lookup. He found some abusers of the module
    lock doing that too.

    A little bit of parameter work here too; including Dan Streetman's
    breaking up the big param mutex so writing a parameter can load
    another module (yeah, really). Unfortunately that broke the usual
    suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
    appended too"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
    modules: only use mod->param_lock if CONFIG_MODULES
    param: fix module param locks when !CONFIG_SYSFS.
    rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
    module: add per-module param_lock
    module: make perm const
    params: suppress unused variable error, warn once just in case code changes.
    modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
    kernel/module.c: avoid ifdefs for sig_enforce declaration
    kernel/workqueue.c: remove ifdefs over wq_power_efficient
    kernel/params.c: export param_ops_bool_enable_only
    kernel/params.c: generalize bool_enable_only
    kernel/module.c: use generic module param operaters for sig_enforce
    kernel/params: constify struct kernel_param_ops uses
    sysfs: tightened sysfs permission checks
    module: Rework module_addr_{min,max}
    module: Use __module_address() for module_address_lookup()
    module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
    module: Optimize __module_address() using a latched RB-tree
    rbtree: Implement generic latch_tree
    seqlock: Introduce raw_read_seqcount_latch()
    ...

    Linus Torvalds
     

29 May, 2015

1 commit


28 May, 2015

1 commit


22 May, 2015

3 commits


20 May, 2015

2 commits

  • Currently, modification of attrs via sysfs is not fully synchronized.

    Process A (change cpumask)      | Process B (change numa affinity)
    wq_cpumask_store()              |
      wq_sysfs_prep_attrs()         |
                                    | apply_workqueue_attrs()
      apply_workqueue_attrs()       |

    The result is that Process B's operation is silently reverted without
    any notification, which is buggy behavior. So this patch moves
    wq_sysfs_prep_attrs() under the protection of wq_pool_mutex to ensure
    attrs changes are properly synchronized.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Applying attrs requires two locks: get_online_cpus() and wq_pool_mutex,
    and this code is duplicated at two places (apply_workqueue_attrs() and
    workqueue_set_unbound_cpumask()). So we separate out this locking
    code into apply_wqattrs_[un]lock() and do a minor refactor on
    apply_workqueue_attrs().

    The apply_wqattrs_[un]lock() helpers will also be used in a later patch
    to ensure attrs changes are properly synchronized; they are sketched
    after this group of entries.

    tj: minor updates to comments

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
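
    A minimal sketch of the locking helpers the second entry above
    describes, assuming the get_online_cpus()/wq_pool_mutex pairing
    mentioned there:

        static void apply_wqattrs_lock(void)
        {
                /* CPUs must not come or go while new attrs are applied. */
                get_online_cpus();
                mutex_lock(&wq_pool_mutex);
        }

        static void apply_wqattrs_unlock(void)
        {
                mutex_unlock(&wq_pool_mutex);
                put_online_cpus();
        }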
     

19 May, 2015

2 commits

  • wq_update_unbound_numa() is known to be called with wq_pool_mutex held.

    However, wq_update_unbound_numa() takes wq->mutex before reading
    wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq, even though these
    fields may now be read with only wq_pool_mutex held. So we simply remove
    the mutex_lock(&wq->mutex).

    Without the dependence on mutex_lock(&wq->mutex), the test of
    wq->unbound_attrs->no_numa can also be moved upward.

    The old code needed a long comment to describe the stability of
    @wq->unbound_attrs; since that is now also guaranteed by wq_pool_mutex,
    the comment is no longer needed.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Currently wq_pool_mutex doesn't protect attrs installation, with the
    result that ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be
    accessed under wq->mutex, which causes some inconvenience. For example,
    wq_update_unbound_numa() has to acquire wq->mutex before fetching
    wq->unbound_attrs->no_numa and the old_pwq.

    Attrs installation is a short operation, so this change will not add any
    latency for other operations which also acquire wq_pool_mutex.

    The only unprotected attrs-installation code is in apply_workqueue_attrs(),
    so this patch touches less code than comments.

    It is also a preparation patch for next several patches which read
    wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq with
    only wq_pool_mutex held.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
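
    A hypothetical assertion capturing the resulting rule: readers of
    wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq need any one of
    sched-RCU, wq->mutex or wq_pool_mutex (an illustration, not verbatim
    kernel code):

            WARN_ON_ONCE(!rcu_read_lock_sched_held() &&
                         !lockdep_is_held(&wq->mutex) &&
                         !lockdep_is_held(&wq_pool_mutex));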
     

13 May, 2015

1 commit


11 May, 2015

1 commit


30 Apr, 2015

1 commit

  • Allow modifying the low-level unbound workqueue cpumask through
    sysfs. This is performed by traversing the entire workqueue list
    and calling apply_wqattrs_prepare() on the unbound workqueues
    with the new low-level mask. Only after all the preparations are
    done are they committed together.

    Ordered workqueues are excluded from the low-level unbound workqueue
    cpumask for now; they will be handled in the near future.

    All the (default & per-node) pwqs are mandatorily constrained by
    the low-level cpumask. If the user-configured cpumask doesn't overlap
    with the low-level cpumask, the low-level cpumask will be used for
    the wq instead (see the sketch below).

    The comment of wq_calc_node_cpumask() is updated and explicitly
    requires that its first argument should be the attrs of the default
    pwq.

    The default wq_unbound_cpumask is cpu_possible_mask. The workqueue
    subsystem doesn't know its best default value, let the system manager
    or the other subsystem set it when needed.

    Changes from V8:
    - merge the code calculating the attrs of the default pwq.
    - minor changes to the code & comments for saving the user-configured attrs.
    - remove an unnecessary list_del().
    - minor update to the comment of wq_calc_node_cpumask().
    - update the comment of workqueue_set_unbound_cpumask().

    Cc: Christoph Lameter
    Cc: Kevin Hilman
    Cc: Lai Jiangshan
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: Viresh Kumar
    Cc: Frederic Weisbecker
    Original-patch-by: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
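
    A small sketch of the mask-combination rule described above (effective
    and requested are illustrative names for local cpumask variables):

            /*
             * The per-workqueue attrs cpumask is constrained by the
             * low-level wq_unbound_cpumask; if the two don't overlap, fall
             * back to the low-level mask entirely.
             */
            cpumask_and(effective, requested, wq_unbound_cpumask);
            if (cpumask_empty(effective))
                    cpumask_copy(effective, wq_unbound_cpumask);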
     

27 Apr, 2015

2 commits

  • Create a cpumask that limits the affinity of all unbound workqueues.
    This cpumask is controlled through a file at the root of the workqueue
    sysfs directory.

    It works at a lower level than the per-WQ_SYSFS workqueue cpumask files
    such that the effective cpumask applied for a given unbound workqueue is
    the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
    the new /sys/devices/virtual/workqueue/cpumask file.

    This patch implements the basic infrastructure and the read interface.
    wq_unbound_cpumask is initially set to cpu_possible_mask.

    Cc: Christoph Lameter
    Cc: Kevin Hilman
    Cc: Lai Jiangshan
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Frederic Weisbecker
     
  • Currently apply_workqueue_attrs() includes pwq allocation and pwq
    installation, so when we batch multiple apply_workqueue_attrs() calls
    as a transaction, we can't ensure the transaction succeeds or fails as
    a complete unit.

    To solve this, we split apply_workqueue_attrs() into three stages
    (sketched after this group of entries).
    The first stage does the preparation: allocating memory and pwqs.
    The second stage does the attrs installation and pwqs installation.
    The third stage frees the allocated memory and the (old or unused) pwqs.

    As a result, batching multiple apply_workqueue_attrs() calls can
    succeed or fail as a complete unit:
    1) batch do all the first stage for all the workqueues
    2) only commit all when all the above succeed.

    This patch is a preparation for the next patch ("Allow modifying low level
    unbound workqueue cpumask") which will do a multiple apply_workqueue_attrs().

    The patch doesn't change functionality except for two minor adjustments:
    1) free_unbound_pwq() for the error path is removed; we use the
    heavier put_pwq_unlocked() instead since the error path is rare.
    This adjustment simplifies the code.
    2) The memory allocation is also moved under wq_pool_mutex.
    This is needed to avoid further splitting.

    tj: minor updates to comments.

    Suggested-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Kevin Hilman
    Cc: Lai Jiangshan
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: Viresh Kumar
    Cc: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
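
    A sketch of the three-stage flow the second entry above describes
    (function and context names follow the description; error handling and
    locking are elided):

            struct apply_wqattrs_ctx *ctx;

            /* Stage 1: do everything that can fail (allocate memory, pwqs). */
            ctx = apply_wqattrs_prepare(wq, attrs);
            if (!ctx)
                    return -ENOMEM;

            /* Stage 2: install the new attrs and pwqs; this cannot fail, so
             * a batch of prepared workqueues commits as one unit. */
            apply_wqattrs_commit(ctx);

            /* Stage 3: free the old or unused pwqs and the temporary context. */
            apply_wqattrs_cleanup(ctx);
            return 0;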
     

06 Apr, 2015

1 commit

  • The sysfs code usually belongs at the bottom of the file since it deals
    with high-level objects. In the workqueue code it's misplaced, and as a
    result we'd need to work around function references to allow the sysfs
    code to call APIs like apply_workqueue_attrs().

    Let's move that block further down in the file, almost to the bottom.

    Also declare workqueue_sysfs_unregister() just before destroy_workqueue(),
    which references it.

    tj: Moved workqueue_sysfs_unregister() forward declaration where other
    forward declarations are.

    Suggested-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Kevin Hilman
    Cc: Lai Jiangshan
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Frederic Weisbecker
     

09 Mar, 2015

3 commits

  • Workqueues are used extensively throughout the kernel but sometimes
    it's difficult to debug stalls involving work items because visibility
    into their inner workings is fairly limited. Although the sysrq-t task
    dump annotates each active worker task with information on the work
    item being executed, it is challenging to find out which work items
    are pending or delayed on which queues and how pools are being
    managed.

    This patch implements show_workqueue_state() which dumps all busy
    workqueues and pools and is called from the sysrq-t handler. At the
    end of sysrq-t dump, something like the following is printed.

    Showing busy workqueues and worker pools:
    ...
    workqueue filler_wq: flags=0x0
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
    in-flight: 491:filler_workfn, 507:filler_workfn
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    in-flight: 501:filler_workfn
    pending: filler_workfn
    ...
    workqueue test_wq: flags=0x8
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
    in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
    delayed: test_workfn1 BAR(492), test_workfn2
    ...
    pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
    pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
    pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
    pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62

    The above shows that test_wq is executing test_workfn() on pid 510
    which is the rescuer and also that there are two tasks 69 and 500
    waiting for the work item to finish in flush_work(). As test_wq has
    max_active of 1, there are two work items for test_workfn1() and
    test_workfn2() which are delayed till the current work item is
    finished. In addition, pid 492 is flushing test_workfn1().

    The work item for test_workfn() is being executed on pwq of pool 2
    which is the normal priority per-cpu pool for CPU 1. The pool has
    three workers, two of which are executing filler_workfn() for
    filler_wq and the last one is assuming the manager role trying to
    create more workers.

    This extra workqueue state dump will hopefully help chasing down hangs
    involving workqueues.

    v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.

    v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
    printk()'s replaced with pr_info()'s, and cpumask printing now
    uses cpulist_pr_cont().

    Signed-off-by: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    CC: Ingo Molnar

    Tejun Heo
     
  • Add wq_barrier->task and worker_pool->manager to keep track of the
    flushing task and pool manager respectively. These are purely
    informational and will be used to implement sysrq dump of workqueues.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • The workqueues list is protected by wq_pool_mutex, and a workqueue and
    its subordinate data structures are freed directly on destruction. We
    want to add the ability to dump workqueues from a sysrq callback, which
    requires walking all workqueues without grabbing wq_pool_mutex. This
    patch makes freeing of workqueues RCU protected and makes the
    workqueues list walkable while holding the RCU read lock.

    Note that pool_workqueues and pools are already sched-RCU protected.
    For consistency, workqueues are also protected with sched-RCU.

    While at it, reverse the workqueues list so that a workqueue created
    earlier comes first. The order of the list isn't functionally
    significant, but this makes the planned sysrq dump list the system
    workqueues first.

    Signed-off-by: Tejun Heo

    Tejun Heo
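
    A sketch of the kind of walk the RCU change enables (the list head and
    member names follow kernel/workqueue.c; the per-workqueue dump body is
    elided):

            struct workqueue_struct *wq;

            rcu_read_lock_sched();          /* workqueues are sched-RCU protected */
            list_for_each_entry_rcu(wq, &workqueues, list) {
                    /* ... print this workqueue's busy pwqs, as show_workqueue_state() does ... */
            }
            rcu_read_unlock_sched();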