04 Dec, 2012

1 commit

  • 8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
    0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
    megaraid - it allocated a work_struct, cast it to delayed_work and
    then passed that into queue_delayed_work().

    Previously, this was okay because a 0 @delay short-circuited to
    queue_work() before doing anything with the delayed_work. 8852aac25e
    moved the 0 @delay test into __queue_delayed_work() after the sanity
    check on the delayed_work, making megaraid trigger the BUG_ON().

    Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
    BUG_ON() from incorrect use of delayed work"), this patch converts the
    BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
    abusers, if there are more, trigger a warning but don't crash the
    machine.

    Signed-off-by: Tejun Heo
    Cc: Xiaotian Feng

    Tejun Heo
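    For illustration, a minimal sketch of the misuse pattern versus a
    correct setup - the structure and field names below are hypothetical,
    not megaraid's actual code:

    #include <linux/workqueue.h>

    /* Hypothetical driver state, only to illustrate the two patterns. */
    struct my_adapter {
        struct work_struct hotplug_work;    /* plain work item, no timer */
        struct delayed_work poll_dwork;     /* work item plus timer */
    };

    static void my_work_fn(struct work_struct *work)
    {
        /* ... */
    }

    static void queue_examples(struct my_adapter *a)
    {
        /* WRONG (the megaraid-style abuse): a work_struct is not a
         * delayed_work.  The cast only appeared to work while a 0 @delay
         * short-circuited straight to queue_work(). */
        queue_delayed_work(system_wq,
                           (struct delayed_work *)&a->hotplug_work, 0);

        /* RIGHT: use a real delayed_work, initialized as such. */
        INIT_DELAYED_WORK(&a->poll_dwork, my_work_fn);
        queue_delayed_work(system_wq, &a->poll_dwork, 0);
    }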
     

02 Dec, 2012

4 commits

  • Recently, workqueue code has gone through some changes and we found
    some bugs related to concurrency management operations happening on
    the wrong CPU. When a worker is concurrency managed
    (!WORKER_NOT_RUNNING), it should be bound to its associated cpu and
    woken up on that cpu. Add WARN_ON_ONCE() to verify this.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • The return type of work_busy() is unsigned int, but there is a return
    statement in work_busy() that returns the boolean value 'false'.
    This is not a problem, because 'false' is treated as '0'; however,
    fixing it makes the code more robust.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • 8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
    implemented mod_delayed_work[_on]() using the improved
    try_to_grab_pending(). The function is later used, among others, to
    replace [__]cancel_delayed_work() + queue_delayed_work() combinations.

    Unfortunately, a delayed_work item w/ zero @delay is handled slightly
    differently by mod_delayed_work_on() compared to
    queue_delayed_work_on(). The latter skips the timer altogether and
    directly queues it using queue_work_on() while the former schedules a
    timer which will expire on the closest tick. This means, when @delay
    is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
    makes the target item immediately executable while
    mod_delayed_work_on() may induce a delay of up to a full tick.

    This somewhat subtle difference breaks some of the converted users.
    e.g. block queue plugging uses delayed_work for deferred processing
    and uses mod_delayed_work_on() when the queue needs to be immediately
    unplugged. The above problem manifested as a noticeably higher number
    of context switches under certain circumstances.

    The difference in behavior was caused by missing special case handling
    for 0 delay in mod_delayed_work_on() compared to
    queue_delayed_work_on(). Joonsoo Kim posted a patch to add it -
    ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
    The patch was queued for 3.8 but it was described as an optimization
    and I missed that it was a correctness issue.

    As both queue_delayed_work_on() and mod_delayed_work_on() use
    __queue_delayed_work() for queueing, it seems that the better approach
    is to move the 0 delay special handling to the function instead of
    duplicating it in mod_delayed_work_on().

    Fix the problem by moving 0 delay special case handling from
    queue_delayed_work_on() to __queue_delayed_work(). This replaces
    Joonsoo's patch.

    [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Anders Kaseorg
    Reported-and-tested-by: Zlatko Calusic
    LKML-Reference:
    LKML-Reference:
    Cc: Joonsoo Kim

    Tejun Heo
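    As a usage-level sketch of the behavior being fixed (assuming a
    delayed_work already set up elsewhere with INIT_DELAYED_WORK()): with
    the 0-delay special case living in __queue_delayed_work(), both calls
    below make the item immediately executable instead of waiting for the
    next tick.

    #include <linux/workqueue.h>

    /* @dwork is assumed to be initialized with INIT_DELAYED_WORK(). */
    static void unplug_now(struct workqueue_struct *wq,
                           struct delayed_work *dwork)
    {
        /* After the fix, a zero @delay skips the timer inside
         * __queue_delayed_work(), so this no longer adds up to a tick. */
        mod_delayed_work(wq, dwork, 0);
    }

    static void unplug_now_direct(struct workqueue_struct *wq,
                                  struct delayed_work *dwork)
    {
        /* The immediate path queue_delayed_work_on() already took. */
        queue_work(wq, &dwork->work);
    }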
     
  • A rescue thread exiting while in TASK_INTERRUPTIBLE can lead to the
    task scheduling off, never to be seen again. In the case where this
    occurred, an exiting thread hit reiserfs' homebrew conditional resched
    while holding a mutex, bringing the box to its knees.

    PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush"
    #0 [ffff8808157e7670] schedule at ffffffff8143f489
    #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
    #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
    #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
    #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
    #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
    #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
    #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
    #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
    #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
    [exception RIP: kernel_thread_helper]
    RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18
    RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Mike Galbraith
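    For context, a generic sketch of the kthread pattern at issue (not the
    actual rescuer code): a thread that parks itself in TASK_INTERRUPTIBLE
    must put itself back to TASK_RUNNING before returning, or it can be
    scheduled off and never run again.

    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int example_thread(void *data)
    {
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);

            if (kthread_should_stop()) {
                /* The crucial step: never return while still in
                 * TASK_INTERRUPTIBLE. */
                __set_current_state(TASK_RUNNING);
                return 0;
            }

            schedule();    /* sleep until woken; running again on return */

            /* ... handle the wakeup ... */
        }
    }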
     

25 Oct, 2012

1 commit

  • 57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
    try_to_grab_pending()") made cancel_delayed_work() always return %true
    unless someone else is also trying to cancel the work item, which is
    broken - if the target work item is idle, the return value should be
    %false.

    try_to_grab_pending() indicates that the target work item was idle by
    a zero return value. Use it for the return value. Note that this
    brings cancel_delayed_work() in line with __cancel_work_timer() in
    return value handling.

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Tejun Heo
    LKML-Reference:

    Dan Magenheimer
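    A simplified sketch of the return-value convention described above -
    not the verbatim implementation; try_to_grab_pending() is an internal
    helper that lives in kernel/workqueue.c:

    #include <linux/errno.h>
    #include <linux/irqflags.h>
    #include <linux/workqueue.h>

    /* Sketch only: try_to_grab_pending() returns 1 if it stole a pending
     * item, 0 if the item was idle, and a negative errno when it must
     * retry or another cancellation is already in flight. */
    static bool cancel_delayed_work_sketch(struct delayed_work *dwork)
    {
        unsigned long flags;
        int ret;

        do {
            ret = try_to_grab_pending(&dwork->work, true, &flags);
        } while (ret == -EAGAIN);

        if (ret < 0)        /* someone else is cancelling it */
            return false;

        /* ... clear PENDING ... */
        local_irq_restore(flags);
        return ret;         /* 0: was idle, 1: cancelled a pending item */
    }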
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer which is executed in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
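    As a usage-level illustration of the mod_delayed_work[_on]() addition
    mentioned in the log above (the work item and callers here are made
    up, not from a specific converted driver):

    #include <linux/workqueue.h>

    static void poll_fn(struct work_struct *work)
    {
        /* ... */
    }
    static DECLARE_DELAYED_WORK(poll_work, poll_fn);

    /* Old pattern: open-coded cancel + queue. */
    static void reschedule_poll_old(unsigned long delay)
    {
        cancel_delayed_work(&poll_work);
        queue_delayed_work(system_wq, &poll_work, delay);
    }

    /* New pattern: mod_delayed_work() is to delayed_work what mod_timer()
     * is to timers - it (re)sets the timeout in one call. */
    static void reschedule_poll(unsigned long delay)
    {
        mod_delayed_work(system_wq, &poll_work, delay);
    }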
     

20 Sep, 2012

3 commits

  • workqueue_set_max_active() may increase ->max_active without
    activating delayed works and may make the activation order differ from
    the queueing order. Both aren't strictly bugs but the resulting
    behavior could be a bit odd.

    To make things more consistent, use the cwq_set_max_active() helper
    which immediately makes use of the newly increased max_active if there
    are delayed work items and also keeps the activation order.

    tj: Slight update to description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Using a helper instead of open code makes thaw_workqueues() clearer.
    The helper will also be used by the next patch.

    tj: Slight update to comment and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • The existing work_on_cpu() implementation is hugely inefficient. It
    creates a new kthread, executes that single function and then lets the
    kthread die on each invocation.

    Now that system_wq can handle concurrent executions, there's no
    advantage to doing this. Reimplement work_on_cpu() using system_wq,
    which makes it simpler and way more efficient.

    stable: While this isn't a fix in itself, it's needed to fix a
    workqueue related bug in cpufreq/powernow-k8. AFAICS, this
    shouldn't break other existing users.

    Signed-off-by: Tejun Heo
    Acked-by: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Bjorn Helgaas
    Cc: Len Brown
    Cc: Rafael J. Wysocki
    Cc: stable@vger.kernel.org

    Tejun Heo
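    A sketch of the approach described above, assuming the usual on-stack
    work item pattern (simplified, not the verbatim kernel code):

    #include <linux/kernel.h>
    #include <linux/workqueue.h>

    /* Sketch: run fn(arg) on @cpu via system_wq instead of spawning a
     * throwaway kthread per call. */
    struct work_for_cpu_sketch {
        struct work_struct work;
        long (*fn)(void *);
        void *arg;
        long ret;
    };

    static void work_for_cpu_fn(struct work_struct *work)
    {
        struct work_for_cpu_sketch *wfc =
            container_of(work, struct work_for_cpu_sketch, work);

        wfc->ret = wfc->fn(wfc->arg);
    }

    static long work_on_cpu_sketch(unsigned int cpu, long (*fn)(void *),
                                   void *arg)
    {
        struct work_for_cpu_sketch wfc = { .fn = fn, .arg = arg };

        INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
        schedule_work_on(cpu, &wfc.work);    /* queues on system_wq */
        flush_work(&wfc.work);               /* wait for completion */
        return wfc.ret;
    }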
     

19 Sep, 2012

8 commits

  • @delayed is now always false for all callers, remove it.

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Currently, when try_to_grab_pending() grabs a delayed work item, it
    leaves its linked work items alone on the delayed_works. The linked
    work items are always NO_COLOR and will cause future
    cwq_activate_first_delayed() calls to increase cwq->nr_active
    incorrectly, and may cause the whole cwq to stall. For example,

    state: cwq->max_active = 1, cwq->nr_active = 1
    one work in cwq->pool, many in cwq->delayed_works.

    step1: try_to_grab_pending() removes a work item from delayed_works
    but leaves its NO_COLOR linked work items on it.

    step2: Later on, cwq_activate_first_delayed() activates the linked
    work item increasing ->nr_active.

    step3: cwq->nr_active = 1, but all activated work items of the cwq are
    NO_COLOR. When they finish, cwq->nr_active will not be
    decreased due to NO_COLOR, and no further work items will be
    activated from cwq->delayed_works. The cwq stalls.

    Fix it by ensuring the target work item is activated before stealing
    PENDING in try_to_grab_pending(). This ensures that all the linked
    work items are activated without incorrectly bumping cwq->nr_active.

    tj: Updated comment and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org

    Lai Jiangshan
     
  • workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so
    hotcpu_notifier() fits better than cpu_notifier().

    When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same.

    When HOTPLUG_CPU=n, if we use cpu_notifier(),
    workqueue_cpu_down_callback() will be called during boot to do
    nothing, and the memory of workqueue_cpu_down_callback() and
    gcwq_unbind_fn() will be discarded after boot.

    If we use hotcpu_notifier(), we can avoid the no-op call of
    workqueue_cpu_down_callback() and the memory of
    workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded
    at build time:

    $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
    -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier
    -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier

    $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
    text data bss dec hex filename
    18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier
    18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
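    For context, the difference comes from how the two registration macros
    behave when HOTPLUG_CPU is disabled - roughly like this (a paraphrase
    of the era's <linux/cpu.h>, not a verbatim copy):

    /* Rough paraphrase of the <linux/cpu.h> definitions of the time. */
    #ifdef CONFIG_HOTPLUG_CPU
    #define hotcpu_notifier(fn, pri)    cpu_notifier(fn, pri)
    #else
    /* Compiles fn out entirely, so the callback and anything only it
     * references can be discarded at build time. */
    #define hotcpu_notifier(fn, pri)    do { (void)(fn); } while (0)
    #endif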
     
  • For workqueue hotplug callbacks, it makes less sense to use __devinit,
    which discards the memory after boot if !HOTPLUG. __cpuinit, which
    discards the memory after boot if !HOTPLUG_CPU, fits better.

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Now that manager_mutex's role has changed from synchronizing manager
    role to excluding hotplug against manager, the name is misleading.

    As it is protecting the CPU-association of the gcwq now, rename it to
    assoc_mutex.

    This patch is pure rename and doesn't introduce any functional change.

    tj: Updated comments and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Now both worker destruction and idle rebinding remove the worker from
    the idle list while it's still idle, so list_empty(&worker->entry) can
    be used to test whether either is pending and WORKER_DIE can be used
    to distinguish between the two, making WORKER_REBIND unnecessary.

    Use list_empty(&worker->entry) to determine whether destruction or
    rebinding is pending. This simplifies worker state transitions.

    WORKER_REBIND is not needed anymore. Remove it.

    tj: Updated comments and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Because the old unbind/rebinding implementation wasn't atomic w.r.t.
    GCWQ_DISASSOCIATED manipulation which is protected by
    global_cwq->lock, we had to use two flags, WORKER_UNBOUND and
    WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with
    back-to-back CPU hotplug operations; otherwise, completion of
    rebinding while another unbinding is in progress could clear UNBIND
    prematurely.

    Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED,
    there's no need to use two flags. Just one is enough. Don't use
    WORKER_REBIND for busy rebinding.

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • Currently rebind_workers() rebinds idle workers synchronously
    before proceeding to request busy workers to rebind. This is
    necessary because all workers on @worker_pool->idle_list must be bound
    before concurrency management local wake-ups from the busy workers
    take place.

    Unfortunately, the synchronous idle rebinding is quite complicated.
    This patch reimplements idle rebinding to simplify the code path.

    Rather than trying to make all idle workers bound before rebinding
    busy workers, we simply remove all to-be-bound idle workers from the
    idle list and let them add themselves back after completing rebinding
    (successful or not).

    As only workers which have finished rebinding can be on the idle
    worker list, the idle worker list is guaranteed to contain only bound
    workers unless the CPU went down again, and local wake-ups are safe.

    After the change, @worker_pool->nr_idle may deviate from the actual
    number of idle workers on @worker_pool->idle_list. More specifically,
    nr_idle may be non-zero while ->idle_list is empty. All users of
    ->nr_idle and ->idle_list are audited. The only affected one is
    too_many_workers() which is updated to return %false if ->idle_list is
    empty regardless of ->nr_idle.

    After this patch, rebind_workers() no longer performs the nasty
    idle-rebind retries which require temporary release of gcwq->lock, and
    both unbinding and rebinding are atomic w.r.t. global_cwq->lock.

    worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
    and removed along with the definition of struct idle_rebind.

    Changes from V1:
    1) remove unlikely from too_many_workers(); ->idle_list can be empty
    anytime, even before this patch, so there is no reason to use unlikely.
    2) fix a small rebasing mistake
    (from rebasing the original fix patch onto for-next).
    3) add a lot of comments.
    4) clear WORKER_REBIND unconditionally in idle_worker_rebind().

    tj: Updated comments and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

18 Sep, 2012

2 commits

  • This merge is necessary as Lai's CPU hotplug restructuring series
    depends on the CPU hotplug bug fixes in for-3.6-fixes.

    The merge creates one trivial conflict between the following two
    commits.

    96e65306b8 "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic"
    e2b6a6d570 "workqueue: use system_highpri_wq for highpri workers in rebind_workers()"

    Both add local variable definitions to the same block and can be
    merged in any order.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
    (CPU is down again). This used to be okay because the flag wasn't
    used for anything else.

    However, after 25511a477 "workqueue: reimplement CPU online rebinding
    to handle idle workers", WORKER_REBIND is also used to command idle
    workers to rebind. If not cleared, the worker may confuse the next
    CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
    prematurely calling idle_worker_rebind().

    WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x500()
    Hardware name: Bochs
    Modules linked in: test_wq(O-)
    Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3
    Call Trace:
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] worker_thread+0x4cd/0x500
    [] kthread+0xbe/0xd0
    [] kernel_thread_helper+0x4/0x10
    ---[ end trace e977cf20f4661968 ]---
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] worker_thread+0x360/0x500
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: test_wq(O-)
    CPU 0
    Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs
    RIP: 0010:[] [] worker_thread+0x360/0x500
    RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
    RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
    R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
    FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
    Stack:
    ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
    ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
    ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
    Call Trace:
    [] kthread+0xbe/0xd0
    [] kernel_thread_helper+0x4/0x10
    Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
    RIP [] worker_thread+0x360/0x500
    RSP
    CR2: 0000000000000000

    There was no reason to keep WORKER_REBIND on failure in the first
    place - WORKER_UNBOUND is guaranteed to be set in such cases
    preventing incorrectly activating concurrency management. Always
    clear WORKER_REBIND.

    tj: Updated comment and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

11 Sep, 2012

2 commits

  • To simplify both normal and CPU hotplug paths, worker management is
    prevented while CPU hotplug is in progress. This is achieved by CPU
    hotplug holding the same exclusion mechanism used by workers to ensure
    there's only one manager per pool.

    If someone else seems to be performing the manager role, workers
    proceed to execute work items. CPU hotplug using the same mechanism
    can lead to idle worker depletion because all workers could proceed to
    execute work items while CPU hotplug is in progress and CPU hotplug
    itself wouldn't actually perform the worker management duty - it
    doesn't guarantee that there's an idle worker left when it releases
    management.

    This idle worker depletion, under extreme circumstances, can break
    forward-progress guarantee and thus lead to deadlock.

    This patch fixes the bug by using separate mechanisms for manager
    exclusion among workers and hotplug exclusion. For manager exclusion,
    POOL_MANAGING_WORKERS which was restored by the previous patch is
    used. pool->manager_mutex is now only used for exclusion between the
    elected manager and CPU hotplug. The elected manager won't proceed
    without holding pool->manager_mutex.

    This ensures that the worker which won the manager position can't skip
    managing while CPU hotplug is in progress. It will block on
    manager_mutex and perform management after CPU hotplug is complete.

    Note that hotplug may happen while waiting for manager_mutex. A
    manager is on neither the idle nor the busy list and thus the hotplug
    code can't unbind/rebind it. Make the manager handle its own un/rebinding.

    tj: Updated comment and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • This patch restores POOL_MANAGING_WORKERS which was replaced by
    pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq
    manager exclusion".

    There's a subtle idle worker depletion bug across CPU hotplug events
    and we need to distinguish an actual manager and CPU hotplug
    preventing management. POOL_MANAGING_WORKERS will be used for the
    former and manager_mutex for the latter.

    This patch just lays POOL_MANAGING_WORKERS on top of the existing
    manager_mutex and doesn't introduce any synchronization changes. The
    next patch will update it.

    Note that this patch fixes a non-critical anomaly where
    too_many_workers() may return %true spuriously while CPU hotplug is in
    progress. While the issue could schedule the idle timer spuriously, it
    didn't trigger any actual misbehavior.

    tj: Rewrote patch description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

06 Sep, 2012

2 commits

  • Currently, rebind_workers() and idle_worker_rebind() are two-way
    interlocked. rebind_workers() waits for idle workers to finish
    rebinding and rebound idle workers wait for rebind_workers() to finish
    rebinding busy workers before proceeding.

    Unfortunately, this isn't enough. The second wait from idle workers
    is implemented as follows.

    wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));

    rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
    then returns. If a CPU hotplug cycle happens again before one of the
    idle workers finishes the above wait_event(), rebind_workers() will
    repeat the first part of the handshake - set WORKER_REBIND again and
    wait for the idle worker to finish rebinding - and this leads to
    deadlock because the idle worker would be waiting for WORKER_REBIND to
    clear.

    This is fixed by adding another interlocking step at the end -
    rebind_workers() now waits for all the idle workers to finish the
    above WORKER_REBIND wait before returning. This ensures that all
    rebinding steps are complete on all idle workers before the next
    hotplug cycle can happen.

    This problem was diagnosed by Lai Jiangshan who also posted a patch to
    fix the issue, upon which this patch is based.

    This is the minimal fix and further patches are scheduled for the next
    merge window to simplify the CPU hotplug path.

    Signed-off-by: Tejun Heo
    Original-patch-by: Lai Jiangshan
    LKML-Reference:

    Tejun Heo
     
  • This doesn't make any functional difference and is purely to help the
    next patch to be simpler.

    Signed-off-by: Tejun Heo
    Cc: Lai Jiangshan

    Tejun Heo
     

05 Sep, 2012

1 commit

  • The compiler may compile the following code into TWO write/modify
    instructions.

    worker->flags &= ~WORKER_UNBOUND;
    worker->flags |= WORKER_REBIND;

    so the other CPU may temporarily see worker->flags which doesn't have
    either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
    prematurely.

    Fix it by using a single explicit assignment via ACCESS_ONCE().

    Because idle workers have another WORKER_NOT_RUNNING flag, this bug
    doesn't exist for them; however, update it to use the same pattern for
    consistency.

    tj: Applied the change to idle workers too and updated comments and
    patch description a bit.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Lai Jiangshan
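    A sketch of the fix pattern (the types and flag names below are
    illustrative stand-ins, not the workqueue-internal struct worker and
    WORKER_* flags): compute the new flag word locally, then publish it
    with one store.

    #include <linux/compiler.h>

    struct fake_worker { unsigned int flags; };
    #define F_UNBOUND    (1 << 0)
    #define F_REBIND     (1 << 1)

    static void rebind_flags(struct fake_worker *worker)
    {
        /* Buggy form: two read-modify-write statements - another CPU may
         * briefly observe flags with neither bit set.
         *
         *    worker->flags &= ~F_UNBOUND;
         *    worker->flags |= F_REBIND;
         */

        /* Fixed form: build the value in a local, store it once. */
        unsigned int flags = worker->flags;

        flags |= F_REBIND;
        flags &= ~F_UNBOUND;
        ACCESS_ONCE(worker->flags) = flags;
    }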
     

22 Aug, 2012

2 commits

  • cancel_delayed_work() can't be called from IRQ handlers due to its use
    of del_timer_sync() and can't cancel work items which are already
    transferred from timer to worklist.

    Also, unlike other flush and cancel functions, a canceled delayed_work
    would still point to the last associated cpu_workqueue. If the
    workqueue is destroyed afterwards and the work item is re-used on a
    different workqueue, the queueing code can oops trying to dereference
    already freed cpu_workqueue.

    This patch reimplements cancel_delayed_work() using
    try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
    allows the function to be called from IRQ handlers and makes its
    behavior consistent with other flush / cancel functions.

    Signed-off-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Andrew Morton

    Tejun Heo
     
  • Up to now, for delayed_works, try_to_grab_pending() couldn't be used
    from IRQ handlers because IRQs may happen while
    delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.

    This patch makes delayed_work use the new TIMER_IRQSAFE flag for
    delayed_work->timer. This makes try_to_grab_pending() and thus
    mod_delayed_work_on() safe to call from IRQ handlers.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

21 Aug, 2012

4 commits

  • Now that all workqueues are non-reentrant, system[_freezable]_wq() are
    equivalent to system_nrt[_freezable]_wq(). Replace the latter with
    wrappers around system[_freezable]_wq(). The wrapping goes through
    inline functions so that __deprecated can be added easily.

    Signed-off-by: Tejun Heo

    Tejun Heo
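    A sketch of the wrapper shape described above (simplified; the real
    helper names and macros may differ slightly):

    #include <linux/workqueue.h>

    /* Sketch: with every workqueue non-reentrant, the "nrt" queues can
     * simply alias the plain ones, with __deprecated attached via an
     * inline helper. */
    static inline struct workqueue_struct * __deprecated
    system_nrt_wq_sketch(void)
    {
        return system_wq;
    }

    static inline struct workqueue_struct * __deprecated
    system_nrt_freezable_wq_sketch(void)
    {
        return system_freezable_wq;
    }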
     
  • Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
    are equivalent to flush[_delayed]_work(). Drop the separate
    implementation and make them thin wrappers around
    flush[_delayed]_work().

    * start_flush_work() no longer takes @wait_executing as the only left
    user - flush_work() - always sets it to %true.

    * __cancel_work_timer() uses flush_work() instead of wait_on_work().

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • By default, each per-cpu part of a bound workqueue operates separately
    and a work item may be executing concurrently on different CPUs. The
    behavior avoids some cross-cpu traffic but leads to subtle weirdities
    and not-so-subtle contortions in the API.

    * There's no sane usefulness in allowing a single work item to be
    executed concurrently on multiple CPUs. People just get the
    behavior unintentionally and get surprised after learning about it.
    Most either explicitly synchronize or use non-reentrant/ordered
    workqueue but this is error-prone.

    * flush_work() can't wait for multiple instances of the same work item
    on different CPUs. If a work item is executing on cpu0 and then
    queued on cpu1, flush_work() can only wait for the one on cpu1.

    Unfortunately, work items can easily cross CPU boundaries
    unintentionally when the queueing thread gets migrated. This means
    that if multiple queuers compete, flush_work() can't even guarantee
    that the instance queued right before it is finished before
    returning.

    * flush_work_sync() was added to work around some of the deficiencies
    of flush_work(). In addition to the usual flushing, it ensures that
    all currently executing instances are finished before returning.
    This operation is expensive as it has to walk all CPUs and at the
    same time fails to address competing queuer case.

    Incorrectly using flush_work() when flush_work_sync() is necessary
    is an easy error to make and can lead to bugs which are difficult to
    reproduce.

    * Similar problems exist for flush_delayed_work[_sync]().

    Other than the cross-cpu access concern, there's no benefit in
    allowing parallel execution and it's plain silly to have this level of
    contortion for workqueue which is widely used from core code to
    extremely obscure drivers.

    This patch makes all workqueues non-reentrant. If a work item is
    executing on a different CPU when queueing is requested, it is always
    queued to that CPU. This guarantees that any given work item can be
    executing on one CPU at maximum and if a work item is queued and
    executing, both are on the same CPU.

    The only behavior change which may affect workqueue users negatively
    is that non-reentrancy overrides the affinity specified by
    queue_work_on(). On a reentrant workqueue, the affinity specified by
    queue_work_on() is always followed. Now, if the work item is
    executing on one of the CPUs, the work item will be queued there
    regardless of the requested affinity. I've reviewed all workqueue
    users which request explicit affinity, and, fortunately, none seems to
    be crazy enough to exploit parallel execution of the same work item.

    This adds an additional busy_hash lookup if the work item was
    previously queued on a different CPU. This shouldn't be noticeable
    under any sane workload. Work item queueing isn't a very
    high-frequency operation and they don't jump across CPUs all the time.
    In a micro benchmark to exaggerate this difference - measuring the
    time it takes for two work items to repeatedly jump between two CPUs a
    number (10M) of times with busy_hash table densely populated, the
    difference was around 3%.

    While the overhead is measurable, it is only visible in pathological
    cases and the difference isn't huge. This change brings much needed
    sanity to workqueue and makes its behavior consistent with timer. I
    think this is the right tradeoff to make.

    This enables significant simplification of workqueue API.
    Simplification patches will follow.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Fixed some checkpatch warnings.

    tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.

    Signed-off-by: Valentin Ilie
    Signed-off-by: Tejun Heo
    LKML-Reference:

    Valentin Ilie
     

17 Aug, 2012

6 commits

  • To speed up cpu-down processing, use system_highpri_wq.
    As the scheduling priority of its workers is higher than system_wq's
    and it is not contended by other normal work items on this cpu, work
    on it is processed faster than on system_wq.

    tj: CPU up/downs care quite a bit about latency these days. This
    shouldn't hurt anything and makes sense.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • In rebind_workers(), we insert a rebind work item for each busy worker
    to rebind it to its cpu. Currently, in this case, we use only
    system_wq. This creates a possible error situation as there is a
    mismatch between cwq->pool and worker->pool.

    To prevent this, we should use system_highpri_wq for highpri workers
    so the two match. This patch implements it.

    tj: Rephrased comment a bit.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
    WQ_HIGHPRI using a separate worker_pool') introduced a separate worker
    pool for HIGHPRI. When we handle busy workers for a gcwq, each can be a
    normal worker or a highpri worker. But we don't consider this
    difference in rebind_workers(); we use just system_wq for highpri
    workers. This creates a mismatch between cwq->pool and worker->pool.

    It doesn't cause an error in the current implementation, but it could
    in the future. Introduce system_highpri_wq so the proper cwq is used
    for highpri workers in rebind_workers(). The following patch fixes
    this issue properly.

    tj: Even apart from rebinding, having system_highpri_wq generally
    makes sense.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • We assign a cpu id into the work struct's data field in
    __queue_delayed_work_on(). In the current implementation, when a work
    item comes in for the first time, the currently running cpu id is
    assigned. If we do __queue_delayed_work_on() with CPU A on CPU B,
    __queue_work() invoked from delayed_work_timer_fn() goes into
    the following sub-optimal path in the WQ_NON_REENTRANT case.

    gcwq = get_gcwq(cpu);
    if (wq->flags & WQ_NON_REENTRANT &&
        (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

    Change lcpu to @cpu, and change lcpu back to the local cpu if lcpu is
    WORK_CPU_UNBOUND. This is sufficient to prevent entering the
    sub-optimal path.

    tj: Slightly rephrased the comment.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • When we trace workqueue_queue_work(), it records the requested cpu.
    But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, the
    requested cpu is changed to the local cpu.
    In the @wq->flags & WQ_UNBOUND case, this change does not occur,
    therefore it is reasonable to correct the recorded value.

    Use a temporary local variable to store the requested cpu.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
     
  • Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
    WQ_HIGHPRI using a separate worker_pool') introduced a separate
    worker_pool for HIGHPRI. Although there is an NR_WORKER_POOLS enum
    value which represents the number of pools, the definition of the
    worker_pool array in gcwq doesn't use it. Using it makes the code more
    robust and prevents future mistakes, so change the code to use this
    enum value.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
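    A generic illustration of the pattern (not the actual gcwq definition):
    sizing the array with the enum keeps the declaration and every loop
    over it in sync.

    enum { NR_POOLS = 2 };            /* normal and highpri */

    struct pool { int nr_workers; };

    struct cpu_state {
        struct pool pools[NR_POOLS];  /* instead of a bare pools[2] */
    };

    static void reset_pools(struct cpu_state *cs)
    {
        int i;

        for (i = 0; i < NR_POOLS; i++)    /* bound can't drift from array */
            cs->pools[i].nr_workers = 0;
    }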
     

14 Aug, 2012

1 commit

  • Any operation which clears PENDING should be preceded by a wmb to
    guarantee that the next PENDING owner sees all the changes made before
    PENDING release.

    There are only two places where PENDING is cleared -
    set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
    the former already does smp_wmb() but the latter doesn't have any.

    Move the wmb above set_work_cpu_and_clear_pending() into it and add
    one to clear_work_data().

    There hasn't been any report related to this issue, and, given how
    clear_work_data() is used, it is extremely unlikely to have caused any
    actual problems on any architecture.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
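    A simplified sketch of the resulting shape (not the verbatim function;
    set_work_data() and WORK_STRUCT_NO_CPU are workqueue internals in
    kernel/workqueue.c):

    /* Sketch: clear_work_data() now issues the barrier itself, mirroring
     * what set_work_cpu_and_clear_pending() already does. */
    static void clear_work_data_sketch(struct work_struct *work)
    {
        /* Make every change done while PENDING was owned visible to the
         * next PENDING owner before PENDING is released. */
        smp_wmb();
        set_work_data(work, WORK_STRUCT_NO_CPU, 0);    /* clears PENDING */
    }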