27 Oct, 2015

1 commit

  • commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

    My system keeps crashing with the message below. vmstat_update() schedules a
    delayed work on the current cpu and expects the work to run on that cpu.
    schedule_delayed_work() is supposed to make delayed work run on the local cpu.
    The problem is that the timer can be migrated with NO_HZ. __queue_work() queues
    the work from the timer handler, which could run on a cpu other than the one
    where the delayed work was scheduled. The end result is that the delayed work
    runs on a different cpu. The patch makes __queue_delayed_work() record the
    local cpu earlier, so where the timer runs no longer changes where the work
    runs.

    [ 28.010131] ------------[ cut here ]------------
    [ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
    [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [ 28.011860] Modules linked in:
    [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W 4.3.0-rc3+ #634
    [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
    [ 28.014160] Workqueue: events vmstat_update
    [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
    [ 28.015445] RIP: 0010:[] [] vmstat_update+0x31/0x80
    [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
    [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
    [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
    [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
    [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
    [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
    [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
    [ 28.020071] Stack:
    [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
    [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
    [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
    [ 28.020071] Call Trace:
    [ 28.020071] [] process_one_work+0x1c8/0x540
    [ 28.020071] [] ? process_one_work+0x14b/0x540
    [ 28.020071] [] worker_thread+0x114/0x460
    [ 28.020071] [] ? process_one_work+0x540/0x540
    [ 28.020071] [] kthread+0xf8/0x110
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200
    [ 28.020071] [] ret_from_fork+0x3f/0x70
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
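
    A minimal illustration (not the upstream diff) of the expectation this
    fix restores: per-cpu statistics work queued with schedule_delayed_work()
    is assumed to run on the queueing cpu. A caller that must be explicit can
    name the cpu with queue_delayed_work_on(), as in the hypothetical sketch
    below.

        #include <linux/percpu.h>
        #include <linux/workqueue.h>

        /* hypothetical per-cpu stats work, in the spirit of vmstat_update() */
        static DEFINE_PER_CPU(struct delayed_work, stats_work);

        static void stats_workfn(struct work_struct *work)
        {
                /* touches per-cpu counters; must run on the cpu it was queued for */
        }

        static void stats_init(void)
        {
                int cpu;

                for_each_possible_cpu(cpu)
                        INIT_DELAYED_WORK(&per_cpu(stats_work, cpu), stats_workfn);
        }

        static void kick_stats_on(int cpu, unsigned long delay)
        {
                /* name the cpu explicitly instead of relying on timer placement */
                queue_delayed_work_on(cpu, system_wq,
                                      &per_cpu(stats_work, cpu), delay);
        }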
     

06 Apr, 2015

1 commit

    The sysfs code usually belongs at the bottom of the file since it deals
    with high level objects. In the workqueue code it's misplaced such that
    we would need to work around function references to allow the sysfs
    code to call APIs like apply_workqueue_attrs().

    Let's move that block further down in the file, almost to the bottom.

    Also declare workqueue_sysfs_unregister() just before destroy_workqueue(),
    which references it.

    tj: Moved workqueue_sysfs_unregister() forward declaration where other
    forward declarations are.

    Suggested-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Kevin Hilman
    Cc: Lai Jiangshan
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Frederic Weisbecker
     

09 Mar, 2015

3 commits

  • Workqueues are used extensively throughout the kernel but sometimes
    it's difficult to debug stalls involving work items because visibility
    into its inner workings is fairly limited. Although sysrq-t task dump
    annotates each active worker task with the information on the work
    item being executed, it is challenging to find out which work items
    are pending or delayed on which queues and how pools are being
    managed.

    This patch implements show_workqueue_state() which dumps all busy
    workqueues and pools and is called from the sysrq-t handler. At the
    end of sysrq-t dump, something like the following is printed.

    Showing busy workqueues and worker pools:
    ...
    workqueue filler_wq: flags=0x0
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
        in-flight: 491:filler_workfn, 507:filler_workfn
      pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
        in-flight: 501:filler_workfn
        pending: filler_workfn
    ...
    workqueue test_wq: flags=0x8
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
        in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
        delayed: test_workfn1 BAR(492), test_workfn2
    ...
    pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
    pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
    pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
    pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62

    The above shows that test_wq is executing test_workfn() on pid 510
    which is the rescuer and also that there are two tasks 69 and 500
    waiting for the work item to finish in flush_work(). As test_wq has
    max_active of 1, there are two work items for test_workfn1() and
    test_workfn2() which are delayed till the current work item is
    finished. In addition, pid 492 is flushing test_workfn1().

    The work item for test_workfn() is being executed on pwq of pool 2
    which is the normal priority per-cpu pool for CPU 1. The pool has
    three workers, two of which are executing filler_workfn() for
    filler_wq and the last one is assuming the manager role trying to
    create more workers.

    This extra workqueue state dump will hopefully help chasing down hangs
    involving workqueues.

    v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.

    v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
    printk()'s replaced with pr_info()'s, and cpumask printing now
    uses cpulist_pr_cont().

    Signed-off-by: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    CC: Ingo Molnar

    Tejun Heo
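
    As a hedged illustration (not from the patch itself) of where the BAR()
    annotations come from: a task calling flush_work() on a queued or running
    item inserts a wq_barrier behind it and is listed as BAR(<pid>) after that
    item, roughly as in the hypothetical snippet below.

        #include <linux/workqueue.h>

        static void test_workfn(struct work_struct *work)
        {
                /* long-running body; shows up as "in-flight: <pid>:test_workfn" */
        }

        static DECLARE_WORK(test_work, test_workfn);

        static void wait_for_test_work(struct workqueue_struct *test_wq)
        {
                queue_work(test_wq, &test_work);
                flush_work(&test_work);   /* this task is listed as BAR(<its pid>) */
        }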
     
  • Add wq_barrier->task and worker_pool->manager to keep track of the
    flushing task and pool manager respectively. These are purely
    informational and will be used to implement sysrq dump of workqueues.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • The workqueues list is protected by wq_pool_mutex and a workqueue and
    its subordinate data structures are freed directly on destruction. We
    want to add the ability to dump workqueues from a sysrq callback, which
    requires walking all workqueues without grabbing wq_pool_mutex. This
    patch makes freeing of workqueues RCU protected and makes the
    workqueues list walkable while holding RCU read lock.

    Note that pool_workqueues and pools are already sched-RCU protected.
    For consistency, workqueues are also protected with sched-RCU.

    While at it, reverse the workqueues list so that a workqueue which is
    created earlier comes earlier in the list. The order of the list isn't
    significant functionally, but this makes the planned sysrq dump list
    system workqueues first.

    Signed-off-by: Tejun Heo

    Tejun Heo
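
    A minimal sketch of the walking pattern this change enables, assuming the
    global list is named workqueues and is linked through wq->list; treat it
    as illustrative rather than the exact sysrq code.

        struct workqueue_struct *wq;

        rcu_read_lock_sched();
        list_for_each_entry_rcu(wq, &workqueues, list) {
                /* inspect wq; its memory is freed only after a sched-RCU grace period */
        }
        rcu_read_unlock_sched();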
     

05 Mar, 2015

1 commit

  • cancel[_delayed]_work_sync() are implemented using
    __cancel_work_timer() which grabs the PENDING bit using
    try_to_grab_pending() and then flushes the work item with PENDING set
    to prevent the on-going execution of the work item from requeueing
    itself.

    try_to_grab_pending() can always grab the PENDING bit without blocking
    except when someone else is doing the above flushing during
    cancelation. In that case, try_to_grab_pending() returns -ENOENT and
    __cancel_work_timer() currently invokes flush_work(). The
    assumption is that the completion of the work item is what the other
    canceling task would be waiting for too, and thus waiting for the same
    condition and retrying should allow forward progress without excessive
    busy looping.

    Unfortunately, this doesn't work if preemption is disabled or the
    latter task has real time priority. Let's say task A just got woken
    up from flush_work() by the completion of the target work item. If,
    before task A starts executing, task B gets scheduled and invokes
    __cancel_work_timer() on the same work item, its try_to_grab_pending()
    will return -ENOENT as the work item is still being canceled by task A
    and flush_work() will also immediately return false as the work item
    is no longer executing. This puts task B in a busy loop possibly
    preventing task A from executing and clearing the canceling state on
    the work item leading to a hang.

    task A                     task B                     worker

                                                          executing work
    __cancel_work_timer()
      try_to_grab_pending()
      set work CANCELING
      flush_work()
        block for work completion
                                                          completion, wakes up A
                               __cancel_work_timer()
                               while (forever) {
                                 try_to_grab_pending()
                                   -ENOENT as work is being canceled
                                 flush_work()
                                   false as work is no longer executing
                               }

    This patch removes the possible hang by updating __cancel_work_timer()
    to explicitly wait for clearing of CANCELING rather than invoking
    flush_work() after try_to_grab_pending() fails with -ENOENT.

    Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

    v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area. Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

    v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it. Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu
    Vizoso.

    Signed-off-by: Tejun Heo
    Reported-by: Rabin Vincent
    Cc: Tomeu Vizoso
    Cc: stable@vger.kernel.org
    Tested-by: Jesper Nilsson
    Tested-by: Rabin Vincent

    Tejun Heo
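
    The problematic pattern, sketched with the public API only (hypothetical
    names): a self-requeueing work item being cancelled from two tasks at
    once. Before the fix the second cancel_work_sync() could busy-loop while
    the first held CANCELING; with the fix it sleeps until CANCELING is
    cleared.

        #include <linux/workqueue.h>

        static void requeue_workfn(struct work_struct *work)
        {
                queue_work(system_wq, work);      /* requeues itself */
        }

        static DECLARE_WORK(requeue_work, requeue_workfn);

        /* tasks A and B may both call this concurrently */
        static void stop_requeue_work(void)
        {
                cancel_work_sync(&requeue_work);
        }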
     

14 Feb, 2015

1 commit


17 Jan, 2015

1 commit

    A worker_pool's forward progress is guaranteed by the fact that the
    last idle worker assumes the manager role to create more workers and
    summon the rescuers if creating workers doesn't succeed in a timely
    manner, before proceeding to execute work items.

    This manager role is implemented in manage_workers(), which indicates
    whether the worker may proceed to work item execution with its return
    value. This is necessary because multiple workers may contend for the
    manager role, and, if there already is a manager, others should
    proceed to work item execution.

    Unfortunately, the function also indicates that the worker may proceed
    to work item execution if need_to_create_worker() is false at the head
    of the function. need_to_create_worker() tests the following
    conditions.

    pending work items && !nr_running && !nr_idle

    The first and third conditions are protected by pool->lock and thus
    won't change while holding pool->lock; however, nr_running can change
    asynchronously as other workers block and resume, and while it's likely
    to be zero, as someone woke this worker up in the first place, some
    other workers could have become runnable in between, making it non-zero.

    If this happens, manage_workers() could return false even with zero
    nr_idle, making the worker, the last idle one, proceed to execute work
    items. If then all workers of the pool end up blocking on a resource
    which can only be released by a work item which is pending on that
    pool, the whole pool can deadlock as there's no one to create more
    workers or summon the rescuers.

    This patch fixes the problem by removing the early exit condition from
    maybe_create_worker() and making manage_workers() return false iff
    there's already another manager, which ensures that the last worker
    doesn't start executing work items.

    We could leave the early exit condition alone and just ignore the
    return value, but the only reason it was put there is that
    manage_workers() used to perform both creation and destruction of
    workers and thus the function could be invoked while the pool was
    trying to reduce the number of workers. Now that manage_workers() is
    called only when more workers are needed, the only case in which this
    early exit condition triggers is a rare race condition, rendering it
    pointless.

    Tested with simulated workload and modified workqueue code which
    trigger the pool deadlock reliably without this patch.

    Signed-off-by: Tejun Heo
    Reported-by: Eric Sandeen
    Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
    Cc: Dave Chinner
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org

    Tejun Heo
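
    A hedged sketch of the deadlock shape described above (hypothetical
    names, public API only): every available worker blocks on something that
    only another, still-pending work item on the same pool can provide, so
    forward progress depends entirely on the manager or rescuer adding a
    worker.

        #include <linux/completion.h>
        #include <linux/workqueue.h>

        static DECLARE_COMPLETION(resource_ready);

        static void releaser_workfn(struct work_struct *work)
        {
                complete_all(&resource_ready);    /* stays pending if no worker is free */
        }

        static void blocker_workfn(struct work_struct *work)
        {
                wait_for_completion(&resource_ready);   /* existing workers block here */
        }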
     

09 Dec, 2014

2 commits

  • When there is serious memory pressure, all workers in a pool could be
    blocked, and a new thread cannot be created because it requires memory
    allocation.

    In this situation a WQ_MEM_RECLAIM workqueue will wake up the
    rescuer thread to do some work.

    The rescuer will only handle requests that are already on ->worklist.
    If max_active is 1, that means it will handle a single request.

    The rescuer will be woken again in 100ms to handle another max_active
    requests.

    I've seen a machine (running a 3.0 based "enterprise" kernel) with
    thousands of requests queued for xfslogd, which has a max_active of 1
    and is needed for retiring all 'xfs' write requests. When one of the
    worker pools gets into this state, it progresses extremely slowly and
    may never recover (I only waited an hour or two).

    With this patch we leave a pool_workqueue on mayday list
    until it is clearly no longer in need of assistance. This allows
    all requests to be handled in a timely fashion.

    We keep each pool_workqueue on the mayday list until
    need_to_create_worker() is false, and no work for this workqueue is
    found in the pool.

    I have tested this in combination with a (hackish) patch which forces
    all work items to be handled by the rescuer thread. In that context
    it significantly improves performance. A similar patch for a 3.0
    kernel significantly improved performance on a heavy work load.

    Thanks to Jan Kara for some design ideas, and to Dongsu Park for
    some comments and testing.

    tj: Inverted the lock order between wq_mayday_lock and pool->lock with
    a preceding patch and simplified this patch. Added comment and
    updated changelog accordingly. Dongsu spotted missing get_pwq()
    in the simplified code.

    Cc: Dongsu Park
    Cc: Jan Kara
    Cc: Lai Jiangshan
    Signed-off-by: NeilBrown
    Signed-off-by: Tejun Heo

    NeilBrown
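
    For reference, a minimal sketch (hypothetical name) of the kind of queue
    this patch is about: a reclaim-path workqueue with max_active of 1, as
    xfslogd had, whose rescuer is what keeps ->worklist draining under memory
    pressure.

        #include <linux/init.h>
        #include <linux/workqueue.h>

        static struct workqueue_struct *log_wq;

        static int __init log_wq_init(void)
        {
                /* WQ_MEM_RECLAIM guarantees a dedicated rescuer thread */
                log_wq = alloc_workqueue("log_wq", WQ_MEM_RECLAIM, 1);
                return log_wq ? 0 : -ENOMEM;
        }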
     
    Currently, wq_mayday_lock nests inside pool->lock. There's no inherent
    reason for this order. The only place where the two locks are held
    together is pool_mayday_timeout() and it just got decided that way.

    This nesting order turns out to complicate things with the planned
    rescuer_thread() update. Let's invert them. This doesn't cause any
    behavior differences.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan
    Cc: NeilBrown
    Cc: Dongsu Park

    Tejun Heo
     

04 Dec, 2014

1 commit

  • rescuer_thread() caches &rescuer->scheduled in a local variable
    scheduled for convenience. There's one WARN_ON_ONCE() which was using
    &rescuer->scheduled directly. Replace it with the local variable.

    This patch causes no functional difference.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

06 Oct, 2014

2 commits

    Tidy up by using cond_resched_rcu_qs() where we call cond_resched() and
    report a potential quiescent state to RCU. Splitting the change this way
    allows easy backporting to -stable for kernel versions that do not have
    cond_resched_rcu_qs().

    Signed-off-by: Joe Lawrence
    Acked-by: Tejun Heo
    Signed-off-by: Paul E. McKenney

    Joe Lawrence
     
  • Similar to the stop_machine deadlock scenario on !PREEMPT kernels
    addressed in b22ce2785d97 "workqueue: cond_resched() after processing
    each work item", kworker threads requeueing back-to-back with zero jiffy
    delay can stall RCU. The cond_resched call introduced in that fix will
    only yield if there are other, higher-priority tasks to run, so force a
    quiescent RCU state between work items.

    Signed-off-by: Joe Lawrence
    Link: https://lkml.kernel.org/r/20140926105227.01325697@jlaw-desktop.mno.stratus.com
    Link: https://lkml.kernel.org/r/20140929115445.40221d8e@jlaw-desktop.mno.stratus.com
    Fixes: b22ce2785d97 ("workqueue: cond_resched() after processing each work item")
    Cc:
    Acked-by: Tejun Heo
    Signed-off-by: Paul E. McKenney

    Joe Lawrence
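
    The workload in question, roughly (a hypothetical example using only the
    public API): a work item that immediately re-queues itself keeps a
    kworker busy back-to-back with zero delay, which on a !PREEMPT kernel
    could prevent RCU from seeing a quiescent state until this fix.

        #include <linux/workqueue.h>

        static void chain_workfn(struct work_struct *work)
        {
                /* ... a small unit of work ... */
                schedule_work(work);              /* re-queue with zero delay */
        }

        static DECLARE_WORK(chain_work, chain_workfn);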
     

05 Aug, 2014

2 commits

  • Pull percpu updates from Tejun Heo:

    - Major reorganization of percpu header files which I think makes
    things a lot more readable and logical than before.

    - percpu-refcount is updated so that it requires explicit destruction
    and can be reinitialized if necessary. This was pulled into the
    block tree to replace the custom percpu refcnting implemented in
    blk-mq.

    - In the process, percpu and percpu-refcount got cleaned up a bit

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
    percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
    percpu-refcount: require percpu_ref to be exited explicitly
    percpu-refcount: use unsigned long for pcpu_count pointer
    percpu-refcount: add helpers for ->percpu_count accesses
    percpu-refcount: one bit is enough for REF_STATUS
    percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
    workqueue: stronger test in process_one_work()
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    percpu: Use ALIGN macro instead of hand coding alignment calculation
    percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
    percpu: preffity percpu header files
    percpu: use raw_cpu_*() to define __this_cpu_*()
    percpu: reorder macros in percpu header files
    percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
    percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
    percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
    percpu: reorganize include/linux/percpu-defs.h
    percpu: move accessors from include/linux/percpu.h to percpu-defs.h
    percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
    percpu: introduce arch_raw_cpu_ptr()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "Lai has been doing a lot of cleanups of workqueue and kthread_work.
    No significant behavior change. Just a lot of cleanups all over the
    place. Some are a bit invasive but overall nothing too dangerous"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    kthread_work: remove the unused wait_queue_head
    kthread_work: wake up worker only when the worker is idle
    workqueue: use nr_node_ids instead of wq_numa_tbl_len
    workqueue: remove the misnamed out_unlock label in get_unbound_pool()
    workqueue: remove the stale comment in pwq_unbound_release_workfn()
    workqueue: move rescuer pool detachment to the end
    workqueue: unfold start_worker() into create_worker()
    workqueue: remove @wakeup from worker_set_flags()
    workqueue: remove an unneeded UNBOUND test before waking up the next worker
    workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
    workqueue: alloc struct worker on its local node
    workqueue: reuse the already calculated pwq in try_to_grab_pending()
    workqueue: stronger test in process_one_work()
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    workqueue: sanity check pool->cpu in wq_worker_sleeping()
    workqueue: clear leftover flags when detached
    workqueue: remove useless WARN_ON_ONCE()
    workqueue: use schedule_timeout_interruptible() instead of open code
    workqueue: remove the empty check in too_many_workers()
    workqueue: use "pool->cpu < 0" to stand for an unbound pool

    Linus Torvalds
     

23 Jul, 2014

6 commits

  • They are the same and nr_node_ids is provided by the memory subsystem.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    After the locking was moved up to the caller of get_unbound_pool(), the
    out_unlock label no longer does any unlock operation and the name became
    misleading, so we just remove the label and substitute the only usage
    site, "goto out_unlock", with "return pool".

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    In 75ccf5950f82 ("workqueue: prepare flush_workqueue() for dynamic
    creation and destrucion of unbound pool_workqueues"), a comment
    about the synchronization for the pwq in pwq_unbound_release_workfn()
    was added. The comment claimed that flush_mutex wasn't strictly
    necessary, which was correct at the time because the pwq was protected
    by workqueue_lock.

    It is incorrect now: wq->flush_mutex was renamed to wq->mutex and
    workqueue_lock was removed, so wq->mutex is strictly needed. The
    comment was simply not updated when the synchronization was changed.

    This patch removes the incorrect comment and doesn't add a new one to
    explain why wq->mutex is needed here; that is obvious, and wq->pwqs_node
    already carries the "WQ" locking annotation in its definition, which is
    a better comment.

    The old commit mentioned above also introduced a comment in link_pwq()
    about the synchronization. That comment is also removed in this patch
    since the whole of link_pwq() is protected by wq->mutex.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    In 51697d393922 ("workqueue: use generic attach/detach routine for
    rescuers"), the rescuer detaches itself from the pool before put_pwq()
    so that put_unbound_pool() will not destroy the rescuer-attached pool.

    This is unnecessary. worker_detach_from_pool() can be the last statement
    that accesses the pool, just as for the regular workers;
    put_unbound_pool() will wait for it to detach and then free the pool.

    So we move worker_detach_from_pool() down, making the rescuer behave
    like the regular workers.

    tj: Minor description update.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    Simply unfold the code of start_worker() into create_worker() and
    remove the original start_worker() and create_and_start_worker().

    The only trade-off is the added overhead of releasing and re-grabbing
    pool->lock after the new worker is started. The overhead is acceptable
    since the manager is a slow path.

    Because of this new locking behavior, the newly created worker may grab
    the lock earlier than the manager and go on to process work items. In
    that case, the recheck of need_to_create_worker() may be true as
    expected and the manager restarts, which is the correct behavior.

    tj: Minor updates to description and comments.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • worker_set_flags() has only two callers, each specifying %true and
    %false for @wakeup. Let's push the wake up to the caller and remove
    @wakeup from worker_set_flags(). The caller can use the following
    instead if wakeup is necessary:

    worker_set_flags();
    if (need_more_worker(pool))
            wake_up_worker(pool);

    This makes the code simpler. This patch doesn't introduce behavior
    changes.

    tj: Updated description and comments.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

22 Jul, 2014

1 commit

  • In process_one_work():

    if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
            wake_up_worker(pool);

    the first test is unneeded. Even if the first test is removed, it
    doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
    introduce any useless wake-ups for normal per-cpu workers since
    nr_running is always >= 1. It will introduce useless/redundant
    wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
    will also remove this redundant wake-up.

    tj: Minor updates to the description and comment.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

19 Jul, 2014

1 commit


15 Jul, 2014

1 commit

    When create_worker() is called from a non-manager, the struct worker is
    allocated from the node of the caller, which may be different from
    pool->node.

    So we add a node ID argument to alloc_worker() to ensure the struct
    worker is allocated from the preferred node.

    tj: @nid renamed to @node for consistency.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

11 Jul, 2014

1 commit


07 Jul, 2014

1 commit

    When hot-adding and onlining a CPU, a kernel panic occurs, showing the
    following call trace.

    BUG: unable to handle kernel paging request at 0000000000001d08
    IP: [] __alloc_pages_nodemask+0x9d/0xb10
    PGD 0
    Oops: 0000 [#1] SMP
    ...
    Call Trace:
    [] ? cpumask_next_and+0x35/0x50
    [] ? find_busiest_group+0x113/0x8f0
    [] ? deactivate_slab+0x349/0x3c0
    [] new_slab+0x91/0x300
    [] __slab_alloc+0x2bb/0x482
    [] ? copy_process.part.25+0xfc/0x14c0
    [] ? load_balance+0x218/0x890
    [] ? sched_clock+0x9/0x10
    [] ? trace_clock_local+0x9/0x10
    [] kmem_cache_alloc_node+0x8c/0x200
    [] copy_process.part.25+0xfc/0x14c0
    [] ? trace_buffer_unlock_commit+0x4d/0x60
    [] ? kthread_create_on_node+0x140/0x140
    [] do_fork+0xbc/0x360
    [] kernel_thread+0x26/0x30
    [] kthreadd+0x2c2/0x300
    [] ? kthread_create_on_cpu+0x60/0x60
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_cpu+0x60/0x60

    In my investigation, I found the root cause is wq_numa_possible_cpumask.
    All entries of wq_numa_possible_cpumask are allocated by
    alloc_cpumask_var_node() and are used without being initialized, so they
    contain garbage values.

    When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
    wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
    calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
    as follows:

        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

    But wq_numa_possible_cpumask[node] does not contain a correct cpumask, so
    the wrong node is selected and, as a result, a kernel panic occurs.

    With this patch, all entries of wq_numa_possible_cpumask are allocated with
    zalloc_cpumask_var_node() so that they are initialized, and the panic
    disappears.

    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")

    Yasuaki Ishimatsu
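
    The shape of the fix, sketched (not the literal diff): allocate each
    per-node mask zeroed so it holds a valid, empty cpumask before any CPUs
    are or'd into it.

        int node;

        /* before the fix, alloc_cpumask_var_node() left the mask uninitialized */
        for_each_node(node)
                BUG_ON(!zalloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
                                                GFP_KERNEL, node));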
     

02 Jul, 2014

2 commits

  • When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
    be the same as pool->cpu without any exception even during cpu-hotplug.

    This patch changes "(proposition_A && proposition_B && proposition_C)"
    to "(proposition_B && proposition_C)", so if the old compound
    proposition is true, the new one must be true too. Thus this won't hide
    any possible bug which could be hit by the old test.

    tj: Minor description update and dropped the obvious comment.

    CC: Jason J. Herne
    CC: Sasha Levin
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • a9ab775bcadf ("workqueue: directly restore CPU affinity of workers
    from CPU_ONLINE") moved pool locking into rebind_workers() but left
    "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

    There is nothing necessarily wrong with it, but there is no benefit
    either. Let's move it into rebind_workers() and achieve the following
    benefits:

    1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
    as expected.

    2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
    running workers of the pool are on the local CPU (pool->cpu).

    tj: Minor description update.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

24 Jun, 2014

1 commit


20 Jun, 2014

8 commits

  • After the recent changes, when POOL_DISASSOCIATED is cleared, the
    running worker's local CPU should be the same as pool->cpu without any
    exception even during cpu-hotplug. Update the sanity check in
    process_one_work() accordingly.

    This patch changes "(proposition_A && proposition_B && proposition_C)"
    to "(proposition_B && proposition_C)", so if the old compound
    proposition is true, the new one must be true too. Thus this will not
    hide any possible bug which could be caught by the old test.

    tj: Minor updates to the description.

    CC: Jason J. Herne
    CC: Sasha Levin
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • The commit a9ab775bcadf ("workqueue: directly restore CPU affinity of
    workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
    without also moving "pool->flags &= ~POOL_DISASSOCIATED".

    There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
    being moved together, but there isn't any benefit either. We move it
    into rebind_workers() and achieve these benefits:

    1) Better readability. POOL_DISASSOCIATED is cleared in
    rebind_workers() as expected.

    2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
    running workers of the pool are on the local CPU (pool->cpu).

    tj: Cosmetic updates to the code and description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    In theory, pool->cpu is equal to @cpu in wq_worker_sleeping() after
    worker->flags is checked.

    The "pool->cpu != cpu" sanity check will help us if something goes wrong.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    When a worker is detached, worker->flags may still have WORKER_UNBOUND
    or WORKER_REBOUND set. This is OK in all cases:
    1) if it is a normal worker, the worker is about to die, so it is OK;
    2) if it is a rescuer, it may re-attach to a pool with the leftover
    flag(s), which is still correct except that it may cause an unneeded
    wakeup.

    It is correct but not good, so we just clear the leftover flags.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • The @cpu is fetched via smp_processor_id() in this function,
    so the check is useless.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
    the original code.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    The commit ea1abd6197d5 ("workqueue: reimplement idle worker rebinding")
    used a trick which simply removed all to-be-bound idle workers from the
    idle list and let them add themselves back after completing rebinding.

    This trick caused @worker_pool->nr_idle to deviate from the actual
    number of idle workers on @worker_pool->idle_list. More specifically,
    nr_idle could be non-zero while ->idle_list was empty. All users of
    ->nr_idle and ->idle_list were audited; the only affected one was
    too_many_workers(), which was updated to return %false if ->idle_list is
    empty regardless of ->nr_idle.

    The trick was complicated because it merely tried to simplify an even
    more complicated problem (workers had to rebind themselves). But the
    commit a9ab775bcadf ("workqueue: directly restore CPU affinity of
    workers from CPU_ONLINE") fixed all these problems, and the trick became
    useless and is gone.

    So now @worker_pool->nr_idle is exactly the actual number of workers on
    @worker_pool->idle_list, and too_many_workers() can go back to what it
    was before the trick. So we remove the empty check.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
    There is a piece of sanity-check code in put_unbound_pool(). The meaning
    of this code is "if it is not an unbound pool, complain and return",
    IIUC. But the code uses "pool->flags & POOL_DISASSOCIATED" imprecisely,
    because a non-unbound pool may also have this flag set.

    We should use "pool->cpu < 0" to stand for an unbound pool, so we convert
    the code to that.

    It is not strictly wrong to keep "pool->flags & POOL_DISASSOCIATED"
    here, but it is just noise:
    1) we focus on "unbound" here, not "[dis]association";
    2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

10 Jun, 2014

1 commit

  • Pull workqueue updates from Tejun Heo:
    "Lai simplified worker destruction path and internal workqueue locking
    and there are some other minor changes.

    Except for the removal of some long-deprecated interfaces which
    haven't had any in-kernel user for quite a while, there shouldn't be
    any difference to workqueue users"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
    workqueue: remove the confusing POOL_FREEZING
    workqueue: rename first_worker() to first_idle_worker()
    workqueue: remove unused work_clear_pending()
    workqueue: remove unused WORK_CPU_END
    workqueue: declare system_highpri_wq
    workqueue: use generic attach/detach routine for rescuers
    workqueue: separate pool-attaching code out from create_worker()
    workqueue: rename manager_mutex to attach_mutex
    workqueue: narrow the protection range of manager_mutex
    workqueue: convert worker_idr to worker_ida
    workqueue: separate iteration role from worker_idr
    workqueue: destroy worker directly in the idle timeout handler
    workqueue: async worker destruction
    workqueue: destroy_worker() should destroy idle workers only
    workqueue: use manager lock only to protect worker_idr
    workqueue: Remove deprecated system_nrt[_freezable]_wq
    workqueue: Remove deprecated flush[_delayed]_work_sync()
    kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
    workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's

    Linus Torvalds
     

28 May, 2014

1 commit

  • This commit did an incorrect printk->pr_info conversion. If we were
    converting to pr_info() we should lose the log_level parameter. The problem is
    that this is called (indirectly) by show_regs_print_info(), which is called
    with various log_levels (from _INFO clear to _EMERG). So we leave it as
    a printk() call so the desired log_level is applied.

    Not a full revert, as the other half of the patch is correct.

    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Tejun Heo

    Valdis Kletnieks
     

22 May, 2014

1 commit

  • Currently, the global freezing state is propagated to worker_pools via
    POOL_FREEZING and then to each workqueue; however, the middle step -
    propagation through worker_pools - can be skipped as long as one or
    more max_active adjustments happen for each workqueue after the
    update to the global state is visible. The global workqueue freezing
    state and the max_active adjustments during workqueue creation and
    [un]freezing are serialized with wq_pool_mutex, so it's trivial to
    guarantee that max_actives stay in sync with global freezing state.

    POOL_FREEZING is unnecessary and makes the code more confusing and
    complicates freeze_workqueues_begin() and thaw_workqueues() by
    requiring them to walk through all pools.

    Remove POOL_FREEZING and use workqueue_freezing directly instead.

    tj: Description and comment updates.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan