27 Oct, 2015
1 commit
-
commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
My system keeps crashing with below message. vmstat_update() schedules a delayed
work in current cpu and expects the work runs in the cpu.
schedule_delayed_work() is expected to make delayed work run in local cpu. The
problem is timer can be migrated with NO_HZ. __queue_work() queues work in
timer handler, which could run in a different cpu other than where the delayed
work is scheduled. The end result is the delayed work runs in different cpu.
The patch makes __queue_delayed_work records local cpu earlier. Where the timer
runs doesn't change where the work runs with the change.[ 28.010131] ------------[ cut here ]------------
[ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
[ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[ 28.011860] Modules linked in:
[ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W4.3.0-rc3+ #634
[ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[ 28.014160] Workqueue: events vmstat_update
[ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[ 28.015445] RIP: 0010:[] []vmstat_update+0x31/0x80
[ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
[ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
[ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
[ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
[ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
[ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
[ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
[ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
[ 28.020071] Stack:
[ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
[ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
[ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
[ 28.020071] Call Trace:
[ 28.020071] [] process_one_work+0x1c8/0x540
[ 28.020071] [] ? process_one_work+0x14b/0x540
[ 28.020071] [] worker_thread+0x114/0x460
[ 28.020071] [] ? process_one_work+0x540/0x540
[ 28.020071] [] kthread+0xf8/0x110
[ 28.020071] [] ?kthread_create_on_node+0x200/0x200
[ 28.020071] [] ret_from_fork+0x3f/0x70
[ 28.020071] [] ?kthread_create_on_node+0x200/0x200Signed-off-by: Shaohua Li
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman
06 Apr, 2015
1 commit
-
The sysfs code usually belongs to the botom of the file since it deals
with high level objects. In the workqueue code it's misplaced and such
that we'll need to work around functions references to allow the sysfs
code to call APIs like apply_workqueue_attrs().Lets move that block further in the file, almost the botom.
And declare workqueue_sysfs_unregister() just before destroy_workqueue()
which reference it.tj: Moved workqueue_sysfs_unregister() forward declaration where other
forward declarations are.Suggested-by: Tejun Heo
Cc: Christoph Lameter
Cc: Kevin Hilman
Cc: Lai Jiangshan
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Tejun Heo
Cc: Viresh Kumar
Signed-off-by: Frederic Weisbecker
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
09 Mar, 2015
3 commits
-
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into its inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.This extra workqueue state dump will hopefully help chasing down hangs
involving workqueues.v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().Signed-off-by: Tejun Heo
Cc: Lai Jiangshan
Cc: Linus Torvalds
Cc: Andrew Morton
CC: Ingo Molnar -
Add wq_barrier->task and worker_pool->manager to keep track of the
flushing task and pool manager respectively. These are purely
informational and will be used to implement sysrq dump of workqueues.Signed-off-by: Tejun Heo
-
The workqueues list is protected by wq_pool_mutex and a workqueue and
its subordinate data structures are freed directly on destruction. We
want to add the ability dump workqueues from a sysrq callback which
requires walking all workqueues without grabbing wq_pool_mutex. This
patch makes freeing of workqueues RCU protected and makes the
workqueues list walkable while holding RCU read lock.Note that pool_workqueues and pools are already sched-RCU protected.
For consistency, workqueues are also protected with sched-RCU.While at it, reverse the workqueues list so that a workqueue which is
created earlier comes before. The order of the list isn't significant
functionally but this makes the planned sysrq dump list system
workqueues first.Signed-off-by: Tejun Heo
05 Mar, 2015
1 commit
-
cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation. In that case, try_to_grab_pending() returns -ENOENT. In
this case, __cancel_work_timer() currently invokes flush_work(). The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy loopingUnfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority. Let's say task A just got woken
up from flush_work() by the completion of the target work item. If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing. This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.task A task B worker
executing work
__cancel_work_timer()
try_to_grab_pending()
set work CANCELING
flush_work()
block for work completion
completion, wakes up A
__cancel_work_timer()
while (forever) {
try_to_grab_pending()
-ENOENT as work is being canceled
flush_work()
false as work is no longer executing
}This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
v3: bit_waitqueue() can't be used for work items defined in vmalloc
area. Switched to custom wake function which matches the target
work item and exclusive wait and wakeup.v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
the target bit waitqueue has wait_bit_queue's on it. Use
DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu
Vizoso.Signed-off-by: Tejun Heo
Reported-by: Rabin Vincent
Cc: Tomeu Vizoso
Cc: stable@vger.kernel.org
Tested-by: Jesper Nilsson
Tested-by: Rabin Vincent
14 Feb, 2015
1 commit
-
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.Signed-off-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jan, 2015
1 commit
-
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value. This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function. need_to_create_worker() tests the following
conditions.pending work items && !nr_running && !nr_idle
The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable inbetween making it non-zero.If this happens, manage_worker() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items. If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.We can leave the early exit condition alone and just ignore the return
value but the only reason it was put there is because the
manage_workers() used to perform both creations and destructions of
workers and thus the function may be invoked while the pool is trying
to reduce the number of workers. Now that manage_workers() is called
only when more workers are needed, the only case this early exit
condition is triggered is rare race conditions rendering it pointless.Tested with simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.Signed-off-by: Tejun Heo
Reported-by: Eric Sandeen
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner
Cc: Lai Jiangshan
Cc: stable@vger.kernel.org
09 Dec, 2014
2 commits
-
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.The rescuer will be woken again in 100ms to handle another max_requests
requests.I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests. When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance. This allows
all requests to be handled in a timely fashion.We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread. In that context
it significantly improves performance. A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.tj: Inverted the lock order between wq_mayday_lock and pool->lock with
a preceding patch and simplified this patch. Added comment and
updated changelog accordingly. Dongsu spotted missing get_pwq()
in the simplified code.Cc: Dongsu Park
Cc: Jan Kara
Cc: Lai Jiangshan
Signed-off-by: NeilBrown
Signed-off-by: Tejun Heo -
Currently, pool->lock nests inside pool->lock. There's no inherent
reason for this order. The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.This nesting order turns out to complicate things with the planned
rescuer_thread() update. Let's invert them. This doesn't cause any
behavior differences.Signed-off-by: Tejun Heo
Reviewed-by: Lai Jiangshan
Cc: NeilBrown
Cc: Dongsu Park
04 Dec, 2014
1 commit
-
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience. There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly. Replace it with the local variable.This patch causes no functional difference.
Signed-off-by: Tejun Heo
06 Oct, 2014
2 commits
-
Tidy up and use cond_resched_rcu_qs when calling cond_resched and
reporting potential quiescent state to RCU. Splitting this change in
this way allows easy backporting to -stable for kernel versions not
having cond_resched_rcu_qs().Signed-off-by: Joe Lawrence
Acked-by: Tejun Heo
Signed-off-by: Paul E. McKenney -
Similar to the stop_machine deadlock scenario on !PREEMPT kernels
addressed in b22ce2785d97 "workqueue: cond_resched() after processing
each work item", kworker threads requeueing back-to-back with zero jiffy
delay can stall RCU. The cond_resched call introduced in that fix will
yield only iff there are other higher priority tasks to run, so force a
quiescent RCU state between work items.Signed-off-by: Joe Lawrence
Link: https://lkml.kernel.org/r/20140926105227.01325697@jlaw-desktop.mno.stratus.com
Link: https://lkml.kernel.org/r/20140929115445.40221d8e@jlaw-desktop.mno.stratus.com
Fixes: b22ce2785d97 ("workqueue: cond_resched() after processing each work item")
Cc:
Acked-by: Tejun Heo
Signed-off-by: Paul E. McKenney
05 Aug, 2014
2 commits
-
Pull percpu updates from Tejun Heo:
- Major reorganization of percpu header files which I think makes
things a lot more readable and logical than before.- percpu-refcount is updated so that it requires explicit destruction
and can be reinitialized if necessary. This was pulled into the
block tree to replace the custom percpu refcnting implemented in
blk-mq.- In the process, percpu and percpu-refcount got cleaned up a bit
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
percpu-refcount: require percpu_ref to be exited explicitly
percpu-refcount: use unsigned long for pcpu_count pointer
percpu-refcount: add helpers for ->percpu_count accesses
percpu-refcount: one bit is enough for REF_STATUS
percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
workqueue: stronger test in process_one_work()
workqueue: clear POOL_DISASSOCIATED in rebind_workers()
percpu: Use ALIGN macro instead of hand coding alignment calculation
percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
percpu: preffity percpu header files
percpu: use raw_cpu_*() to define __this_cpu_*()
percpu: reorder macros in percpu header files
percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
percpu: reorganize include/linux/percpu-defs.h
percpu: move accessors from include/linux/percpu.h to percpu-defs.h
percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
percpu: introduce arch_raw_cpu_ptr()
... -
Pull workqueue updates from Tejun Heo:
"Lai has been doing a lot of cleanups of workqueue and kthread_work.
No significant behavior change. Just a lot of cleanups all over the
place. Some are a bit invasive but overall nothing too dangerous"* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
kthread_work: remove the unused wait_queue_head
kthread_work: wake up worker only when the worker is idle
workqueue: use nr_node_ids instead of wq_numa_tbl_len
workqueue: remove the misnamed out_unlock label in get_unbound_pool()
workqueue: remove the stale comment in pwq_unbound_release_workfn()
workqueue: move rescuer pool detachment to the end
workqueue: unfold start_worker() into create_worker()
workqueue: remove @wakeup from worker_set_flags()
workqueue: remove an unneeded UNBOUND test before waking up the next worker
workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
workqueue: alloc struct worker on its local node
workqueue: reuse the already calculated pwq in try_to_grab_pending()
workqueue: stronger test in process_one_work()
workqueue: clear POOL_DISASSOCIATED in rebind_workers()
workqueue: sanity check pool->cpu in wq_worker_sleeping()
workqueue: clear leftover flags when detached
workqueue: remove useless WARN_ON_ONCE()
workqueue: use schedule_timeout_interruptible() instead of open code
workqueue: remove the empty check in too_many_workers()
workqueue: use "pool->cpu < 0" to stand for an unbound pool
23 Jul, 2014
6 commits
-
They are the same and nr_node_ids is provided by the memory subsystem.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
After the locking was moved up to the caller of the get_unbound_pool(),
out_unlock label doesn't need to do any unlock operation and the name
became bad, so we just remove this label, and the only usage-site
"goto out_unlock" is subsituted to "return pool".Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
In 75ccf5950f82 ("workqueue: prepare flush_workqueue() for dynamic
creation and destrucion of unbound pool_workqueues"), a comment
about the synchronization for the pwq in pwq_unbound_release_workfn()
was added. The comment claimed the flush_mutex wasn't strictly
necessary, it was correct in that time, due to the pwq was protected
by workqueue_lock.But it is incorrect now since the wq->flush_mutex was renamed to
wq->mutex and workqueue_lock was removed, the wq->mutex is strictly
needed. But the comment was miss-updated when the synchronization
was changed.This patch removes the incorrect comments and doesn't add any new
comment to explain why wq->mutex is needed here, which is definitely
obvious and wq->pwqs_node has "WQ" notation in its definition which is
better comment.The old commit mentioned above also introduced a comment in link_pwq()
about the synchronization. This comment is also removed in this patch
since the whole link_pwq() is proteced by wq->mutex.Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
In 51697d393922 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.It is unnecessary. worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.tj: Minor description update.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
Simply unfold the code of start_worker() into create_worker() and
remove the original start_worker() and create_and_start_worker().The only trade-off is the introduced overhead that the pool->lock
is released and regrabbed after the newly worker is started.
The overhead is acceptible since the manager is slow path.And because this new locking behavior, the newly created worker
may grab the lock earlier than the manager and go to process
work items. In this case, the recheck need_to_create_worker() may be
true as expected and the manager goes to restart which is the
correct behavior.tj: Minor updates to description and comments.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
worker_set_flags() has only two callers, each specifying %true and
%false for @wakeup. Let's push the wake up to the caller and remove
@wakeup from worker_set_flags(). The caller can use the following
instead if wakeup is necessary:worker_set_flags();
if (need_more_worker(pool))
wake_up_worker(pool);This makes the code simpler. This patch doesn't introduce behavior
changes.tj: Updated description and comments.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
22 Jul, 2014
1 commit
-
In process_one_work():
if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
wake_up_worker(pool);the first test is unneeded. Even if the first test is removed, it
doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
introduce any useless wake-ups for normal per-cpu workers since
nr_running is always >= 1. It will introduce useless/redundant
wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
will also remove this redundant wake-up.tj: Minor updates to the description and comment.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
19 Jul, 2014
1 commit
-
We don't need to wake up regular worker when nr_running==1,
so need_more_worker() is sufficient here.And need_more_worker() gives us better readability due to the name of
"keep_working()" implies the rescuer should keep working now but
the rescuer is actually leaving.Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
15 Jul, 2014
1 commit
-
When the create_worker() is called from non-manager, the struct worker
is allocated from the node of the caller which may be different from the
node of pool->node.So we add a node ID argument for the alloc_worker() to ensure the
struct worker is allocated from the preferable node.tj: @nid renamed to @node for consistency.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
11 Jul, 2014
1 commit
-
try_to_grab_pending() was re-calculating the associated pwq using
get_work_pwq() when it already has it cached in a local varible and
the association can't change. Reuse the local variable instead.This doesn't introduce any functional changes.
tj: Updated description.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
07 Jul, 2014
1 commit
-
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.BUG: unable to handle kernel paging request at 0000000000001d08
IP: [] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[] ? cpumask_next_and+0x35/0x50
[] ? find_busiest_group+0x113/0x8f0
[] ? deactivate_slab+0x349/0x3c0
[] new_slab+0x91/0x300
[] __slab_alloc+0x2bb/0x482
[] ? copy_process.part.25+0xfc/0x14c0
[] ? load_balance+0x218/0x890
[] ? sched_clock+0x9/0x10
[] ? trace_clock_local+0x9/0x10
[] kmem_cache_alloc_node+0x8c/0x200
[] copy_process.part.25+0xfc/0x14c0
[] ? trace_buffer_unlock_commit+0x4d/0x60
[] ? kthread_create_on_node+0x140/0x140
[] do_fork+0xbc/0x360
[] kernel_thread+0x26/0x30
[] kthreadd+0x2c2/0x300
[] ? kthread_create_on_cpu+0x60/0x60
[] ret_from_fork+0x7c/0xb0
[] ? kthread_create_on_cpu+0x60/0x60In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.Signed-off-by: Yasuaki Ishimatsu
Reviewed-by: Lai Jiangshan
Signed-off-by: Tejun Heo
Cc: stable@vger.kernel.org
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
02 Jul, 2014
2 commits
-
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
be the same as pool->cpu without any exception even during cpu-hotplug.This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this won't hide
any possible bug which can be hit by old test.tj: Minor description update and dropped the obvious comment.
CC: Jason J. Herne
CC: Sasha Levin
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
a9ab775bcadf ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().There is nothing necessarily wrong with it, but there is no benefit
either. Let's move it into rebind_workers() and achieve the following
benefits:1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
as expected.2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
running workers of the pool are on the local CPU (pool->cpu).tj: Minor description update.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
24 Jun, 2014
1 commit
-
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.Signed-off-by: Maxime Bizon
Signed-off-by: Tejun Heo
Cc: stable@vger.kernel.org
Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
20 Jun, 2014
8 commits
-
After the recent changes, when POOL_DISASSOCIATED is cleared, the
running worker's local CPU should be the same as pool->cpu without any
exception even during cpu-hotplug. Update the sanity check in
process_one_work() accordingly.This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this will not
hide any possible bug which can be caught by the old test.tj: Minor updates to the description.
CC: Jason J. Herne
CC: Sasha Levin
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
The commit a9ab775bcadf ("workqueue: directly restore CPU affinity of
workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
without also moving "pool->flags &= ~POOL_DISASSOCIATED".There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
being moved together, but there isn't any benefit either. We move it
into rebind_workers() and achieve these benefits:1) Better readability. POOL_DISASSOCIATED is cleared in
rebind_workers() as expected.2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
running workers of the pool are on the local CPU (pool->cpu).tj: Cosmetic updates to the code and description.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
In theory, pool->cpu is equals to @cpu in wq_worker_sleeping() after
worker->flags is checked.And "pool->cpu != cpu" sanity check will help us if something wrong.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
When a worker is detached, the worker->flags may still have WORKER_UNBOUND
or WORKER_REBOUND, it is OK for all cases:
1) if it is a normal worker, the worker will be dead, it is OK.
2) if it is a rescuer, it may re-attach to a pool with this leftover flag[s],
it is still correct except it may cause unneeded wakeup.It is correct but not good, so we just remove the leftover flags.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
The @cpu is fetched via smp_processor_id() in this function,
so the check is useless.Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
the original code.Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
The commit ea1abd6197d5 ("workqueue: reimplement idle worker rebinding")
used a trick which simply removes all to-be-bound idle workers from the
idle list and lets them add themselves back after completing rebinding.And this trick caused the @worker_pool->nr_idle may deviate than the actual
number of idle workers on @worker_pool->idle_list. More specifically,
nr_idle may be non-zero while ->idle_list is empty. All users of
->nr_idle and ->idle_list are audited. The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.The commit/trick was complicated due to it just tried to simplify an even
more complicated problem (workers had to rebind itself). But the commit
a9ab775bcadf ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") fixed all these problems and the mentioned trick was
useless and is gone.So, now the @worker_pool->nr_idle is exactly the actual number of workers
on @worker_pool->idle_list. too_many_workers() should recover as it was
before the trick. So we remove the empty check.Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo -
There is a piece of sanity checks code in the put_unbound_pool().
The meaning of this code is "if it is not an unbound pool, it will complain
and return" IIUC. But the code uses "pool->flags & POOL_DISASSOCIATED"
imprecisely due to a non-unbound pool may also have this flags.We should use "pool->cpu < 0" to stand for an unbound pool, so we covert the
code to it.There is no strictly wrong if we still keep "pool->flags & POOL_DISASSOCIATED"
here, but it is just a noise if we keep it:
1) we focus on "unbound" here, not "[dis]association".
2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
10 Jun, 2014
1 commit
-
Pull workqueue updates from Tejun Heo:
"Lai simplified worker destruction path and internal workqueue locking
and there are some other minor changes.Except for the removal of some long-deprecated interfaces which
haven't had any in-kernel user for quite a while, there shouldn't be
any difference to workqueue users"* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
workqueue: remove the confusing POOL_FREEZING
workqueue: rename first_worker() to first_idle_worker()
workqueue: remove unused work_clear_pending()
workqueue: remove unused WORK_CPU_END
workqueue: declare system_highpri_wq
workqueue: use generic attach/detach routine for rescuers
workqueue: separate pool-attaching code out from create_worker()
workqueue: rename manager_mutex to attach_mutex
workqueue: narrow the protection range of manager_mutex
workqueue: convert worker_idr to worker_ida
workqueue: separate iteration role from worker_idr
workqueue: destroy worker directly in the idle timeout handler
workqueue: async worker destruction
workqueue: destroy_worker() should destroy idle workers only
workqueue: use manager lock only to protect worker_idr
workqueue: Remove deprecated system_nrt[_freezable]_wq
workqueue: Remove deprecated flush[_delayed]_work_sync()
kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
28 May, 2014
1 commit
-
This commit did an incorrect printk->pr_info conversion. If we were
converting to pr_info() we should lose the log_level parameter. The problem is
that this is called (indirectly) by show_regs_print_info(), which is called
with various log_levels (from _INFO clear to _EMERG). So we leave it as
a printk() call so the desired log_level is applied.Not a full revert, as the other half of the patch is correct.
Signed-off-by: Valdis Kletnieks
Signed-off-by: Tejun Heo
22 May, 2014
1 commit
-
Currently, the global freezing state is propagated to worker_pools via
POOL_FREEZING and then to each workqueue; however, the middle step -
propagation through worker_pools - can be skipped as long as one or
more max_active adjustments happens for each workqueue after the
update to the global state is visible. The global workqueue freezing
state and the max_active adjustments during workqueue creation and
[un]freezing are serialized with wq_pool_mutex, so it's trivial to
guarantee that max_actives stay in sync with global freezing state.POOL_FREEZING is unnecessary and makes the code more confusing and
complicates freeze_workqueues_begin() and thaw_workqueues() by
requiring them to walk through all pools.Remove POOL_FREEZING and use workqueue_freezing directly instead.
tj: Description and comment updates.
Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo