19 Feb, 2014

1 commit

  • When a kworker should die, the kworker is notified through the
    WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is
    primarily to keep the test synchronized under the worker_pool lock.
    WORKER_DIE is first set while holding pool->lock, the lock is
    dropped, and kthread_stop() is called.

    Unfortunately, this means that there's a slight chance that the
    target kworker may see WORKER_DIE before kthread_stop() finishes,
    exit on its own, and have its task struct freed before or during
    kthread_stop().

    Fix it by pinning the target task before setting WORKER_DIE and
    putting it after kthread_stop() is done (see the sketch below).

    tj: Improved patch description and comment. Moved pinning above the
    WORKER_DIE setting to better signify what it's protecting.

    CC: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
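
    A minimal sketch of the idea (simplified from the actual patch;
    names follow kernel/workqueue.c but details are approximate):

        spin_lock_irq(&pool->lock);
        /* Pin the task first so its task_struct can't be freed while
         * kthread_stop() still references it, even if the worker sees
         * WORKER_DIE and exits on its own. */
        get_task_struct(worker->task);
        worker->flags |= WORKER_DIE;
        spin_unlock_irq(&pool->lock);

        kthread_stop(worker->task);
        put_task_struct(worker->task);  /* release the pin */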
     

16 Dec, 2013

1 commit

  • Pull PCI updates from Bjorn Helgaas:
    "PCI device hotplug
    - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael
    Wysocki)

    Host bridge drivers
    - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn
    Helgaas)
    - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    (Jason Gunthorpe)

    Miscellaneous
    - Avoid unnecessary CPU switch when calling .probe() (Alexander
    Duyck)
    - Revert "workqueue: allow work_on_cpu() to be called recursively"
    (Bjorn Helgaas)
    - Disable Bus Master only on kexec reboot (Khalid Aziz)
    - Omit PCI ID macro strings to shorten quirk names for LTO (Michal
    Marek)"

    * tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers
    PCI: Disable Bus Master only on kexec reboot
    PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    PCI: Omit PCI ID macro strings to shorten quirk names
    PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
    Revert "workqueue: allow work_on_cpu() to be called recursively"
    PCI: Avoid unnecessary CPU switch when calling driver .probe() method

    Linus Torvalds
     

26 Nov, 2013

1 commit

  • This reverts commit c2fda509667b0fda4372a237f5a59ea4570b1627.

    c2fda509667b removed lockdep annotation from work_on_cpu() to work around
    the PCI path that calls work_on_cpu() from within a work_on_cpu() work item
    (PF driver .probe() method -> pci_enable_sriov() -> add VFs -> VF driver
    .probe method).

    961da7fb6b22 ("PCI: Avoid unnecessary CPU switch when calling driver
    .probe() method) avoids that recursive work_on_cpu() use in a different
    way, so this revert restores the work_on_cpu() lockdep annotation.

    Signed-off-by: Bjorn Helgaas
    Acked-by: Tejun Heo

    Bjorn Helgaas
     

23 Nov, 2013

4 commits

  • When a work item starts execution, the high bits of its data field
    contain the pool ID; the largest value the field can represent is
    WORK_OFFQ_POOL_NONE. WORK_OFFQ_POOL_NONE is assigned as the pool ID
    when a work item is initialized, indicating that no pool is
    associated, and get_work_pool() uses it to check for the associated
    pool. So if worker_pool_assign_id() assigns an ID greater than or
    equal to WORK_OFFQ_POOL_NONE to a pool, the encoding leaks into the
    "no pool" value and the non-reentrance guarantee may be broken.

    This patch fixes the issue by making worker_pool_assign_id() call
    idr_alloc() with WORK_OFFQ_POOL_NONE as the @end parameter (see the
    sketch below).

    Furthermore, the BUILD_BUG_ON() in init_workqueues() makes no sense
    in the current implementation: the number of worker pools needed
    cannot be determined at compile time because the number of backing
    pools for UNBOUND workqueues varies with the assigned custom
    attributes. So remove it.

    tj: Minor comment and indentation updates.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
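
    A minimal sketch of the fix (hypothetical simplification of
    worker_pool_assign_id()):

        lockdep_assert_held(&wq_pool_mutex);

        /* cap @end at WORK_OFFQ_POOL_NONE so a valid pool ID can never
         * collide with the "no pool associated" encoding */
        ret = idr_alloc(&worker_pool_idr, pool, 0, WORK_OFFQ_POOL_NONE,
                        GFP_KERNEL);
        if (ret >= 0) {
                pool->id = ret;
                return 0;
        }
        return ret;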
     
  • It seems the "dying" should be "draining" here.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
     
  • An ordered workqueue implements execution ordering by using a single
    pool_workqueue with max_active == 1. On a given pool_workqueue, work
    items are processed in FIFO order, and limiting max_active to 1
    forces the queued work items to be processed one by one.

    Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
    unbound workqueues") accidentally broke this guarantee by applying
    NUMA affinity to ordered workqueues too. On NUMA setups, an ordered
    workqueue would end up with separate pool_workqueues for different
    nodes. Each pool_workqueue still limits max_active to 1 but multiple
    work items may be executed concurrently and out of order depending on
    which node they are queued to.

    Fix it by using dedicated ordered_wq_attrs[] when creating ordered
    workqueues. The new attrs match the unbound ones except that no_numa
    is always set, forcing all NUMA nodes to share the default
    pool_workqueue (see the sketch below).

    While at it, add a sanity check to the workqueue creation path which
    verifies that an ordered workqueue has only the default
    pool_workqueue.

    Signed-off-by: Tejun Heo
    Reported-by: Libin
    Cc: stable@vger.kernel.org
    Cc: Lai Jiangshan

    Tejun Heo
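
    A rough sketch of the idea (simplified; error handling omitted; the
    real code lives in init_workqueues() and the workqueue creation
    path):

        int i;

        /* one attrs per std nice level, mirroring unbound_std_wq_attrs */
        for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
                struct workqueue_attrs *attrs;

                attrs = alloc_workqueue_attrs(GFP_KERNEL);
                attrs->nice = std_nice[i];
                attrs->no_numa = true;  /* all nodes share the dfl pwq */
                ordered_wq_attrs[i] = attrs;
        }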
     
  • Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
    in create_worker(); otherwise userland can change ->cpus_allowed
    in between (see the sketch below).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
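
    Roughly, the reordering in create_worker() (hypothetical
    simplification):

        /* Set PF_NO_SETAFFINITY before binding the worker so userland
         * has no window in which to change ->cpus_allowed. */
        worker->task->flags |= PF_NO_SETAFFINITY;
        set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);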
     

07 Sep, 2013

1 commit

  • Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree -- mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
    doc: Documentation/cputopology.txt fix typo
    treewide: Convert retrun typos to return
    Fix comment typo for init_cma_reserved_pageblock
    Documentation/trace: Correcting and extending tracepoint documentation
    mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
    power: Documentation: Update s2ram link
    doc: fix a typo in Documentation/00-INDEX
    Documentation/printk-formats.txt: No casts needed for u64/s64
    doc: Fix typo "is is" in Documentations
    treewide: Fix printks with 0x%#
    zram: doc fixes
    Documentation/kmemcheck: update kmemcheck documentation
    doc: documentation/hwspinlock.txt fix typo
    PM / Hibernate: add section for resume options
    doc: filesystems : Fix typo in Documentations/filesystems
    scsi/megaraid fixed several typos in comments
    ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
    treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
    page_isolation: Fix a comment typo in test_pages_isolated()
    doc: fix a typo about irq affinity
    ...

    Linus Torvalds
     

04 Sep, 2013

2 commits

  • Pull workqueue updates from Tejun Heo:
    "Nothing interesting. All are doc / comment updates"

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Correct/Drop references to gcwq in Documentation
    workqueue: Fix manage_workers() RETURNS description
    workqueue: Comment correction in file header
    workqueue: mark WQ_NON_REENTRANT deprecated

    Linus Torvalds
     
  • Pull driver core patches from Greg KH:
    "Here's the big driver core pull request for 3.12-rc1.

    Lots of tiny changes here fixing up the way sysfs attributes are
    created, to try to make drivers simpler, and to fix a whole class of
    race conditions with creation of device attributes after the device
    was announced to userspace.

    All the various pieces are acked by the different subsystem
    maintainers"

    * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
    firmware loader: fix pending_fw_head list corruption
    drivers/base/memory.c: introduce help macro to_memory_block
    dynamic debug: line queries failing due to uninitialized local variable
    sysfs: sysfs_create_groups returns a value.
    debugfs: provide debugfs_create_x64() when disabled
    rbd: convert bus code to use bus_groups
    firmware: dcdbas: use binary attribute groups
    sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
    driver core: add #include to core files.
    HID: convert bus code to use dev_groups
    Input: serio: convert bus code to use drv_groups
    Input: gameport: convert bus code to use drv_groups
    driver core: firmware: use __ATTR_RW()
    driver core: core: use DEVICE_ATTR_RO
    driver core: bus: use DRIVER_ATTR_WO()
    driver core: create write-only attribute macros for devices and drivers
    sysfs: create __ATTR_WO()
    driver-core: platform: convert bus code to use dev_groups
    workqueue: convert bus code to use dev_groups
    MEI: convert bus code to use dev_groups
    ...

    Linus Torvalds
     

29 Aug, 2013

1 commit

  • If !PREEMPT, a kworker running work items back to back can hog the
    CPU. This becomes dangerous when a self-requeueing work item which is
    waiting for something to happen races against stop_machine. Such a
    self-requeueing work item would requeue itself indefinitely, hogging
    the kworker and the CPU it's running on, while stop_machine would
    wait for that CPU to enter stop_machine while preventing anything
    else from happening on all other CPUs. The two would deadlock.

    Jamie Liu reports that this deadlock scenario exists around
    scsi_requeue_run_queue() and libata port multiplier support, where one
    port may exclude command processing from other ports. With the right
    timing, scsi_requeue_run_queue() can end up requeueing itself trying
    to execute an IO which is asked to be retried while another device has
    an exclusive access, which in turn can't make forward progress due to
    stop_machine.

    Fix it by invoking cond_resched() after executing each work item (see
    the sketch below).

    Signed-off-by: Tejun Heo
    Reported-by: Jamie Liu
    References: http://thread.gmane.org/gmane.linux.kernel/1552567
    Cc: stable@vger.kernel.org
    --
    kernel/workqueue.c | 9 +++++++++
    1 file changed, 9 insertions(+)

    Tejun Heo
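
    A minimal sketch of the fix (hypothetical simplification of the
    worker loop in kernel/workqueue.c):

        process_one_work(worker, work);
        /* Under !PREEMPT a kworker running items back to back never
         * yields; let other tasks (e.g. the stop_machine stopper) run. */
        cond_resched();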
     

20 Aug, 2013

1 commit

  • When building the htmldocs (in verbose mode), scripts/kernel-doc reports the
    following type of warnings:

    Warning(kernel/workqueue.c:653): No description found for return value of
    'get_work_pool'

    Fix them by:
    - Using "Return:" sections to introduce descriptions of return values
    - Adding some missing descriptions

    An illustrative "Return:" section is sketched at the end of this
    entry.

    Signed-off-by: Yacine Belkadi
    Signed-off-by: Jiri Kosina

    Yacine Belkadi
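
    For reference, a kernel-doc comment with a "Return:" section looks
    like this (illustrative example, not taken verbatim from the patch):

        /**
         * get_work_pool - return the worker_pool a given work was associated with
         * @work: the work item of interest
         *
         * Return: the worker_pool @work was last associated with, %NULL if none.
         */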
     

07 Aug, 2013

1 commit

  • Pull two workqueue fixes from Tejun Heo:
    "A lockdep notation update so that nested work_on_cpu() invocations
    don't lead to spurious lockdep warnings and fix for an unbound attr
    bug which made what's shown in sysfs deviate from the actual ones.
    Both patches have pretty limited scope"

    * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: copy workqueue_attrs with all fields
    workqueue: allow work_on_cpu() to be called recursively

    Linus Torvalds
     

01 Aug, 2013

1 commit

  • $ echo '0' > /sys/bus/workqueue/devices/xxx/numa
    $ cat /sys/bus/workqueue/devices/xxx/numa

    I got 1; it should be 0. The reason is that copy_workqueue_attrs(),
    called from apply_workqueue_attrs(), doesn't copy the no_numa field.

    Fix it by making copy_workqueue_attrs() copy ->no_numa too (see the
    sketch below). This would also make get_unbound_pool() set a pool's
    ->no_numa attribute according to the workqueue attributes used when
    the pool was created. While harmless, as ->no_numa isn't a pool
    attribute, this is a bit confusing. Clear it explicitly.

    tj: Updated description and comments a bit.

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Shaohua Li
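
    A sketch of the fixed helper (simplified; the comment is mine):

        static void copy_workqueue_attrs(struct workqueue_attrs *to,
                                         const struct workqueue_attrs *from)
        {
                to->nice = from->nice;
                cpumask_copy(to->cpumask, from->cpumask);
                /* previously left uncopied, so writes to the sysfs
                 * "numa" file never stuck */
                to->no_numa = from->no_numa;
        }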
     

25 Jul, 2013

1 commit

  • If @fn calls work_on_cpu() again, lockdep will complain:

    > [ INFO: possible recursive locking detected ]
    > 3.11.0-rc1-lockdep-fix-a #6 Not tainted
    > ---------------------------------------------
    > kworker/0:1/142 is trying to acquire lock:
    > ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
    >
    > but task is already holding lock:
    > ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x169/0x610
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock((&wfc.work));
    > lock((&wfc.work));
    >
    > *** DEADLOCK ***

    This is a false-positive lockdep report. In this situation, the two
    "wfc"s of the two work_on_cpu() invocations are different; they are
    both on stack, so flush_work() can't deadlock.

    To fix this, avoid the lockdep checking in this case by introducing
    an internal __flush_work() which skips the lockdep annotation (see
    the sketch below).

    tj: Minor comment adjustment.

    Signed-off-by: Lai Jiangshan
    Reported-by: "Srivatsa S. Bhat"
    Reported-by: Alexander Duyck
    Signed-off-by: Tejun Heo

    Lai Jiangshan
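
    Roughly, the split looks like this (simplified; wq_barrier setup
    details omitted):

        static bool __flush_work(struct work_struct *work)
        {
                struct wq_barrier barr;

                if (start_flush_work(work, &barr)) {
                        wait_for_completion(&barr.done);
                        destroy_work_on_stack(&barr.work);
                        return true;
                }
                return false;
        }

        bool flush_work(struct work_struct *work)
        {
                /* the lockdep annotation stays here; work_on_cpu()
                 * calls __flush_work() directly and skips it */
                lock_map_acquire(&work->lockdep_map);
                lock_map_release(&work->lockdep_map);
                return __flush_work(work);
        }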
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

03 Jul, 2013

1 commit

  • Pull workqueue changes from Tejun Heo:
    "Surprisingly, Lai and I didn't break too many things implementing
    custom pools and stuff last time around and there aren't any follow-up
    changes necessary at this point.

    The only change in this pull request is Viresh's patches to make some
    per-cpu workqueues to behave as unbound workqueues dependent on a boot
    param whose default can be configured via a config option. This leads
    to higher processing overhead / lower bandwidth as more work items are
    bounced across CPUs; however, it can lead to noticeable powersave in
    certain configurations - ~10% w/ idlish constant workload on a
    big.LITTLE configuration according to Viresh.

    This is because per-cpu workqueues interfere with how the scheduler
    perceives whether or not each CPU is idle by forcing pinned tasks on
    them, which makes the scheduler's power-aware scheduling decisions
    less effective.

    Its effectiveness is likely less pronounced on homogenous
    configurations and this type of optimization can probably be made
    automatic; however, the changes are pretty minimal and the affected
    workqueues are clearly marked, so it's an easy gain for some
    configurations for the time being with pretty unintrusive changes."

    * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    fbcon: queue work on power efficient wq
    block: queue work on power efficient wq
    PHYLIB: queue work on system_power_efficient_wq
    workqueue: Add system wide power_efficient workqueues
    workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

    Linus Torvalds
     

16 May, 2013

1 commit

  • wq_numa_init() builds per-node cpumasks which are later used to make
    unbound workqueues NUMA-aware. The cpumasks are allocated using
    alloc_cpumask_var_node() for all possible nodes. Unfortunately, on
    machines with off-line nodes, this leads to NUMA-aware allocations on
    the offline nodes, which in turn trigger a BUG in the memory
    allocation code.

    Fix it by using NUMA_NO_NODE for cpumask allocations on offline
    nodes (sketched below the oops).

    kernel BUG at include/linux/gfp.h:323!
    invalid opcode: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
    Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
    task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
    RIP: 0010:[] [] new_slab+0x2ad/0x340
    RSP: 0000:ffff880234603bf8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
    RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
    R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
    FS: 0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
    ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
    ffff880237816140 0000000000000000 ffff88023740e1c0
    Call Trace:
    [] __slab_alloc+0x330/0x4f2
    [] kmem_cache_alloc_node_trace+0xa5/0x200
    [] alloc_cpumask_var_node+0x28/0x90
    [] wq_numa_init+0x10d/0x1be
    [] init_workqueues+0x64/0x341
    [] do_one_initcall+0xea/0x1a0
    [] kernel_init_freeable+0xb7/0x1ec
    [] kernel_init+0xe/0xf0
    [] ret_from_fork+0x7c/0xb0
    Code: 45 84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee

    Signed-off-by: Tejun Heo
    Reported-and-Tested-by: Lingzhu Xiang

    Tejun Heo
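
    A minimal sketch of the fix in wq_numa_init() (simplified):

        for_each_node(node)
                BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                node_online(node) ? node : NUMA_NO_NODE));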
     

15 May, 2013

4 commits

  • Commit 8425e3d5bdbe ("workqueue: inline trivial wrappers") changed
    schedule_work() and schedule_delayed_work() into inline wrappers,
    but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
    the original functions were EXPORT_SYMBOL. This has the effect of
    changing the licensing requirement for these functions and making
    them unavailable to non-GPL modules.

    Make them available again by removing the GPL restriction on the
    required symbols (see the sketch below).

    Signed-off-by: Marc Dionne
    Signed-off-by: Tejun Heo

    Marc Dionne
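
    The wrappers in question look roughly like this; the fix relaxes the
    export on the symbols they rely on (a sketch under the assumption
    that system_wq and the queue_*_on() helpers are the affected
    symbols):

        static inline bool schedule_work(struct work_struct *work)
        {
                return queue_work(system_wq, work);  /* needs system_wq */
        }

        /* was EXPORT_SYMBOL_GPL(system_wq); after the fix: */
        EXPORT_SYMBOL(system_wq);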
     
  • When we fail to mutex_trylock(), we release the pool spin_lock and do
    mutex_lock(). After that, we should regrab the pool spin_lock, but
    the regrab is missing in the current code. Correct it as sketched
    below.

    Cc: Lai Jiangshan
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
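
    A sketch of the corrected sequence (hypothetical simplification of
    the manage_workers() locking dance):

        if (!mutex_trylock(&pool->manager_mutex)) {
                spin_unlock_irq(&pool->lock);
                mutex_lock(&pool->manager_mutex);
                spin_lock_irq(&pool->lock);     /* this regrab was missing */
        }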
     
  • This patch adds system-wide workqueues aligned towards power saving.
    This is done by allocating them with the WQ_UNBOUND flag if
    'wq_power_efficient' is set to 'true'.

    tj: updated comments a bit.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     
  • Workqueues can be performance- or power-oriented. Currently, most
    workqueues are bound to the CPU they were created on. This gives good
    performance (due to cache effects) at the cost of potentially waking
    up otherwise idle cores (idle from the scheduler's perspective, which
    may or may not be physically idle) just to process some work. To save
    power, we can allow the work to be rescheduled on a core that is
    already awake.

    Workqueues created with the WQ_UNBOUND flag will allow some power
    savings. However, we don't change the default behaviour of the
    system. To enable power-saving behaviour, a new config option
    CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also
    be overridden by the workqueue.power_efficient boot parameter (a
    usage sketch follows).

    tj: Updated config description and comments. Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

    Signed-off-by: Viresh Kumar
    Reviewed-by: Amit Kucheria
    Signed-off-by: Tejun Heo

    Viresh Kumar
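
    A usage sketch (illustrative; "my_wq" and my_work are hypothetical):

        /* per-cpu by default; becomes WQ_UNBOUND, and thus schedulable
         * on any awake CPU, when workqueue.power_efficient is enabled */
        struct workqueue_struct *wq =
                alloc_workqueue("my_wq", WQ_POWER_EFFICIENT, 0);

        queue_work(wq, &my_work);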
     

11 May, 2013

1 commit

  • df2d5ae499 ("workqueue: map an unbound workqueues to multiple per-node
    pool_workqueues") made unbound workqueues to map to multiple per-node
    pool_workqueues and accordingly updated workqueue_contested() so that,
    for unbound workqueues, it maps the specified @cpu to the NUMA node
    number to obtain the matching pool_workqueue to query the congested
    state.

    Before this change, workqueue_congested() ignored @cpu for unbound
    workqueues as there was only one pool_workqueue and some users
    (fscache) called it with WORK_CPU_UNBOUND. After the commit, this
    causes the following oops as WORK_CPU_UNBOUND gets translated to
    garbage by cpu_to_node().

    BUG: unable to handle kernel paging request at ffff8803598d98b8
    IP: [] unbound_pwq_by_node+0xa1/0xfa
    PGD 2421067 PUD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 2689 Comm: cat Tainted: GF 3.9.0-fsdevel+ #4
    task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000
    RIP: 0010:[] [] unbound_pwq_by_node+0xa1/0xfa
    RSP: 0018:ffff880025807ad8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003
    RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0
    RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898
    R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97
    R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0
    FS: 00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce
    ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0
    ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8
    Call Trace:
    [] workqueue_congested+0xb7/0x12c
    [] fscache_enqueue_object+0xb2/0xe8 [fscache]
    [] __fscache_acquire_cookie+0x3b9/0x56c [fscache]
    [] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs]
    [] do_open+0x9/0xd [nfs]
    [] do_dentry_open+0x175/0x24b
    [] finish_open+0x41/0x51

    Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND
    (sketch below).

    Signed-off-by: Tejun Heo
    Reported-by: David Howells
    Tested-and-Acked-by: David Howells

    Tejun Heo
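
    The fix amounts to a tiny mapping at the top of workqueue_congested()
    (simplified):

        if (cpu == WORK_CPU_UNBOUND)
                cpu = smp_processor_id();  /* map "don't care" to a real CPU */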
     

01 May, 2013

1 commit

  • One of the problems that arise when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by
    workqueue anonymizes each worker, making it more difficult to identify
    what the worker was doing on which target from the output of sysrq-t
    or a debug dump from oops, BUG() and friends.

    This patch implements set_worker_desc() which can be called from any
    workqueue work function to set its description. When the worker task
    is dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep
    assertion and so on - the description will be printed out together
    with the workqueue name and the worker function pointer (see the
    usage sketch at the end of this entry).

    The printing side is implemented by print_worker_info() which is called
    from functions in task dump paths - sched_show_task() and
    dump_stack_print_info(). print_worker_info() can be safely called on
    any task in any state as long as the task struct itself is accessible.
    It uses probe_*() functions to access worker fields. It may print
    garbage if something went very wrong, but it wouldn't cause (another)
    oops.

    The description is currently limited to 24 bytes including the
    terminating \0. worker->desc_valid and worker->desc[] are added, and
    the 64-byte marker, which was already incorrect before adding the new
    fields, is moved to the correct position.

    Here's an example dump with writeback updated to set the bdi name as
    worker desc.

    Hardware name: Bochs
    Modules linked in:
    Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
    Workqueue: writeback bdi_writeback_workfn (flush-8:0)
    ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
    ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
    ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2a0/0x3b0
    ...

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: Jan Kara
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
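
    A usage sketch based on the writeback example above (simplified from
    bdi_writeback_workfn()):

        static void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                        struct bdi_writeback, dwork);

                /* shows up as "(flush-8:0)" next to the workqueue name */
                set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
                /* ... actual writeback work ... */
        }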
     

04 Apr, 2013

1 commit

  • destroy_workqueue() performs several sanity checks before proceeding
    with destruction of a workqueue. One of the checks verifies that
    refcnt of each pwq (pool_workqueue) is over 1 as at that point there
    should be no in-flight work items and the only holder of pwq refs is
    the workqueue itself.

    This worked fine as a workqueue used to hold only one reference to its
    pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity
    for unbound workqueues"), a workqueue may hold multiple references to
    its default pwq triggering this sanity check spuriously.

    Fix it by not triggering the pwq->refcnt assertion on default pwqs
    (sketch below).

    An example spurious WARN trigger follows.

    WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
    Hardware name: 4286C12
    Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
    Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
    Call Trace:
    [] warn_slowpath_common+0x7c/0x93
    [] warn_slowpath_null+0x22/0x24
    [] destroy_workqueue+0x6a/0x13e
    [] ext4_put_super+0x43/0x2c4
    [] generic_shutdown_super+0x4b/0xb9
    [] kill_block_super+0x22/0x60
    [] deactivate_locked_super+0x2f/0x56
    [] deactivate_super+0x2e/0x31
    [] mntput_no_expire+0x103/0x108
    [] sys_umount+0x2a2/0x2c4
    [] sys_oldumount+0x1e/0x20
    [] sysenter_do_call+0x12/0x38

    tj: Rewrote description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu

    Lai Jiangshan
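
    Roughly, the refcnt part of the per-pwq sanity check becomes
    (hypothetical simplification):

        /* the default pwq may legitimately hold extra references */
        if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)))
                return;     /* bail out instead of destroying a live wq */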
     

02 Apr, 2013

9 commits

  • Writeback conversion to workqueue will be based on top of wq/for-3.10
    branch to take advantage of custom attrs and NUMA support for unbound
    workqueues. Mainline currently contains two commits which result in
    non-trivial merge conflicts with wq/for-3.10 and because
    block/for-3.10/core is based on v3.9-rc3 which contains one of the
    conflicting commits, we need a pre-merge-window merge anyway. Let's
    pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
    from workqueue merge conflicts.

    The two conflicts and their resolutions:

    * e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
    worker_pool_assign_id() to use idr_alloc() instead of the old idr
    interface. worker_pool_assign_id() goes through multiple locking
    changes in wq/for-3.10 causing the following conflict.

    static int worker_pool_assign_id(struct worker_pool *pool)
    {
            int ret;

    <<<<<<< HEAD
            lockdep_assert_held(&wq_pool_mutex);

            do {
                    if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
                            return -ENOMEM;
                    ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
            } while (ret == -EAGAIN);
    =======
            mutex_lock(&worker_pool_idr_mutex);
            ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
            if (ret >= 0)
                    pool->id = ret;
            mutex_unlock(&worker_pool_idr_mutex);
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89

            return ret < 0 ? ret : 0;
    }

    We want locking from the former and idr_alloc() usage from the
    latter, which can be combined to the following.

    static int worker_pool_assign_id(struct worker_pool *pool)
    {
            int ret;

            lockdep_assert_held(&wq_pool_mutex);

            ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
            if (ret >= 0) {
                    pool->id = ret;
                    return 0;
            }
            return ret;
    }

    * eb2834285c ("workqueue: fix possible pool stall bug in
    wq_unbind_fn()") updated wq_unbind_fn() such that it has single
    larger for_each_std_worker_pool() loop instead of two separate loops
    with a schedule() call inbetween. wq/for-3.10 renamed
    pool->assoc_mutex to pool->manager_mutex causing the following
    conflict (earlier function body and comments omitted for brevity).

    static void wq_unbind_fn(struct work_struct *work)
    {
            ...
                    spin_unlock_irq(&pool->lock);
    <<<<<<< HEAD
                    mutex_unlock(&pool->manager_mutex);
            }
    =======
                    mutex_unlock(&pool->assoc_mutex);
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89

                    schedule();

    <<<<<<< HEAD
            for_each_cpu_worker_pool(pool, cpu)
    =======
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
                    atomic_set(&pool->nr_running, 0);

                    spin_lock_irq(&pool->lock);
                    wake_up_worker(pool);
                    spin_unlock_irq(&pool->lock);
            }
    }

    The resolution is mostly trivial. We want the control flow of the
    latter with the rename of the former.

    static void wq_unbind_fn(struct work_struct *work)
    {
            ...
                    spin_unlock_irq(&pool->lock);
                    mutex_unlock(&pool->manager_mutex);

                    schedule();

                    atomic_set(&pool->nr_running, 0);

                    spin_lock_irq(&pool->lock);
                    wake_up_worker(pool);
                    spin_unlock_irq(&pool->lock);
            }
    }

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • …o disable NUMA affinity

    Unbound workqueues are now NUMA aware. Let's add some control knobs
    and update sysfs interface accordingly.

    * Add kernel param workqueue.disable_numa which disables NUMA
    affinity globally.

    * Replace sysfs file "pool_id" with "pool_ids" which contain
    node:pool_id pairs. This change is userland-visible but "pool_id"
    hasn't seen a release yet, so this is okay.

    * Add a new sysfs file "numa" which can toggle NUMA affinity on
    individual workqueues. This is implemented as attrs->no_numa which
    is special in that it isn't part of a pool's attributes. It only
    affects how apply_workqueue_attrs() picks which pools to use.

    After "pool_ids" change, first_pwq() doesn't have any user left.
    Removed.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

    Tejun Heo
     
  • Currently, an unbound workqueue has a single current, or first, pwq
    (pool_workqueue) to which all new work items are queued. This often
    isn't optimal on NUMA machines as workers may jump around across node
    boundaries and work items get assigned to workers without any regard
    to NUMA affinity.

    This patch implements NUMA affinity for unbound workqueues. Instead
    of mapping all entries of numa_pwq_tbl[] to the same pwq,
    apply_workqueue_attrs() now creates a separate pwq covering the
    intersecting CPUs for each NUMA node which has online CPUs in
    @attrs->cpumask. Nodes which don't have intersecting possible CPUs
    are mapped to pwqs covering whole @attrs->cpumask.

    As CPUs come up and go down, the pool association is changed
    accordingly. Changing pool association may involve allocating new
    pools which may fail. To avoid failing CPU_DOWN, each workqueue
    always keeps a default pwq which covers whole attrs->cpumask which is
    used as fallback if pool creation fails during a CPU hotplug
    operation.

    This ensures that all work items issued on a NUMA node are executed
    on the same node as long as the workqueue allows execution on the
    CPUs of the node.

    As this maps a workqueue to multiple pwqs and max_active is per-pwq,
    this changes the behavior of max_active. The limit is now per NUMA
    node instead of global. While this is an actual change, max_active is
    already per-cpu for per-cpu workqueues and is primarily used as a
    safety mechanism rather than for active concurrency control.
    Concurrency is usually limited by workqueue users via the number of
    concurrently active work items, so this change shouldn't matter much.

    v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
    by Lai.

    v3: The previous version incorrectly made a workqueue spanning
    multiple nodes spread work items over all online CPUs when some of
    its nodes don't have any desired cpus. Reimplemented so that NUMA
    affinity is properly updated as CPUs go up and down. This problem
    was spotted by Lai Jiangshan.

    v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
    however, wq may be freed at any time after dfl_pwq is put making
    the clearing use-after-free. Clear wq->dfl_pwq before putting it.

    v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
    @pwq_tbl after success. Fixed.

    Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
    application of new attrs is excluded via CPU hotplug. Removed.

    Documentation on CPU affinity guarantee on CPU_DOWN added.

    All changes are suggested by Lai Jiangshan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Factor out lock pool, put_pwq(), unlock sequence into
    put_pwq_unlocked(). The two existing places are converted and there
    will be more with NUMA affinity support.

    This is to prepare for NUMA affinity support for unbound workqueues
    and doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
    from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq()
    is made safe to call multiple times. numa_pwq_tbl_install() links the
    pwq, installs it into numa_pwq_tbl[] at the specified node and returns
    the old entry.

    @last_pwq is removed from link_pwq() as the return value of the new
    function can be used instead.

    This is to prepare for NUMA affinity support for unbound workqueues.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Use kmem_cache_alloc_node() with @pool->node instead of
    kmem_cache_zalloc() when allocating a pool_workqueue so that it's
    allocated on the same node as the associated worker_pool. As there's
    no kmem_cache_zalloc_node(), move zeroing to init_pwq() (sketch
    below).

    This was suggested by Lai Jiangshan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
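
    A sketch of the allocation change (simplified):

        /* was: pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL); */
        pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);

        /* zeroing now happens in init_pwq() */
        memset(pwq, 0, sizeof(*pwq));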
     
  • Break init_and_link_pwq() into init_pwq() and link_pwq() and move
    unbound-workqueue specific handling into apply_workqueue_attrs().
    Also, factor out unbound pool and pool_workqueue allocation into
    alloc_unbound_pwq().

    This reorganization is to prepare for NUMA affinity and doesn't
    introduce any functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Currently, an unbound workqueue has only one "current" pool_workqueue
    associated with it. It may have multiple pool_workqueues, but only
    the first pool_workqueue serves new work items. For NUMA affinity, we
    want to change this so that there are multiple current pool_workqueues
    serving different NUMA nodes.

    Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
    points to the pool_workqueue to use for each possible node (the
    lookup helper is sketched below). This replaces first_pwq() in
    __queue_work() and workqueue_congested().

    numa_pwq_tbl[] is currently initialized to point to the same
    pool_workqueue as first_pwq() so this patch doesn't make any behavior
    changes.

    v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
    may be called only with wq->mutex held.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
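
    The lookup helper is tiny (roughly as merged; the comment is mine):

        static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                          int node)
        {
                /* raw deref: callers hold the RCU read lock or wq->mutex */
                return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }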
     
  • Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and
    align them to the cacheline. These two fields are used in the work
    item issue path and thus hot. The scheduled NUMA affinity support
    will add a dispatch table at the end of workqueue_struct, and
    relocating these two fields will allow us to hit only a single
    cacheline on hot paths.

    Note that wq->pwqs isn't moved although it currently is being used in
    the work item issue path for unbound workqueues. The dispatch table
    mentioned above will replace its use in the issue path, so it will
    become cold once NUMA support is implemented.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo