Commit 94ca9d669a1308fefe476fde750c5297b6f86f3f
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add documentation
Documentation/workqueue.txt
1 | + | |
2 | +Concurrency Managed Workqueue (cmwq) | |
3 | + | |
4 | +September, 2010 Tejun Heo <tj@kernel.org> | |
5 | + Florian Mickler <florian@mickler.org> | |
6 | + | |
7 | +CONTENTS | |
8 | + | |
9 | +1. Introduction | |
10 | +2. Why cmwq? | |
11 | +3. The Design | |
12 | +4. Application Programming Interface (API) | |
13 | +5. Example Execution Scenarios | |
14 | +6. Guidelines | |
15 | + | |
16 | + | |
17 | +1. Introduction | |
18 | + | |
19 | +There are many cases where an asynchronous process execution context | |
20 | +is needed and the workqueue (wq) API is the most commonly used | |
21 | +mechanism for such cases. | |
22 | + | |
23 | +When such an asynchronous execution context is needed, a work item | |
24 | +describing which function to execute is put on a queue. An | |
25 | +independent thread serves as the asynchronous execution context. The | |
26 | +queue is called a workqueue and the thread is called a worker. | |
27 | + | |
28 | +While there are work items on the workqueue the worker executes the | |
29 | +functions associated with the work items one after the other. When | |
30 | +there is no work item left on the workqueue the worker becomes idle. | |
31 | +When a new work item gets queued, the worker begins executing again. | |
32 | + | |
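For a rough illustration (the identifiers below are made up for this
sketch), a caller could define and queue a work item like this:

  #include <linux/kernel.h>
  #include <linux/workqueue.h>

  static void example_work_fn(struct work_struct *work)
  {
          pr_info("example work item is running\n");
  }

  static DECLARE_WORK(example_work, example_work_fn);

  static void example_trigger(void)
  {
          /* hand the item to a worker via the default system workqueue */
          schedule_work(&example_work);
  }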
33 | + | |
34 | +2. Why cmwq? | |
35 | + | |
36 | +In the original wq implementation, a multi threaded (MT) wq had one | |
37 | +worker thread per CPU and a single threaded (ST) wq had one worker | |
38 | +thread system-wide. A single MT wq needed to keep around the same | |
39 | +number of workers as the number of CPUs. The kernel grew a lot of MT | |
40 | +wq users over the years and with the number of CPU cores continuously | |
41 | +rising, some systems saturated the default 32k PID space just booting | |
42 | +up. | |
43 | + | |
44 | +Although MT wq wasted a lot of resources, the level of concurrency | |
45 | +provided was unsatisfactory. The limitation was common to both ST and | |
46 | +MT wq, albeit less severe on MT. Each wq maintained its own separate | |
47 | +worker pool. An MT wq could provide only one execution context per CPU | |
48 | +while an ST wq provided one for the whole system. Work items had to | |
49 | +compete for those very limited execution contexts, leading to various | |
50 | +problems including deadlocks around the single execution context. | |
51 | + | |
52 | +The tension between the provided level of concurrency and resource | |
53 | +usage also forced its users to make unnecessary tradeoffs like libata | |
54 | +choosing to use ST wq for polling PIOs and accepting an unnecessary | |
55 | +limitation that no two polling PIOs can progress at the same time. As | |
56 | +MT wq don't provide much better concurrency, users which require a | |
57 | +higher level of concurrency, like async or fscache, had to implement | |
58 | +their own thread pools. | |
59 | + | |
60 | +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with | |
61 | +a focus on the following goals. | |
62 | + | |
63 | +* Maintain compatibility with the original workqueue API. | |
64 | + | |
65 | +* Use per-CPU unified worker pools shared by all wq to provide | |
66 | + flexible level of concurrency on demand without wasting a lot of | |
67 | + resource. | |
68 | + | |
69 | +* Automatically regulate worker pool and level of concurrency so that | |
70 | + the API users don't need to worry about such details. | |
71 | + | |
72 | + | |
73 | +3. The Design | |
74 | + | |
75 | +In order to ease the asynchronous execution of functions, a new | |
76 | +abstraction, the work item, is introduced. | |
77 | + | |
78 | +A work item is a simple struct that holds a pointer to the function | |
79 | +that is to be executed asynchronously. Whenever a driver or subsystem | |
80 | +wants a function to be executed asynchronously it has to set up a work | |
81 | +item pointing to that function and queue that work item on a | |
82 | +workqueue. | |
83 | + | |
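A common pattern, sketched here with hypothetical names, is to embed
the work item in a larger driver-private structure and recover that
structure with container_of() inside the work function:

  #include <linux/kernel.h>
  #include <linux/workqueue.h>

  struct my_device {
          struct work_struct work;        /* the work item itself */
          int value;                      /* data the work function uses */
  };

  static void my_work_fn(struct work_struct *work)
  {
          struct my_device *dev = container_of(work, struct my_device, work);

          pr_info("processing value %d\n", dev->value);
  }

  static void my_device_kick(struct my_device *dev)
  {
          INIT_WORK(&dev->work, my_work_fn);
          queue_work(system_wq, &dev->work);      /* default system wq */
  }
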
84 | +Special purpose threads, called worker threads, execute the functions | |
85 | +off of the queue, one after the other. If no work is queued, the | |
86 | +worker threads become idle. These worker threads are managed in | |
87 | +so-called thread-pools. | |
88 | + | |
89 | +The cmwq design differentiates between the user-facing workqueues that | |
90 | +subsystems and drivers queue work items on and the backend mechanism | |
91 | +which manages the thread-pools and processes the queued work items. | |
92 | + | |
93 | +The backend is called gcwq. There is one gcwq for each possible CPU | |
94 | +and one gcwq to serve work items queued on unbound workqueues. | |
95 | + | |
96 | +Subsystems and drivers can create and queue work items through special | |
97 | +workqueue API functions as they see fit. They can influence some | |
98 | +aspects of the way the work items are executed by setting flags on the | |
99 | +workqueue they are putting the work item on. These flags include | |
100 | +things like CPU locality, reentrancy, concurrency limits and more. To | |
101 | +get a detailed overview refer to the API description of | |
102 | +alloc_workqueue() below. | |
103 | + | |
104 | +When a work item is queued to a workqueue, the target gcwq is | |
105 | +determined according to the queue parameters and workqueue attributes | |
106 | +and appended on the shared worklist of the gcwq. For example, unless | |
107 | +specifically overridden, a work item of a bound workqueue will be | |
108 | +queued on the worklist of exactly the gcwq associated with the CPU | |
109 | +the issuer is running on. | |
110 | + | |
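As a sketch (my_wq and dev are purely illustrative and assumed to
exist), the two queueing calls below are alternatives: the first
targets the local CPU's gcwq, the second overrides the target CPU:

  static void queueing_examples(struct workqueue_struct *my_wq,
                                struct my_device *dev)
  {
          /* queued on the gcwq of the CPU this code runs on */
          queue_work(my_wq, &dev->work);

          /* or explicitly target CPU 2's gcwq instead */
          queue_work_on(2, my_wq, &dev->work);
  }
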
111 | +For any worker pool implementation, managing the concurrency level | |
112 | +(how many execution contexts are active) is an important issue. cmwq | |
113 | +tries to keep the concurrency at a minimal but sufficient level. | |
114 | +Minimal to save resources and sufficient in that the system is used at | |
115 | +its full capacity. | |
116 | + | |
117 | +Each gcwq bound to an actual CPU implements concurrency management by | |
118 | +hooking into the scheduler. The gcwq is notified whenever an active | |
119 | +worker wakes up or sleeps and keeps track of the number of | |
120 | +currently runnable workers. Generally, work items are not expected to | |
121 | +hog a CPU and consume many cycles. That means maintaining just enough | |
122 | +concurrency to prevent work processing from stalling should be | |
123 | +optimal. As long as there are one or more runnable workers on the | |
124 | +CPU, the gcwq doesn't start execution of a new work item, but when the | |
125 | +last running worker goes to sleep, it immediately schedules a new | |
126 | +worker so that the CPU doesn't sit idle while there are pending work | |
127 | +items. This allows using a minimal number of workers without losing | |
128 | +execution bandwidth. | |
129 | + | |
130 | +Keeping idle workers around doesn't cost anything other than the | |
131 | +memory space for kthreads, so cmwq holds onto idle ones for a while | |
132 | +before killing them. | |
133 | + | |
134 | +For an unbound wq, the above concurrency management doesn't apply and | |
135 | +the gcwq for the pseudo unbound CPU tries to start executing all work | |
136 | +items as soon as possible. The responsibility of regulating | |
137 | +concurrency level is on the users. There is also a flag to mark a | |
138 | +bound wq to ignore the concurrency management. Please refer to the | |
139 | +API section for details. | |
140 | + | |
141 | +The forward progress guarantee relies on workers being created when | |
142 | +more execution contexts are necessary, which in turn is guaranteed | |
143 | +through the use of rescue workers. All work items which might be used | |
144 | +on code paths that handle memory reclaim are required to be queued on | |
145 | +wq's that have a rescue-worker reserved for execution under memory | |
146 | +pressure. Otherwise it is possible for the thread-pool to deadlock | |
147 | +waiting for execution contexts to free up. | |
148 | + | |
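As an example (the names are hypothetical), a wq used on a memory
reclaim path could be allocated with a reserved rescuer like this:

  #include <linux/errno.h>
  #include <linux/workqueue.h>

  static struct workqueue_struct *reclaim_wq;

  static int reclaim_wq_init(void)
  {
          /* WQ_RESCUER reserves one rescuer thread for this wq */
          reclaim_wq = alloc_workqueue("my_reclaim", WQ_RESCUER, 1);
          if (!reclaim_wq)
                  return -ENOMEM;
          return 0;
  }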
149 | + | |
150 | +4. Application Programming Interface (API) | |
151 | + | |
152 | +alloc_workqueue() allocates a wq. The original create_*workqueue() | |
153 | +functions are deprecated and scheduled for removal. alloc_workqueue() | |
154 | +takes three arguments - @name, @flags and @max_active. @name is the | |
155 | +name of the wq and is also used as the name of the rescuer thread if | |
156 | +there is one. | |
157 | + | |
158 | +A wq no longer manages execution resources but serves as a domain for | |
159 | +forward progress guarantee, flush and work item attributes. @flags | |
160 | +and @max_active control how work items are assigned execution | |
161 | +resources, scheduled and executed. | |
162 | + | |
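Putting the arguments together, a typical allocation and teardown
could look like the sketch below (my_subsys_init() and some_work are
hypothetical; some_work is assumed to be an already initialized work
item):

  static int my_subsys_init(void)
  {
          struct workqueue_struct *wq;

          wq = alloc_workqueue("my_wq", 0, 0);    /* default flags/max_active */
          if (!wq)
                  return -ENOMEM;

          queue_work(wq, &some_work);

          flush_workqueue(wq);            /* wait for queued items to finish */
          destroy_workqueue(wq);
          return 0;
  }
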
163 | +@flags: | |
164 | + | |
165 | + WQ_NON_REENTRANT | |
166 | + | |
167 | + By default, a wq guarantees non-reentrance only on the same | |
168 | + CPU. A work item may not be executed concurrently on the same | |
169 | + CPU by multiple workers but is allowed to be executed | |
170 | + concurrently on multiple CPUs. This flag makes sure | |
171 | + non-reentrance is enforced across all CPUs. Work items queued | |
172 | + to a non-reentrant wq are guaranteed to be executed by at most | |
173 | + one worker system-wide at any given time. | |
174 | + | |
175 | + WQ_UNBOUND | |
176 | + | |
177 | + Work items queued to an unbound wq are served by a special | |
178 | + gcwq which hosts workers which are not bound to any specific | |
179 | + CPU. This makes the wq behave as a simple execution context | |
180 | + provider without concurrency management. The unbound gcwq | |
181 | + tries to start execution of work items as soon as possible. | |
182 | + An unbound wq sacrifices locality but is useful for the following | |
183 | + cases. | |
184 | + | |
185 | + * Wide fluctuation in the concurrency level requirement is | |
186 | + expected and using a bound wq may end up creating a large number | |
187 | + of mostly unused workers across different CPUs as the issuer | |
188 | + hops through different CPUs. | |
189 | + | |
190 | + * Long running CPU intensive workloads which can be better | |
191 | + managed by the system scheduler. | |
192 | + | |
193 | + WQ_FREEZEABLE | |
194 | + | |
195 | + A freezeable wq participates in the freeze phase of the system | |
196 | + suspend operations. Work items on the wq are drained and no | |
197 | + new work item starts execution until thawed. | |
198 | + | |
199 | + WQ_RESCUER | |
200 | + | |
201 | + All wq which might be used in the memory reclaim paths _MUST_ | |
202 | + have this flag set. This reserves one worker exclusively for | |
203 | + the execution of this wq under memory pressure. | |
204 | + | |
205 | + WQ_HIGHPRI | |
206 | + | |
207 | + Work items of a highpri wq are queued at the head of the | |
208 | + worklist of the target gcwq and start execution regardless of | |
209 | + the current concurrency level. In other words, highpri work | |
210 | + items will always start execution as soon as execution | |
211 | + resource is available. | |
212 | + | |
213 | + Ordering among highpri work items is preserved - a highpri | |
214 | + work item queued after another highpri work item will start | |
215 | + execution after the earlier highpri work item starts. | |
216 | + | |
217 | + Although highpri work items are not held back by other | |
218 | + runnable work items, they still contribute to the concurrency | |
219 | + level. Highpri work items in runnable state will prevent | |
220 | + non-highpri work items from starting execution. | |
221 | + | |
222 | + This flag is meaningless for unbound wq. | |
223 | + | |
224 | + WQ_CPU_INTENSIVE | |
225 | + | |
226 | + Work items of a CPU intensive wq do not contribute to the | |
227 | + concurrency level. In other words, runnable CPU intensive | |
228 | + work items will not prevent other work items from starting | |
229 | + execution. This is useful for bound work items which are | |
230 | + expected to hog CPU cycles so that their execution is | |
231 | + regulated by the system scheduler. | |
232 | + | |
233 | + Although CPU intensive work items don't contribute to the | |
234 | + concurrency level, the start of their execution is still | |
235 | + regulated by the concurrency management and runnable | |
236 | + non-CPU-intensive work items can delay execution of CPU | |
237 | + intensive work items. | |
238 | + | |
239 | + This flag is meaningless for unbound wq. | |
240 | + | |
241 | + WQ_HIGHPRI | WQ_CPU_INTENSIVE | |
242 | + | |
243 | + This combination makes the wq avoid interaction with | |
244 | + concurrency management completely and behave as a simple | |
245 | + per-CPU execution context provider. Work items queued on a | |
246 | + highpri CPU-intensive wq start execution as soon as resources | |
247 | + are available and don't affect execution of other work items. | |
248 | + | |
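The flags above might translate into allocations along these lines
(all names and flag choices here are illustrative only):

  static struct workqueue_struct *wq_nr, *wq_reclaim, *wq_raw;

  static void flag_examples(void)
  {
          /* non-reentrant across all CPUs */
          wq_nr = alloc_workqueue("nonreentrant", WQ_NON_REENTRANT, 0);

          /* used during memory reclaim: reserve a rescuer */
          wq_reclaim = alloc_workqueue("reclaim", WQ_RESCUER, 1);

          /* plain per-CPU execution contexts, no concurrency management */
          wq_raw = alloc_workqueue("raw", WQ_HIGHPRI | WQ_CPU_INTENSIVE, 0);
  }
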
249 | +@max_active: | |
250 | + | |
251 | +@max_active determines the maximum number of execution contexts per | |
252 | +CPU which can be assigned to the work items of a wq. For example, | |
253 | +with @max_active of 16, at most 16 work items of the wq can be | |
254 | +executing at the same time per CPU. | |
255 | + | |
256 | +Currently, for a bound wq, the maximum limit for @max_active is 512 | |
257 | +and the default value used when 0 is specified is 256. For an unbound | |
258 | +wq, the limit is the higher of 512 and 4 * num_possible_cpus(). These | |
259 | +values are chosen sufficiently high such that they are not the | |
260 | +limiting factor while providing protection in runaway cases. | |
261 | + | |
262 | +The number of active work items of a wq is usually regulated by the | |
263 | +users of the wq, more specifically, by how many work items the users | |
264 | +may queue at the same time. Unless there is a specific need for | |
265 | +throttling the number of active work items, specifying '0' is | |
266 | +recommended. | |
267 | + | |
268 | +Some users depend on the strict execution ordering of ST wq. The | |
269 | +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this | |
270 | +behavior. Work items on such wq are always queued to the unbound gcwq | |
271 | +and only one work item can be active at any given time thus achieving | |
272 | +the same ordering property as ST wq. | |
273 | + | |
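For instance, the strictly ordered behavior described above could be
requested as in this sketch (the wq name is arbitrary):

  static struct workqueue_struct *ordered_wq;

  static int ordered_wq_init(void)
  {
          /* at most one active work item, executed in queueing order */
          ordered_wq = alloc_workqueue("my_ordered", WQ_UNBOUND, 1);
          return ordered_wq ? 0 : -ENOMEM;
  }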
274 | + | |
275 | +5. Example Execution Scenarios | |
276 | + | |
277 | +The following example execution scenarios try to illustrate how cmwq | |
278 | +behaves under different configurations. | |
279 | + | |
280 | + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. | |
281 | + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms | |
282 | + again before finishing. w1 and w2 burn CPU for 5ms then sleep for | |
283 | + 10ms. | |
284 | + | |
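Purely as an illustration, w0 could be implemented roughly as below;
mdelay() busy-waits and therefore burns CPU while msleep() sleeps:

  #include <linux/delay.h>
  #include <linux/workqueue.h>

  static void w0_fn(struct work_struct *work)
  {
          mdelay(5);      /* burn CPU for 5ms */
          msleep(10);     /* sleep for 10ms */
          mdelay(5);      /* burn CPU for 5ms again, then finish */
  }
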
285 | +Ignoring all other tasks, works and processing overhead, and assuming | |
286 | +simple FIFO scheduling, the following is one highly simplified version | |
287 | +of possible sequences of events with the original wq. | |
288 | + | |
289 | + TIME IN MSECS EVENT | |
290 | + 0 w0 starts and burns CPU | |
291 | + 5 w0 sleeps | |
292 | + 15 w0 wakes up and burns CPU | |
293 | + 20 w0 finishes | |
294 | + 20 w1 starts and burns CPU | |
295 | + 25 w1 sleeps | |
296 | + 35 w1 wakes up and finishes | |
297 | + 35 w2 starts and burns CPU | |
298 | + 40 w2 sleeps | |
299 | + 50 w2 wakes up and finishes | |
300 | + | |
301 | +And with cmwq with @max_active >= 3, | |
302 | + | |
303 | + TIME IN MSECS EVENT | |
304 | + 0 w0 starts and burns CPU | |
305 | + 5 w0 sleeps | |
306 | + 5 w1 starts and burns CPU | |
307 | + 10 w1 sleeps | |
308 | + 10 w2 starts and burns CPU | |
309 | + 15 w2 sleeps | |
310 | + 15 w0 wakes up and burns CPU | |
311 | + 20 w0 finishes | |
312 | + 20 w1 wakes up and finishes | |
313 | + 25 w2 wakes up and finishes | |
314 | + | |
315 | +If @max_active == 2, | |
316 | + | |
317 | + TIME IN MSECS EVENT | |
318 | + 0 w0 starts and burns CPU | |
319 | + 5 w0 sleeps | |
320 | + 5 w1 starts and burns CPU | |
321 | + 10 w1 sleeps | |
322 | + 15 w0 wakes up and burns CPU | |
323 | + 20 w0 finishes | |
324 | + 20 w1 wakes up and finishes | |
325 | + 20 w2 starts and burns CPU | |
326 | + 25 w2 sleeps | |
327 | + 35 w2 wakes up and finishes | |
328 | + | |
329 | +Now, let's assume w1 and w2 are queued to a different wq q1 which has | |
330 | +WQ_HIGHPRI set, | |
331 | + | |
332 | + TIME IN MSECS EVENT | |
333 | + 0 w1 and w2 start and burn CPU | |
334 | + 5 w1 sleeps | |
335 | + 10 w2 sleeps | |
336 | + 10 w0 starts and burns CPU | |
337 | + 15 w0 sleeps | |
338 | + 15 w1 wakes up and finishes | |
339 | + 20 w2 wakes up and finishes | |
340 | + 25 w0 wakes up and burns CPU | |
341 | + 30 w0 finishes | |
342 | + | |
343 | +If q1 has WQ_CPU_INTENSIVE set, | |
344 | + | |
345 | + TIME IN MSECS EVENT | |
346 | + 0 w0 starts and burns CPU | |
347 | + 5 w0 sleeps | |
348 | + 5 w1 and w2 start and burn CPU | |
349 | + 10 w1 sleeps | |
350 | + 15 w2 sleeps | |
351 | + 15 w0 wakes up and burns CPU | |
352 | + 20 w0 finishes | |
353 | + 20 w1 wakes up and finishes | |
354 | + 25 w2 wakes up and finishes | |
355 | + | |
356 | + | |
357 | +6. Guidelines | |
358 | + | |
359 | +* Do not forget to use WQ_RESCUER if a wq may process work items which | |
360 | + are used during memory reclaim. Each wq with WQ_RESCUER set has one | |
361 | + rescuer thread reserved for it. If there is a dependency among | |
362 | + multiple work items used during memory reclaim, they should be | |
363 | + queued to separate wq's, each with WQ_RESCUER. | |
364 | + | |
365 | +* Unless strict ordering is required, there is no need to use ST wq. | |
366 | + | |
367 | +* Unless there is a specific need, using 0 for @max_active is | |
368 | + recommended. In most use cases, concurrency level usually stays | |
369 | + well under the default limit. | |
370 | + | |
371 | +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), | |
372 | + flush and work item attributes. Work items which are not involved | |
373 | + in memory reclaim and don't need to be flushed as a part of a group | |
374 | + of work items, and don't require any special attribute, can use one | |
375 | + of the system wq. There is no difference in execution | |
376 | + characteristics between using a dedicated wq and a system wq. | |
377 | + | |
378 | +* Unless work items are expected to consume a huge amount of CPU | |
379 | + cycles, using a bound wq is usually beneficial due to the increased | |
380 | + level of locality in wq operations and work item execution. |
include/linux/workqueue.h
... | ... | @@ -235,6 +235,10 @@ |
235 | 235 | #define work_clear_pending(work) \ |
236 | 236 | clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) |
237 | 237 | |
238 | +/* | |
239 | + * Workqueue flags and constants. For details, please refer to | |
240 | + * Documentation/workqueue.txt. | |
241 | + */ | |
238 | 242 | enum { |
239 | 243 | WQ_NON_REENTRANT = 1 << 0, /* guarantee non-reentrance */ |
240 | 244 | WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ |
kernel/workqueue.c
1 | 1 | /* |
2 | - * linux/kernel/workqueue.c | |
2 | + * kernel/workqueue.c - generic async execution with shared worker pool | |
3 | 3 | * |
4 | - * Generic mechanism for defining kernel helper threads for running | |
5 | - * arbitrary tasks in process context. | |
4 | + * Copyright (C) 2002 Ingo Molnar | |
6 | 5 | * |
7 | - * Started by Ingo Molnar, Copyright (C) 2002 | |
6 | + * Derived from the taskqueue/keventd code by: | |
7 | + * David Woodhouse <dwmw2@infradead.org> | |
8 | + * Andrew Morton | |
9 | + * Kai Petzke <wpp@marie.physik.tu-berlin.de> | |
10 | + * Theodore Ts'o <tytso@mit.edu> | |
8 | 11 | * |
9 | - * Derived from the taskqueue/keventd code by: | |
12 | + * Made to use alloc_percpu by Christoph Lameter. | |
10 | 13 | * |
11 | - * David Woodhouse <dwmw2@infradead.org> | |
12 | - * Andrew Morton | |
13 | - * Kai Petzke <wpp@marie.physik.tu-berlin.de> | |
14 | - * Theodore Ts'o <tytso@mit.edu> | |
14 | + * Copyright (C) 2010 SUSE Linux Products GmbH | |
15 | + * Copyright (C) 2010 Tejun Heo <tj@kernel.org> | |
15 | 16 | * |
16 | - * Made to use alloc_percpu by Christoph Lameter. | |
17 | + * This is the generic async execution mechanism. Work items as are | |
18 | + * executed in process context. The worker pool is shared and | |
19 | + * automatically managed. There is one worker pool for each CPU and | |
20 | + * one extra for works which are better served by workers which are | |
21 | + * not bound to any specific CPU. | |
22 | + * | |
23 | + * Please read Documentation/workqueue.txt for details. | |
17 | 24 | */ |
18 | 25 | |
19 | 26 | #include <linux/module.h> |