Commit 1df2d017fe9d22a49bad157b4f5aa19212f29557
Exists in master and in 39 other branches
Merge branch 'docs-next' of git://git.lwn.net/linux-2.6
* 'docs-next' of git://git.lwn.net/linux-2.6:
  Fix a typo in the development process document.
  Document handling of bad memory
  Document RCU and unloadable modules
Showing 4 changed files
Documentation/RCU/00-INDEX
Documentation/RCU/rcubarrier.txt
RCU and Unloadable Modules

[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]

RCU (read-copy update) is a synchronization mechanism that can be thought
of as a replacement for reader-writer locking (among other things), but with
very low-overhead readers that are immune to deadlock, priority inversion,
and unbounded latency. RCU read-side critical sections are delimited
by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT
kernels, generate no code whatsoever.

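For example, code to scan an RCU-protected linked list from the read side
might look as follows. This is only a sketch: the element pointer p, its
->list field, the my_head list, and do_something_with() are hypothetical
names used purely for illustration.

    rcu_read_lock();
    list_for_each_entry_rcu(p, &my_head, list)
        do_something_with(p);   /* readers must not block here */
    rcu_read_unlock();
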
This means that RCU writers are unaware of the presence of concurrent
readers, so that RCU updates to shared data must be undertaken quite
carefully, leaving an old version of the data structure in place until all
pre-existing readers have finished. These old versions are needed because
such readers might hold a reference to them. RCU updates can therefore be
rather expensive, and RCU is thus best suited for read-mostly situations.

How can an RCU writer possibly determine when all readers are finished,
given that readers might well leave absolutely no trace of their
presence? There is a synchronize_rcu() primitive that blocks until all
pre-existing readers have completed. An updater wishing to delete an
element p from a linked list might do the following, while holding an
appropriate lock, of course:

    list_del_rcu(p);
    synchronize_rcu();
    kfree(p);

But the above code cannot be used in IRQ context -- the call_rcu()
primitive must be used instead. This primitive takes a pointer to an
rcu_head struct placed within the RCU-protected data structure and
another pointer to a function that may be invoked later to free that
structure. Code to delete an element p from the linked list from IRQ
context might then be as follows:

    list_del_rcu(p);
    call_rcu(&p->rcu, p_callback);

Since call_rcu() never blocks, this code can safely be used from within
IRQ context. The function p_callback() might be defined as follows:

    static void p_callback(struct rcu_head *rp)
    {
        struct pstruct *p = container_of(rp, struct pstruct, rcu);

        kfree(p);
    }

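The pstruct type itself is not shown in this document. A minimal sketch of
what it might look like, with the rcu_head embedded so that container_of()
can recover the enclosing structure, is as follows (the list and data
fields are assumptions made for illustration; only the rcu field is
implied by the examples above):

    struct pstruct {
        struct list_head list;  /* linkage used by list_del_rcu() */
        int data;               /* whatever payload the element carries */
        struct rcu_head rcu;    /* passed to call_rcu(&p->rcu, p_callback) */
    };
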

Unloading Modules That Use call_rcu()

But what if p_callback is defined in an unloadable module?

If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
disappointed when they are later invoked, as fancifully depicted at
http://lwn.net/images/ns/kernel/rcu-drop.jpg.

We could try placing a synchronize_rcu() in the module-exit code path,
but this is not sufficient. Although synchronize_rcu() does wait for a
grace period to elapse, it does not wait for the callbacks to complete.

One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
heavy RCU-callback load, then some of the callbacks might be deferred
in order to allow other processing to proceed. Such deferral is required
in realtime kernels in order to avoid excessive scheduling latencies.


rcu_barrier()

We instead need the rcu_barrier() primitive. This primitive is similar
to synchronize_rcu(), but instead of waiting solely for a grace
period to elapse, it also waits for all outstanding RCU callbacks to
complete. Pseudo-code using rcu_barrier() is as follows:

1. Prevent any new RCU callbacks from being posted.
2. Execute rcu_barrier().
3. Allow the module to be unloaded.

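In code, a module exit handler following these three steps might look
something like the following sketch. The module name is hypothetical, and
my_stop_posting_callbacks() stands in for whatever module-specific work
step 1 actually requires:

    static void __exit my_module_exit(void)
    {
        my_stop_posting_callbacks(); /* step 1: no new call_rcu() invocations */
        rcu_barrier();               /* step 2: wait for already-posted callbacks */
    }                                /* step 3: returning permits the unload */
    module_exit(my_module_exit);
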
Quick Quiz #1: Why is there no srcu_barrier()?

The rcutorture module makes use of rcu_barrier() in its exit function
as follows:

 1 static void
 2 rcu_torture_cleanup(void)
 3 {
 4   int i;
 5
 6   fullstop = 1;
 7   if (shuffler_task != NULL) {
 8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
 9     kthread_stop(shuffler_task);
10   }
11   shuffler_task = NULL;
12
13   if (writer_task != NULL) {
14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
15     kthread_stop(writer_task);
16   }
17   writer_task = NULL;
18
19   if (reader_tasks != NULL) {
20     for (i = 0; i < nrealreaders; i++) {
21       if (reader_tasks[i] != NULL) {
22         VERBOSE_PRINTK_STRING(
23                 "Stopping rcu_torture_reader task");
24         kthread_stop(reader_tasks[i]);
25       }
26       reader_tasks[i] = NULL;
27     }
28     kfree(reader_tasks);
29     reader_tasks = NULL;
30   }
31   rcu_torture_current = NULL;
32
33   if (fakewriter_tasks != NULL) {
34     for (i = 0; i < nfakewriters; i++) {
35       if (fakewriter_tasks[i] != NULL) {
36         VERBOSE_PRINTK_STRING(
37                 "Stopping rcu_torture_fakewriter task");
38         kthread_stop(fakewriter_tasks[i]);
39       }
40       fakewriter_tasks[i] = NULL;
41     }
42     kfree(fakewriter_tasks);
43     fakewriter_tasks = NULL;
44   }
45
46   if (stats_task != NULL) {
47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
48     kthread_stop(stats_task);
49   }
50   stats_task = NULL;
51
52   /* Wait for all RCU callbacks to fire. */
53   rcu_barrier();
54
55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
56
57   if (cur_ops->cleanup != NULL)
58     cur_ops->cleanup();
59   if (atomic_read(&n_rcu_torture_error))
60     rcu_torture_print_module_parms("End of test: FAILURE");
61   else
62     rcu_torture_print_module_parms("End of test: SUCCESS");
63 }

Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves. This will not be necessary in most cases, since
RCU callbacks rarely include calls to call_rcu(). However, the rcutorture
module is an exception to this rule, and therefore needs to set this
global variable.
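
To see why, consider a self-reposting callback of the kind rcutorture
uses, sketched roughly as follows (this is an illustration, not the
actual rcutorture code; do_per_grace_period_work() is hypothetical):

    static void my_callback(struct rcu_head *rhp)
    {
        do_per_grace_period_work();
        if (!fullstop)
            call_rcu(rhp, my_callback); /* re-post unless shutting down */
    }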

Lines 7-50 stop all the kernel tasks associated with the rcutorture
module. Therefore, once execution reaches line 53, no more rcutorture
RCU callbacks will be posted. The rcu_barrier() call on line 53 waits
for any pre-existing callbacks to complete.

Then lines 55-62 print status and do operation-specific cleanup, and
then return, permitting the module-unload operation to be completed.

Quick Quiz #2: Is there any other situation where rcu_barrier() might
        be required?

Your module might have additional complications. For example, if your
module invokes call_rcu() from timers, you will need to first cancel all
the timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.

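For instance, such an exit path might be sketched as follows (my_timer
and my_timer_exit() are hypothetical names; the timer handler is assumed
to be what posts the RCU callbacks):

    static void __exit my_timer_exit(void)
    {
        del_timer_sync(&my_timer); /* no more timer handlers, so no new call_rcu() */
        rcu_barrier();             /* wait for callbacks the handlers already posted */
    }
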

Implementing rcu_barrier()

Dipankar Sarma's implementation of rcu_barrier() makes use of the fact
that RCU callbacks are never reordered once queued on one of the per-CPU
queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point, all earlier RCU callbacks are guaranteed to have completed.

The original code for rcu_barrier() was as follows:

 1 void rcu_barrier(void)
 2 {
 3   BUG_ON(in_interrupt());
 4   /* Take cpucontrol mutex to protect against CPU hotplug */
 5   mutex_lock(&rcu_barrier_mutex);
 6   init_completion(&rcu_barrier_completion);
 7   atomic_set(&rcu_barrier_cpu_count, 0);
 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
 9   wait_for_completion(&rcu_barrier_completion);
10   mutex_unlock(&rcu_barrier_mutex);
11 }

Line 3 verifies that the caller is in process context, and lines 5 and 10
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
before on_each_cpu() returns. Line 9 then waits for the completion.

This code was rewritten in 2008 to support rcu_barrier_bh() and
rcu_barrier_sched() in addition to the original rcu_barrier().

The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows:

 1 static void rcu_barrier_func(void *notused)
 2 {
 3   int cpu = smp_processor_id();
 4   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 5   struct rcu_head *head;
 6
 7   head = &rdp->barrier;
 8   atomic_inc(&rcu_barrier_cpu_count);
 9   call_rcu(head, rcu_barrier_callback);
10 }

Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head that is needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
8 increments a global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.

The rcu_barrier_callback() function simply atomically decrements the
rcu_barrier_cpu_count variable and finalizes the completion when it
reaches zero, as follows:

 1 static void rcu_barrier_callback(struct rcu_head *notused)
 2 {
 3   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
 4     complete(&rcu_barrier_completion);
 5 }

Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
        are delayed for a full grace period? Couldn't this result in
        rcu_barrier() returning prematurely?


rcu_barrier() Summary

The rcu_barrier() primitive has seen relatively little use, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.


Answers to Quick Quizzes

Quick Quiz #1: Why is there no srcu_barrier()?

Answer: Since there is no call_srcu(), there can be no outstanding SRCU
        callbacks. Therefore, there is no need to wait for them.

Quick Quiz #2: Is there any other situation where rcu_barrier() might
        be required?

Answer: Interestingly enough, rcu_barrier() was not originally
        implemented for module unloading. Nikita Danilov was using
        RCU in a filesystem, which resulted in a similar situation at
        filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
        in response, so that Nikita could invoke it during the
        filesystem-unmount process.

        Much later, yours truly hit the RCU module-unload problem when
        implementing rcutorture, and found that rcu_barrier() solves
        this problem as well.

Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
        are delayed for a full grace period? Couldn't this result in
        rcu_barrier() returning prematurely?

Answer: This cannot happen. The reason is that on_each_cpu() has its last
        argument, the wait flag, set to "1". This flag is passed through
        to smp_call_function() and further to smp_call_function_on_cpu(),
        causing the latter to spin until the cross-CPU invocation of
        rcu_barrier_func() has completed. This by itself would prevent
        a grace period from completing on non-CONFIG_PREEMPT kernels,
        since each CPU must undergo a context switch (or other quiescent
        state) before the grace period can complete. However, this is
        of no use in CONFIG_PREEMPT kernels.

        Therefore, on_each_cpu() disables preemption across its call
        to smp_call_function() and also across the local call to
        rcu_barrier_func(). This prevents the local CPU from context
        switching, again preventing grace periods from completing. This
        means that all CPUs have executed rcu_barrier_func() before
        the first rcu_barrier_callback() can possibly execute, in turn
        preventing rcu_barrier_cpu_count from prematurely reaching zero.

        Currently, -rt implementations of RCU keep but a single global
        queue for RCU callbacks, and thus do not suffer from this
        problem. However, when the -rt RCU eventually does have per-CPU
        callback queues, things will have to change. One simple change
        is to add an rcu_read_lock() before line 8 of rcu_barrier()
        and an rcu_read_unlock() after line 8 of this same function. If
        you can think of a better change, please let me know!
Documentation/bad_memory.txt
March 2008
Jan-Simon Moeller, dl9pf@gmx.de


How to deal with bad memory, e.g. reported by memtest86+?
#########################################################

There are three possibilities I know of:

1) Reinsert/swap the memory modules

2) Buy new modules (best!) or try to exchange the memory
   if you have spare parts

3) Use BadRAM or memmap

This HOWTO is about number 3).


BadRAM
######
BadRAM is actively developed and available as a kernel patch
here: http://rick.vanrein.org/linux/badram/

For more details see the BadRAM documentation.

memmap
######

memmap is already in the kernel and usable as a kernel parameter at
boot time. Its syntax is slightly strange and you may need to
calculate the values yourself!

Syntax to exclude a memory area (see kernel-parameters.txt for details):
memmap=<size>$<address>

Example: memtest86+ reported errors at addresses 0x18691458, 0x18698424,
         and some others. All had 0x1869xxxx in common, so I chose a
         pattern of 0x18690000,0xffff0000.
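
In other words, the mask 0xffff0000 leaves the low 16 bits of the address
free, so the suspect region runs from 0x18690000 through 0x1869ffff. That
is 0x10000 bytes, or 64 KiB, starting at 0x18690000, which is where the
size and address below come from.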

With the numbers of the example above:
memmap=64K$0x18690000
  or
memmap=0x10000$0x18690000
Documentation/development-process/4.Coding
@@ -375,11 +375,11 @@
 justification is solid.
 
 When making an incompatible API change, one should, whenever possible,
-ensure that code which has not been updated is caught by the compiler.
+ensure that code which has not been updated is caught by the compiler.
 This will help you to be sure that you have found all in-tree uses of that
 interface. It will also alert developers of out-of-tree code that there is
 a change that they need to respond to. Supporting out-of-tree code is not
 something that kernel developers need to be worried about, but we also do
-not have to make life harder for out-of-tree developers than it it needs to
-be.
+not have to make life harder for out-of-tree developers than it needs to
+be.