Commit e594c8de3bd4e7732ed3340fb01e18ec94b12df2
1 parent 456db8cc45
Exists in master and in 4 other branches
kmemcheck: add the kmemcheck documentation
Thanks to Sitsofe Wheeler, Randy Dunlap, and Jonathan Corbet for providing input and feedback on this!
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Showing 1 changed file with 773 additions and 0 deletions
Documentation/kmemcheck.txt
1 | +GETTING STARTED WITH KMEMCHECK | |
2 | +============================== | |
3 | + | |
4 | +Vegard Nossum <vegardno@ifi.uio.no> | |
5 | + | |
6 | + | |
7 | +Contents | |
8 | +======== | |
9 | +0. Introduction | |
10 | +1. Downloading | |
11 | +2. Configuring and compiling | |
12 | +3. How to use | |
13 | +3.1. Booting | |
14 | +3.2. Run-time enable/disable | |
15 | +3.3. Debugging | |
16 | +3.4. Annotating false positives | |
17 | +4. Reporting errors | |
18 | +5. Technical description | |
19 | + | |
20 | + | |
21 | +0. Introduction | |
22 | +=============== | |
23 | + | |
24 | +kmemcheck is a debugging feature for the Linux Kernel. More specifically, it | |
25 | +is a dynamic checker that detects and warns about some uses of uninitialized | |
26 | +memory. | |
27 | + | |
28 | +Userspace programmers might be familiar with Valgrind's memcheck. The main | |
29 | +difference between memcheck and kmemcheck is that memcheck works for userspace | |
30 | +programs only, and kmemcheck works for the kernel only. The implementations | |
31 | +are of course vastly different. Because of this, kmemcheck is not as accurate | |
32 | +as memcheck, but it turns out to be good enough in practice to discover real | |
33 | +programmer errors that the compiler is not able to find through static | |
34 | +analysis. | |
35 | + | |
36 | +Enabling kmemcheck on a kernel will probably slow it down to the extent that | |
37 | +the machine will not be usable for normal workloads such as an | |
38 | +interactive desktop. kmemcheck will also cause the kernel to use about twice | |
39 | +as much memory as normal. For this reason, kmemcheck is strictly a debugging | |
40 | +feature. | |
41 | + | |
42 | + | |
43 | +1. Downloading | |
44 | +============== | |
45 | + | |
46 | +kmemcheck can only be downloaded using git. If you want to write patches | |
47 | +against the current code, you should use the kmemcheck development branch of | |
48 | +the tip tree. It is also possible to use the linux-next tree, which also | |
49 | +includes the latest version of kmemcheck. | |
50 | + | |
51 | +Assuming that you've already cloned the linux-2.6.git repository, all you | |
52 | +have to do is add the -tip tree as a remote, like this: | |
53 | + | |
54 | + $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git | |
55 | + | |
56 | +To actually download the tree, fetch the remote: | |
57 | + | |
58 | + $ git fetch tip | |
59 | + | |
60 | +And to check out a new local branch with the kmemcheck code: | |
61 | + | |
62 | + $ git checkout -b kmemcheck tip/kmemcheck | |
63 | + | |
64 | +General instructions for the -tip tree can be found here: | |
65 | +http://people.redhat.com/mingo/tip.git/readme.txt | |
66 | + | |
67 | + | |
68 | +2. Configuring and compiling | |
69 | +============================ | |
70 | + | |
71 | +kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of | |
72 | +configuration variables must have specific settings in order for the kmemcheck | |
73 | +menu to even appear in "menuconfig". These are: | |
74 | + | |
75 | + o CONFIG_CC_OPTIMIZE_FOR_SIZE=n | |
76 | + | |
77 | + This option is located under "General setup" / "Optimize for size". | |
78 | + | |
79 | + Without this, gcc will use certain optimizations that usually lead to | |
80 | + false positive warnings from kmemcheck. An example of this is a 16-bit | |
81 | + field in a struct, where gcc may load 32 bits, then discard the upper | |
82 | + 16 bits. kmemcheck sees only the 32-bit load, and may trigger a | |
83 | + warning for the upper 16 bits (if they're uninitialized). | |
84 | + | |
85 | + o CONFIG_SLAB=y or CONFIG_SLUB=y | |
86 | + | |
87 | + This option is located under "General setup" / "Choose SLAB | |
88 | + allocator". | |
89 | + | |
90 | + o CONFIG_FUNCTION_TRACER=n | |
91 | + | |
92 | + This option is located under "Kernel hacking" / "Tracers" / "Kernel | |
93 | + Function Tracer". | |
94 | + | |
95 | + When function tracing is compiled in, gcc emits a call to another | |
96 | + function at the beginning of every function. This means that when the | |
97 | + page fault handler is called, the ftrace framework will be called | |
98 | + before kmemcheck has had a chance to handle the fault. If ftrace then | |
99 | + modifies memory that was tracked by kmemcheck, the result is an | |
100 | + endless recursive page fault. | |
101 | + | |
102 | + o CONFIG_DEBUG_PAGEALLOC=n | |
103 | + | |
104 | + This option is located under "Kernel hacking" / "Debug page memory | |
105 | + allocations". | |
106 | + | |
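The 16-bit field case described under CONFIG_CC_OPTIMIZE_FOR_SIZE above can be illustrated with a small userspace sketch. This is not kmemcheck or gcc output; `struct sample` and `load_id_wide()` are hypothetical names, and the code assumes a little-endian machine such as x86. It shows why the result of a wide load can be correct even when the upper 16 bits were never initialized, while a byte-granular checker would still see a 32-bit read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: "id" is set by the caller, "spare" may not be. */
struct sample {
	uint16_t id;
	uint16_t spare;
};

/*
 * Mimic the wide load: fetch all 32 bits of the struct, then keep only
 * the 16 bits that belong to "id". Assumes little-endian byte order,
 * where "id" occupies the low half of the loaded word.
 */
static uint16_t load_id_wide(const struct sample *s)
{
	uint32_t word;

	assert(sizeof(*s) == sizeof(word));
	memcpy(&word, s, sizeof(word));		/* the 32-bit read a checker sees */
	return (uint16_t)(word & 0xffff);	/* upper 16 bits are thrown away */
}
```

Whatever value `spare` holds, `load_id_wide()` returns only `id`, which is why such reads tend to be false positives rather than real bugs.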
107 | +In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also | |
108 | +located under "Kernel hacking". With this, you will be able to get line number | |
109 | +information from the kmemcheck warnings, which is extremely valuable in | |
110 | +debugging a problem. This option is not mandatory, however, because it slows | |
111 | +down the compilation process and produces a much bigger kernel image. | |
112 | + | |
113 | +Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck: | |
114 | +trap use of uninitialized memory"). Here follows a description of the | |
115 | +kmemcheck configuration variables: | |
116 | + | |
117 | + o CONFIG_KMEMCHECK | |
118 | + | |
119 | + This must be enabled in order to use kmemcheck at all... | |
120 | + | |
121 | + o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT | |
122 | + | |
123 | + This option controls the status of kmemcheck at boot-time. "Enabled" | |
124 | + will enable kmemcheck right from the start, "disabled" will boot the | |
125 | + kernel as normal (but with the kmemcheck code compiled in, so it can | |
126 | + be enabled at run-time after the kernel has booted), and "one-shot" is | |
127 | + a special mode which will turn kmemcheck off automatically after | |
128 | + detecting the first use of uninitialized memory. | |
129 | + | |
130 | + If you are using kmemcheck to actively debug a problem, then you | |
131 | + probably want to choose "enabled" here. | |
132 | + | |
133 | + The one-shot mode is mostly useful in automated test setups because it | |
134 | + can prevent floods of warnings and increase the chances of the machine | |
135 | + surviving in case something is really wrong. In other cases, the one- | |
136 | + shot mode could actually be counter-productive because it would turn | |
137 | + itself off at the very first error -- in the case of a false positive | |
138 | + too -- and this would get in the way of debugging the specific | |
139 | + problem you were interested in. | |
140 | + | |
141 | + If you would like to use your kernel as normal, but with a chance to | |
142 | + enable kmemcheck in case of some problem, it might be a good idea to | |
143 | + choose "disabled" here. When kmemcheck is disabled, most of the run- | |
144 | + time overhead is not incurred, and the kernel will be almost as fast | |
145 | + as normal. | |
146 | + | |
147 | + o CONFIG_KMEMCHECK_QUEUE_SIZE | |
148 | + | |
149 | + Select the maximum number of error reports to store in an internal | |
150 | + (fixed-size) buffer. Since errors can occur virtually anywhere and in | |
151 | + any context, we need a temporary storage area which is guaranteed not | |
152 | + to generate any other page faults when accessed. The queue will be | |
153 | + emptied as soon as a tasklet may be scheduled. If the queue is full, | |
154 | + new error reports will be lost. | |
155 | + | |
156 | + The default value of 64 is probably fine. If some code produces more | |
157 | + than 64 errors within an irqs-off section, then the code is likely to | |
158 | + produce many, many more, too, and these additional reports seldom give | |
159 | + any more information (the first report is usually the most valuable | |
160 | + anyway). | |
161 | + | |
162 | + This number might have to be adjusted if you are not using serial | |
163 | + console or similar to capture the kernel log. If you are using the | |
164 | + "dmesg" command to save the log, then getting a lot of kmemcheck | |
165 | + warnings might overflow the kernel log itself, and the earlier reports | |
166 | + will get lost in that way instead. Try setting this to 10 or so on | |
167 | + such a setup. | |
168 | + | |
169 | + o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT | |
170 | + | |
171 | + Select the number of shadow bytes to save along with each entry of the | |
172 | + error-report queue. These bytes indicate what parts of an allocation | |
173 | + are initialized, uninitialized, etc. and will be displayed when an | |
174 | + error is detected to help the debugging of a particular problem. | |
175 | + | |
176 | + The number entered here is actually the logarithm of the number of | |
177 | + bytes that will be saved. So if you pick for example 5 here, kmemcheck | |
178 | + will save 2^5 = 32 bytes. | |
179 | + | |
180 | + The default value should be fine for debugging most problems. It also | |
181 | + fits nicely within 80 columns. | |
182 | + | |
183 | + o CONFIG_KMEMCHECK_PARTIAL_OK | |
184 | + | |
185 | + This option (when enabled) works around certain GCC optimizations that | |
186 | + produce 32-bit reads from 16-bit variables where the upper 16 bits are | |
187 | + thrown away afterwards. | |
188 | + | |
189 | + The default value (enabled) is recommended. This may of course hide | |
190 | + some real errors, but disabling it would probably produce a lot of | |
191 | + false positives. | |
192 | + | |
193 | + o CONFIG_KMEMCHECK_BITOPS_OK | |
194 | + | |
195 | + This option silences warnings that would be generated for bit-field | |
196 | + accesses where not all the bits are initialized at the same time. This | |
197 | + may also hide some real bugs. | |
198 | + | |
199 | + This option is probably obsolete, or it should be replaced with | |
200 | + the kmemcheck-/bitfield-annotations for the code in question. The | |
201 | + default value is therefore fine. | |
202 | + | |
203 | +Now compile the kernel as usual. | |
204 | + | |
205 | + | |
206 | +3. How to use | |
207 | +============= | |
208 | + | |
209 | +3.1. Booting | |
210 | +============ | |
211 | + | |
212 | +First some information about the command-line options. There is only one | |
213 | +option specific to kmemcheck, and this is called "kmemcheck". It can be used | |
214 | +to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT | |
215 | +option. Its possible settings are: | |
216 | + | |
217 | + o kmemcheck=0 (disabled) | |
218 | + o kmemcheck=1 (enabled) | |
219 | + o kmemcheck=2 (one-shot mode) | |
220 | + | |
221 | +If SLUB debugging has been enabled in the kernel, it may take precedence over | |
222 | +kmemcheck in such a way that the slab caches which are under SLUB debugging | |
223 | +will not be tracked by kmemcheck. In order to ensure that this doesn't happen | |
224 | +(even though it shouldn't by default), use SLUB's boot option "slub_debug", | |
225 | +like this: slub_debug=- | |
226 | + | |
227 | +In fact, this option may also be used for fine-grained control over SLUB vs. | |
228 | +kmemcheck. For example, if the command line includes "kmemcheck=1 | |
229 | +slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" | |
230 | +slab cache, and with kmemcheck tracking all the other caches. This is advanced | |
231 | +usage, however, and is not generally recommended. | |
232 | + | |
233 | + | |
234 | +3.2. Run-time enable/disable | |
235 | +============================ | |
236 | + | |
237 | +When the kernel has booted, it is possible to enable or disable kmemcheck at | |
238 | +run-time. WARNING: This feature is still experimental and may cause false | |
239 | +positive warnings to appear. Therefore, try not to use this. If you find that | |
240 | +it doesn't work properly (e.g. you see an unreasonable number of warnings), I | |
241 | +will be happy to take bug reports. | |
242 | + | |
243 | +Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: | |
244 | + | |
245 | + $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck | |
246 | + | |
247 | +The numbers are the same as for the kmemcheck= command-line option. | |
248 | + | |
249 | + | |
250 | +3.3. Debugging | |
251 | +============== | |
252 | + | |
253 | +A typical report will look something like this: | |
254 | + | |
255 | +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | |
256 | +80000000000000000000000000000000000000000088ffff0000000000000000 | |
257 | + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | |
258 | + ^ | |
259 | + | |
260 | +Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A | |
261 | +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | |
262 | +RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 | |
263 | +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | |
264 | +RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 | |
265 | +RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 | |
266 | +R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e | |
267 | +R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 | |
268 | +FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 | |
269 | +CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 | |
270 | +CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 | |
271 | +DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | |
272 | +DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 | |
273 | + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | |
274 | + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | |
275 | + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | |
276 | + [<ffffffff8100c7b5>] int_signal+0x12/0x17 | |
277 | + [<ffffffffffffffff>] 0xffffffffffffffff | |
278 | + | |
279 | +The single most valuable piece of information in this report is the RIP (or | |
280 | +EIP on 32-bit) value. This will help us pinpoint exactly which instruction | |
281 | +caused the warning. | |
282 | + | |
283 | +If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do | |
284 | +is give this address to the addr2line program, like this: | |
285 | + | |
286 | + $ addr2line -e vmlinux -i ffffffff8104ede8 | |
287 | + arch/x86/include/asm/string_64.h:12 | |
288 | + include/asm-generic/siginfo.h:287 | |
289 | + kernel/signal.c:380 | |
290 | + kernel/signal.c:410 | |
291 | + | |
292 | +The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must | |
293 | +be the vmlinux of the kernel that produced the warning in the first place! If | |
294 | +not, the line number information will almost certainly be wrong. | |
295 | + | |
296 | +The "-i" tells addr2line to also print the line numbers of inlined functions. | |
297 | +In this case, the flag was very important, because otherwise, it would only | |
298 | +have printed the first line, which is just a call to memcpy(), which could be | |
299 | +called from a thousand places in the kernel, and is therefore not very useful. | |
300 | +These inlined functions would not show up in the stack trace above, simply | |
301 | +because the kernel doesn't load the extra debugging information. This | |
302 | +technique can of course be used with ordinary kernel oopses as well. | |
303 | + | |
304 | +In this case, it's the caller of memcpy() that is interesting, and it can be | |
305 | +found in include/asm-generic/siginfo.h, line 287: | |
306 | + | |
307 | +281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) | |
308 | +282 { | |
309 | +283 if (from->si_code < 0) | |
310 | +284 memcpy(to, from, sizeof(*to)); | |
311 | +285 else | |
312 | +286 /* _sigchld is currently the largest know union member */ | |
313 | +287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); | |
314 | +288 } | |
315 | + | |
316 | +Since this was a read (kmemcheck usually warns about reads only, though it can | |
317 | +warn about writes to unallocated or freed memory as well), it was probably the | |
318 | +"from" argument which contained some uninitialized bytes. Following the chain | |
319 | +of calls, we move upwards to see where "from" was allocated or initialized, | |
320 | +kernel/signal.c, line 380: | |
321 | + | |
322 | +359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) | |
323 | +360 { | |
324 | +... | |
325 | +367 list_for_each_entry(q, &list->list, list) { | |
326 | +368 if (q->info.si_signo == sig) { | |
327 | +369 if (first) | |
328 | +370 goto still_pending; | |
329 | +371 first = q; | |
330 | +... | |
331 | +377 if (first) { | |
332 | +378 still_pending: | |
333 | +379 list_del_init(&first->list); | |
334 | +380 copy_siginfo(info, &first->info); | |
335 | +381 __sigqueue_free(first); | |
336 | +... | |
337 | +392 } | |
338 | +393 } | |
339 | + | |
340 | +Here, it is &first->info that is being passed on to copy_siginfo(). The | |
341 | +variable "first" was found on a list -- passed in as the second argument to | |
342 | +collect_signal(). We continue our journey through the stack, to figure out | |
343 | +where the item on "list" was allocated or initialized. We move to line 410: | |
344 | + | |
345 | +395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, | |
346 | +396 siginfo_t *info) | |
347 | +397 { | |
348 | +... | |
349 | +410 collect_signal(sig, pending, info); | |
350 | +... | |
351 | +414 } | |
352 | + | |
353 | +Now we need to follow the "pending" pointer, since that is being passed on to | |
354 | +collect_signal() as "list". At this point, we've run out of lines from the | |
355 | +"addr2line" output. Not to worry, we just paste the next addresses from the | |
356 | +kmemcheck stack dump, i.e.: | |
357 | + | |
358 | + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | |
359 | + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | |
360 | + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | |
361 | + [<ffffffff8100c7b5>] int_signal+0x12/0x17 | |
362 | + | |
363 | + $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ | |
364 | + ffffffff8100b87d ffffffff8100c7b5 | |
365 | + kernel/signal.c:446 | |
366 | + kernel/signal.c:1806 | |
367 | + arch/x86/kernel/signal.c:805 | |
368 | + arch/x86/kernel/signal.c:871 | |
369 | + arch/x86/kernel/entry_64.S:694 | |
370 | + | |
371 | +Remember that since these addresses were found on the stack and not as the | |
372 | +RIP value, they actually point to the _next_ instruction (they are return | |
373 | +addresses). This becomes obvious when we look at the code for line 446: | |
374 | + | |
375 | +422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) | |
376 | +423 { | |
377 | +... | |
378 | +431 signr = __dequeue_signal(&tsk->signal->shared_pending, | |
379 | +432 mask, info); | |
380 | +433 /* | |
381 | +434 * itimer signal ? | |
382 | +435 * | |
383 | +436 * itimers are process shared and we restart periodic | |
384 | +437 * itimers in the signal delivery path to prevent DoS | |
385 | +438 * attacks in the high resolution timer case. This is | |
386 | +439 * compliant with the old way of self restarting | |
387 | +440 * itimers, as the SIGALRM is a legacy signal and only | |
388 | +441 * queued once. Changing the restart behaviour to | |
389 | +442 * restart the timer in the signal dequeue path is | |
390 | +443 * reducing the timer noise on heavy loaded !highres | |
391 | +444 * systems too. | |
392 | +445 */ | |
393 | +446 if (unlikely(signr == SIGALRM)) { | |
394 | +... | |
395 | +489 } | |
396 | + | |
397 | +So instead of looking at 446, we should be looking at 431, which is the line | |
398 | +that executes just before 446. Here we see that what we are looking for is | |
399 | +&tsk->signal->shared_pending. | |
400 | + | |
401 | +Our next task is now to figure out which function puts items on this | |
402 | +"shared_pending" list. A crude but efficient tool is git grep: | |
403 | + | |
404 | + $ git grep -n 'shared_pending' kernel/ | |
405 | + ... | |
406 | + kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; | |
407 | + kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; | |
408 | + ... | |
409 | + | |
410 | +There were more results, but none of them were related to list operations, | |
411 | +and these were the only assignments. We inspect the line numbers more closely | |
412 | +and find that this is indeed where items are being added to the list: | |
413 | + | |
414 | +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | |
415 | +817 int group) | |
416 | +818 { | |
417 | +... | |
418 | +828 pending = group ? &t->signal->shared_pending : &t->pending; | |
419 | +... | |
420 | +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | |
421 | +852 (is_si_special(info) || | |
422 | +853 info->si_code >= 0))); | |
423 | +854 if (q) { | |
424 | +855 list_add_tail(&q->list, &pending->list); | |
425 | +... | |
426 | +890 } | |
427 | + | |
428 | +and: | |
429 | + | |
430 | +1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) | |
431 | +1310 { | |
432 | +.... | |
433 | +1339 pending = group ? &t->signal->shared_pending : &t->pending; | |
434 | +1340 list_add_tail(&q->list, &pending->list); | |
435 | +.... | |
436 | +1347 } | |
437 | + | |
438 | +In the first case, the list element we are looking for, "q", is being returned | |
439 | +from the function __sigqueue_alloc(), which looks like an allocation function. | |
440 | +Let's take a look at it: | |
441 | + | |
442 | +187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, | |
443 | +188 int override_rlimit) | |
444 | +189 { | |
445 | +190 struct sigqueue *q = NULL; | |
446 | +191 struct user_struct *user; | |
447 | +192 | |
448 | +193 /* | |
449 | +194 * We won't get problems with the target's UID changing under us | |
450 | +195 * because changing it requires RCU be used, and if t != current, the | |
451 | +196 * caller must be holding the RCU readlock (by way of a spinlock) and | |
452 | +197 * we use RCU protection here | |
453 | +198 */ | |
454 | +199 user = get_uid(__task_cred(t)->user); | |
455 | +200 atomic_inc(&user->sigpending); | |
456 | +201 if (override_rlimit || | |
457 | +202 atomic_read(&user->sigpending) <= | |
458 | +203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) | |
459 | +204 q = kmem_cache_alloc(sigqueue_cachep, flags); | |
460 | +205 if (unlikely(q == NULL)) { | |
461 | +206 atomic_dec(&user->sigpending); | |
462 | +207 free_uid(user); | |
463 | +208 } else { | |
464 | +209 INIT_LIST_HEAD(&q->list); | |
465 | +210 q->flags = 0; | |
466 | +211 q->user = user; | |
467 | +212 } | |
468 | +213 | |
469 | +214 return q; | |
470 | +215 } | |
471 | + | |
472 | +We see that this function initializes q->list, q->flags, and q->user. It seems | |
473 | +that now is the time to look at the definition of "struct sigqueue": | |
474 | + | |
475 | +14 struct sigqueue { | |
476 | +15 struct list_head list; | |
477 | +16 int flags; | |
478 | +17 siginfo_t info; | |
479 | +18 struct user_struct *user; | |
480 | +19 }; | |
481 | + | |
482 | +And, you might remember, it was a memcpy() on &first->info that caused the | |
483 | +warning, so this makes perfect sense. It also seems reasonable to assume that | |
484 | +it is the caller of __sigqueue_alloc() that has the responsibility of filling | |
485 | +out (initializing) this member. | |
486 | + | |
487 | +But just which fields of the struct were uninitialized? Let's look at | |
488 | +kmemcheck's report again: | |
489 | + | |
490 | +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | |
491 | +80000000000000000000000000000000000000000088ffff0000000000000000 | |
492 | + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | |
493 | + ^ | |
494 | + | |
495 | +These first two lines are the memory dump of the memory object itself, and the | |
496 | +shadow bytemap, respectively. The memory object itself is in this case | |
497 | +&first->info. Just beware that the start of this dump is NOT the start of the | |
498 | +object itself! The position of the caret (^) corresponds with the address of | |
499 | +the read (ffff88003e4a2024). | |
500 | + | |
501 | +The shadow bytemap dump legend is as follows: | |
502 | + | |
503 | + i - initialized | |
504 | + u - uninitialized | |
505 | + a - unallocated (memory has been allocated by the slab layer, but has not | |
506 | + yet been handed off to anybody) | |
507 | + f - freed (memory has been allocated by the slab layer, but has been freed | |
508 | + by the previous owner) | |
509 | + | |
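As a rough aid for reading the shadow line, the legend above can be turned into a small helper. This is a userspace sketch only; `shadow_state()` and `first_bad_byte()` are hypothetical names and do not exist in kmemcheck. It assumes the shadow line is a string of legend characters separated by spaces, as in the report shown earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Map one shadow character to its meaning from the legend, or NULL. */
static const char *shadow_state(char c)
{
	switch (c) {
	case 'i': return "initialized";
	case 'u': return "uninitialized";
	case 'a': return "unallocated";
	case 'f': return "freed";
	default:  return NULL;
	}
}

/*
 * Return the index (in bytes, from the start of the dump) of the first
 * byte that is not marked initialized, or -1 if every byte is 'i'.
 * Spaces between the legend characters are skipped.
 */
static int first_bad_byte(const char *shadow_line)
{
	int index = 0;

	assert(shadow_line != NULL);
	for (const char *p = shadow_line; *p != '\0'; p++) {
		if (*p == ' ')
			continue;
		if (*p != 'i')
			return index;
		index++;
	}
	return -1;
}
```

Note that the index is relative to the start of the dump, which (as stated above) is not necessarily the start of the object.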
510 | +In order to figure out where (relative to the start of the object) the | |
511 | +uninitialized memory was located, we have to look at the disassembly. For | |
512 | +that, we'll need the RIP address again: | |
513 | + | |
514 | +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | |
515 | + | |
516 | + $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: | |
517 | + ffffffff8104edc8: mov %r8,0x8(%r8) | |
518 | + ffffffff8104edcc: test %r10d,%r10d | |
519 | + ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> | |
520 | + ffffffff8104edd5: mov %rax,%rdx | |
521 | + ffffffff8104edd8: mov $0xc,%ecx | |
522 | + ffffffff8104eddd: mov %r13,%rdi | |
523 | + ffffffff8104ede0: mov $0x30,%eax | |
524 | + ffffffff8104ede5: mov %rdx,%rsi | |
525 | + ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) | |
526 | + ffffffff8104edea: test $0x2,%al | |
527 | + ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> | |
528 | + ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) | |
529 | + ffffffff8104edf0: test $0x1,%al | |
530 | + ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> | |
531 | + ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) | |
532 | + ffffffff8104edf5: mov %r8,%rdi | |
533 | + ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> | |
534 | + | |
535 | +As expected, it's the "rep movsl" instruction from the memcpy() that causes | |
536 | +the warning. We know that REP MOVSL uses the register RCX to count | |
537 | +the number of remaining iterations. By taking a look at the register dump | |
538 | +again (from the kmemcheck report), we can figure out how many bytes were left | |
539 | +to copy: | |
540 | + | |
541 | +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | |
542 | + | |
543 | +By looking at the disassembly, we also see that %ecx is being loaded with the | |
544 | +value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind | |
545 | +that this is the number of iterations, not bytes. And since this is a "long" | |
546 | +operation, we need to multiply by 4 to get the number of bytes. So this means | |
547 | +that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes | |
548 | +from the start of the object. | |
549 | + | |
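The arithmetic above can be written down as a tiny helper. This is only a sketch of the calculation, not part of any kernel tool; `rep_movs_offset()` is a hypothetical name. It takes the count loaded before the REP instruction, the count left in RCX at the time of the fault, and the element size (4 for MOVSL).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte offset, from the start of the copy, of the element a REP MOVS
 * instruction was working on when it faulted.
 *
 * initial_count:   value loaded into RCX/ECX before the instruction
 * remaining_count: value of RCX in the register dump
 * element_size:    1, 2, 4 or 8 (4 for "movsl")
 */
static uint64_t rep_movs_offset(uint64_t initial_count,
				uint64_t remaining_count,
				unsigned int element_size)
{
	assert(remaining_count <= initial_count);
	return (initial_count - remaining_count) * element_size;
}
```

With the values from the report, `rep_movs_offset(0xc, 0x9, 4)` gives 12, matching the figure derived above.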
550 | +We can now try to figure out which field of the "struct siginfo" was not | |
551 | +initialized. This is the beginning of the struct: | |
552 | + | |
553 | +40 typedef struct siginfo { | |
554 | +41 int si_signo; | |
555 | +42 int si_errno; | |
556 | +43 int si_code; | |
557 | +44 | |
558 | +45 union { | |
559 | +.. | |
560 | +92 } _sifields; | |
561 | +93 } siginfo_t; | |
562 | + | |
563 | +On 64-bit, the int is 4 bytes long, so it must be the union member that has | |
564 | +not been initialized. We can verify this using gdb: | |
565 | + | |
566 | + $ gdb vmlinux | |
567 | + ... | |
568 | + (gdb) p &((struct siginfo *) 0)->_sifields | |
569 | + $1 = (union {...} *) 0x10 | |
570 | + | |
571 | +Actually, it seems that the union member is located at offset 0x10 -- which | |
572 | +means that gcc has inserted 4 bytes of padding between the members si_code | |
573 | +and _sifields. We can now get a fuller picture of the memory dump: | |
574 | + | |
575 | + _----------------------------=> si_code | |
576 | + / _--------------------=> (padding) | |
577 | + | / _------------=> _sifields(._kill._pid) | |
578 | + | | / _----=> _sifields(._kill._uid) | |
579 | + | | | / | |
580 | +-------|-------|-------|-------| | |
581 | +80000000000000000000000000000000000000000088ffff0000000000000000 | |
582 | + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | |
583 | + | |
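The padding reported by gdb can also be reproduced in userspace with offsetof(). The struct below is a simplified stand-in, NOT the kernel's definition of struct siginfo; `struct fake_siginfo` and `padding_before_fields()` are hypothetical names. The sketch assumes the padding comes from a pointer-sized member inside the union, which raises the union's alignment to 8 bytes on 64-bit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct siginfo (illustration only). */
struct fake_siginfo {
	int si_signo;
	int si_errno;
	int si_code;

	union {
		struct {
			int pid;
			unsigned int uid;
		} kill;
		struct {
			void *addr;	/* pointer member sets the union's alignment */
		} fault;
	} fields;
};

/* Number of padding bytes the compiler placed between si_code and fields. */
static size_t padding_before_fields(void)
{
	size_t end_of_code = offsetof(struct fake_siginfo, si_code) + sizeof(int);

	assert(offsetof(struct fake_siginfo, fields) >= end_of_code);
	return offsetof(struct fake_siginfo, fields) - end_of_code;
}
```

On a typical 64-bit build this reports 4 bytes of padding, placing the union at offset 0x10; on a 32-bit build the padding would normally be absent.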
584 | +This allows us to realize another important fact: si_code contains the value | |
585 | +0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are | |
586 | +really the number 0x00000080. With a bit of research, we find that this is | |
587 | +actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: | |
588 | + | |
589 | +144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ | |
590 | + | |
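The byte-order step above can be checked mechanically. The helper below is a sketch with a hypothetical name, `le32_from_dump()`; it assembles four bytes, in the order they appear in the memory dump, into the 32-bit value a little-endian CPU would load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Combine four consecutive dump bytes into a 32-bit value using
 * little-endian order: the first byte in memory is the least
 * significant one.
 */
static uint32_t le32_from_dump(const uint8_t bytes[4])
{
	assert(bytes != NULL);
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

Feeding it the bytes 80 00 00 00 from the start of the dump yields 0x00000080, the value compared against SI_KERNEL above.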
591 | +This macro is used in exactly one place in the x86 kernel: in send_signal() | |
592 | +in kernel/signal.c: | |
593 | + | |
594 | +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | |
595 | +817 int group) | |
596 | +818 { | |
597 | +... | |
598 | +828 pending = group ? &t->signal->shared_pending : &t->pending; | |
599 | +... | |
600 | +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | |
601 | +852 (is_si_special(info) || | |
602 | +853 info->si_code >= 0))); | |
603 | +854 if (q) { | |
604 | +855 list_add_tail(&q->list, &pending->list); | |
605 | +856 switch ((unsigned long) info) { | |
606 | +... | |
607 | +865 case (unsigned long) SEND_SIG_PRIV: | |
608 | +866 q->info.si_signo = sig; | |
609 | +867 q->info.si_errno = 0; | |
610 | +868 q->info.si_code = SI_KERNEL; | |
611 | +869 q->info.si_pid = 0; | |
612 | +870 q->info.si_uid = 0; | |
613 | +871 break; | |
614 | +... | |
615 | +890 } | |
616 | + | |
617 | +Not only does this match with the .si_code member, it also matches the place | |
618 | +we found earlier when looking for where siginfo_t objects are enqueued on the | |
619 | +"shared_pending" list. | |
620 | + | |
621 | +So to sum up: It seems that it is the padding introduced by the compiler | |
622 | +between two struct fields that is uninitialized, and this gets reported when | |
623 | +we do a memcpy() on the struct. This means that we have identified a false | |
624 | +positive warning. | |
625 | + | |
626 | +Normally, kmemcheck will not report uninitialized accesses in memcpy() calls | |
627 | +when both the source and destination addresses are tracked. (Instead, we copy | |
628 | +the shadow bytemap as well). In this case, the destination address clearly | |
629 | +was not tracked. We can dig a little deeper into the stack trace from above: | |
630 | + | |
631 | + arch/x86/kernel/signal.c:805 | |
632 | + arch/x86/kernel/signal.c:871 | |
633 | + arch/x86/kernel/entry_64.S:694 | |
634 | + | |
635 | +And we clearly see that the destination siginfo object is located on the | |
636 | +stack: | |
637 | + | |
638 | +782 static void do_signal(struct pt_regs *regs) | |
639 | +783 { | |
640 | +784 struct k_sigaction ka; | |
641 | +785 siginfo_t info; | |
642 | +... | |
643 | +804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); | |
644 | +... | |
645 | +854 } | |
646 | + | |
647 | +And this &info is what eventually gets passed to copy_siginfo() as the | |
648 | +destination argument. | |
649 | + | |
650 | +Now, even though we didn't find an actual error here, the example is still a | |
651 | +good one, because it shows how one would go about finding out what the report | |
652 | +was all about. | |
653 | + | |
654 | + | |
655 | +3.4. Annotating false positives | |
656 | +=============================== | |
657 | + | |
658 | +There are a few different ways to make annotations in the source code that | |
659 | +will keep kmemcheck from checking and reporting certain allocations. Here | |
660 | +they are: | |
661 | + | |
662 | + o __GFP_NOTRACK_FALSE_POSITIVE | |
663 | + | |
664 | + This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore | |
665 | + also to other functions that end up calling one of these) to indicate | |
666 | + that the allocation should not be tracked because it would lead to | |
667 | + a false positive report. This is a "big hammer" way of silencing | |
668 | + kmemcheck; after all, even if the false positive pertains to | |
669 | + a particular field in a struct, for example, we will now lose the | |
670 | + ability to find (real) errors in other parts of the same struct. | |
671 | + | |
672 | + Example: | |
673 | + | |
674 | + /* No warnings will ever trigger on accessing any part of x */ | |
675 | + x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); | |
676 | + | |
677 | + o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and | |
678 | + kmemcheck_annotate_bitfield(ptr, name) | |
679 | + | |
680 | + The first two of these three macros can be used inside struct | |
681 | + definitions to signal, respectively, the beginning and end of a | |
682 | + bitfield. Additionally, this will assign the bitfield a name, which | |
683 | + is given as an argument to the macros. | |
684 | + | |
685 | + Having used these markers, one can later use | |
686 | + kmemcheck_annotate_bitfield() at the point of allocation, to indicate | |
687 | + which parts of the allocation are part of a bitfield. | |
688 | + | |
689 | + Example: | |
690 | + | |
691 | + struct foo { | |
692 | + int x; | |
693 | + | |
694 | + kmemcheck_bitfield_begin(flags); | |
695 | + int flag_a:1; | |
696 | + int flag_b:1; | |
697 | + kmemcheck_bitfield_end(flags); | |
698 | + | |
699 | + int y; | |
700 | + }; | |
701 | + | |
702 | + struct foo *x = kmalloc(sizeof *x, GFP_KERNEL); | |
703 | + | |
704 | + /* No warnings will trigger on accessing the bitfield of x */ | |
705 | + kmemcheck_annotate_bitfield(x, flags); | |
706 | + | |
707 | + Note that kmemcheck_annotate_bitfield() can be used even before the | |
708 | + return value of kmalloc() is checked -- in other words, passing NULL | |
709 | + as the first argument is legal (and will do nothing). | |
710 | + | |
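To make the idea of begin/end markers more concrete, here is a userspace sketch of one plausible way such markers can delimit a byte range. Everything prefixed with demo_ is a hypothetical stand-in written for this example, not the kernel's definition, and the zero-length array members rely on a GNU C extension. The shadow buffer (one byte per struct byte, 1 meaning "initialized") is likewise an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical markers: zero-sized members that only record an offset. */
#define demo_bitfield_begin(name)	int name##_begin[0]
#define demo_bitfield_end(name)		int name##_end[0]

/* Number of bytes between the two markers of a named bitfield. */
#define demo_bitfield_size(type, name)					\
	(offsetof(type, name##_end) - offsetof(type, name##_begin))

struct foo {
	int x;

	demo_bitfield_begin(flags);
	int flag_a:1;
	int flag_b:1;
	demo_bitfield_end(flags);

	int y;
};

/* Size in bytes of the storage that holds the "flags" bitfield. */
size_t demo_flags_size(void)
{
	return demo_bitfield_size(struct foo, flags);
}

/*
 * Mark only the bitfield bytes of a struct foo shadow as initialized.
 * A NULL shadow is accepted and ignored, mirroring the note above.
 */
void demo_annotate_flags(unsigned char *shadow)
{
	if (!shadow)
		return;

	memset(shadow + offsetof(struct foo, flags_begin), 1,
	       demo_bitfield_size(struct foo, flags));
}
```

The point of the design is that x and y stay fully checked: only the bytes between the two markers are exempted, in contrast to the "big hammer" of __GFP_NOTRACK_FALSE_POSITIVE.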
711 | + | |
712 | +4. Reporting errors | |
713 | +=================== | |
714 | + | |
715 | +As we have seen, kmemcheck will produce false positive reports. Therefore, it | |
716 | +is not very wise to blindly post kmemcheck warnings to mailing lists and | |
717 | +maintainers. Instead, I encourage maintainers and developers to find errors | |
718 | +in their own code. If you get a warning, you can try to work around it, try | |
719 | +to figure out if it's a real error or not, or simply ignore it. Most | |
720 | +developers know their own code and will quickly and efficiently determine the | |
721 | +root cause of a kmemcheck report. This is therefore also the most efficient | |
722 | +way to work with kmemcheck. | |
723 | + | |
724 | +That said, we (the kmemcheck maintainers) will always be on the lookout for | |
725 | +false positives that we can annotate and silence. So whatever you find, | |
726 | +please drop us a note privately! Kernel configs and steps to reproduce (if | |
727 | +available) are of course a great help too. | |
728 | + | |
729 | +Happy hacking! | |
730 | + | |
731 | + | |
732 | +5. Technical description | |
733 | +======================== | |
734 | + | |
735 | +kmemcheck works by marking memory pages non-present. This means that whenever | |
736 | +somebody attempts to access the page, a page fault is generated. The page | |
737 | +fault handler notices that the page was in fact only hidden, and so it calls | |
738 | +on the kmemcheck code to make further investigations. | |
739 | + | |
740 | +When the investigations are completed, kmemcheck "shows" the page by marking | |
741 | +it present (as it would be under normal circumstances). This way, the | |
742 | +interrupted code can continue as usual. | |
743 | + | |
744 | +But after the instruction has been executed, we should hide the page again, so | |
745 | +that we can catch the next access too! To do this, kmemcheck uses a debugging | |
746 | +feature of the processor, namely single-stepping. When the processor has | |
747 | +finished the one instruction that generated the memory access, a debug | |
748 | +exception is raised. From here, we simply hide the page again and continue | |
749 | +execution, this time with the single-stepping feature turned off. | |
750 | + | |
751 | +kmemcheck requires some assistance from the memory allocator in order to work. | |
752 | +The memory allocator needs to | |
753 | + | |
754 | + 1. Tell kmemcheck about newly allocated pages and pages that are about to | |
755 | + be freed. This allows kmemcheck to set up and tear down the shadow memory | |
756 | + for the pages in question. The shadow memory stores the status of each | |
757 | + byte in the allocation proper, e.g. whether it is initialized or | |
758 | + uninitialized. | |
759 | + | |
760 | + 2. Tell kmemcheck which parts of memory should be marked uninitialized. | |
761 | + There are actually a few more states, such as "not yet allocated" and | |
762 | + "recently freed". | |
763 | + | |
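The bookkeeping described in the two points above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel implementation: the state names, the fixed-size pool, and all function names are hypothetical, and the page-fault and single-stepping machinery is left out entirely. Only the per-byte shadow states are modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical state names; one shadow byte per tracked byte. */
enum shadow_state {
	SHADOW_UNALLOCATED,	/* not yet allocated */
	SHADOW_UNINITIALIZED,	/* allocated, never written */
	SHADOW_INITIALIZED,	/* written at least once */
	SHADOW_FREED,		/* recently freed */
};

#define POOL_SIZE 64

unsigned char pool[POOL_SIZE];		/* the "allocation proper" */
unsigned char shadow[POOL_SIZE];	/* zero-filled: SHADOW_UNALLOCATED */

/* Allocator hook: a fresh object starts out uninitialized. */
void shadow_alloc(size_t off, size_t len)
{
	memset(shadow + off, SHADOW_UNINITIALIZED, len);
}

/* Allocator hook: a freed object must not be read any more. */
void shadow_free(size_t off, size_t len)
{
	memset(shadow + off, SHADOW_FREED, len);
}

/* A tracked write stores the data and marks the bytes initialized. */
void tracked_write(size_t off, const void *src, size_t len)
{
	memcpy(pool + off, src, len);
	memset(shadow + off, SHADOW_INITIALIZED, len);
}

/*
 * Count the bytes in [off, off + len) that a read should warn about,
 * i.e. every byte whose state is anything other than "initialized".
 */
size_t tracked_read_check(size_t off, size_t len)
{
	size_t i, bad = 0;

	for (i = 0; i < len; i++)
		if (shadow[off + i] != SHADOW_INITIALIZED)
			bad++;

	return bad;
}
```

In this model, allocating 8 bytes and writing only the first 4 leaves 4 bytes that tracked_read_check() would flag, and freeing the object makes all 8 bytes suspect again.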
764 | +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return | |
765 | +memory that can take page faults because of kmemcheck. | |
766 | + | |
767 | +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still | |
768 | +request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. | |
769 | +This does not prevent the page faults from occurring, but it does mark the | |
770 | +object in question as being initialized so that no warnings will ever be | |
771 | +produced for this object. | |
772 | + | |
773 | +Currently, the SLAB and SLUB allocators are supported by kmemcheck. |