Commit e594c8de3bd4e7732ed3340fb01e18ec94b12df2

Authored by Vegard Nossum
1 parent 456db8cc45

kmemcheck: add the kmemcheck documentation

Thanks to Sitsofe Wheeler, Randy Dunlap, and Jonathan Corbet for providing
input and feedback on this!

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>

Showing 1 changed file with 773 additions and 0 deletions

Documentation/kmemcheck.txt
  1 +GETTING STARTED WITH KMEMCHECK
  2 +==============================
  3 +
  4 +Vegard Nossum <vegardno@ifi.uio.no>
  5 +
  6 +
  7 +Contents
  8 +========
  9 +0. Introduction
  10 +1. Downloading
  11 +2. Configuring and compiling
  12 +3. How to use
  13 +3.1. Booting
  14 +3.2. Run-time enable/disable
  15 +3.3. Debugging
  16 +3.4. Annotating false positives
  17 +4. Reporting errors
  18 +5. Technical description
  19 +
  20 +
  21 +0. Introduction
  22 +===============
  23 +
  24 +kmemcheck is a debugging feature for the Linux kernel. More specifically, it
  25 +is a dynamic checker that detects and warns about some uses of uninitialized
  26 +memory.
  27 +
  28 +Userspace programmers might be familiar with Valgrind's memcheck. The main
  29 +difference between memcheck and kmemcheck is that memcheck works for userspace
  30 +programs only, and kmemcheck works for the kernel only. The implementations
  31 +are of course vastly different. Because of this, kmemcheck is not as accurate
  32 +as memcheck, but it turns out to be good enough in practice to discover real
  33 +programmer errors that the compiler is not able to find through static
  34 +analysis.
  35 +
  36 +Enabling kmemcheck on a kernel will probably slow it down to the extent that
  37 +the machine will not be usable for normal workloads such as an
  38 +interactive desktop. kmemcheck will also cause the kernel to use about twice
  39 +as much memory as normal. For this reason, kmemcheck is strictly a debugging
  40 +feature.
  41 +
  42 +
  43 +1. Downloading
  44 +==============
  45 +
  46 +kmemcheck can only be downloaded using git. If you want to write patches
  47 +against the current code, you should use the kmemcheck development branch of
  48 +the tip tree. It is also possible to use the linux-next tree, which also
  49 +includes the latest version of kmemcheck.
  50 +
  51 +Assuming that you've already cloned the linux-2.6.git repository, all you
  52 +have to do is add the -tip tree as a remote, like this:
  53 +
  54 + $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
  55 +
  56 +To actually download the tree, fetch the remote:
  57 +
  58 + $ git fetch tip
  59 +
  60 +And to check out a new local branch with the kmemcheck code:
  61 +
  62 + $ git checkout -b kmemcheck tip/kmemcheck
  63 +
  64 +General instructions for the -tip tree can be found here:
  65 +http://people.redhat.com/mingo/tip.git/readme.txt
  66 +
  67 +
  68 +2. Configuring and compiling
  69 +============================
  70 +
  71 +kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of
  72 +configuration variables must have specific settings in order for the kmemcheck
  73 +menu to even appear in "menuconfig". These are:
  74 +
  75 + o CONFIG_CC_OPTIMIZE_FOR_SIZE=n
  76 +
  77 + This option is located under "General setup" / "Optimize for size".
  78 +
  79 + When optimizing for size, gcc uses certain optimizations that usually
  80 + lead to false positive warnings from kmemcheck. An example of this is a
  81 + 16-bit field in a struct, where gcc may load 32 bits, then discard the
  82 + upper 16 bits. kmemcheck sees only the 32-bit load, and may trigger a
  83 + warning for the upper 16 bits (if they're uninitialized).
  84 +
  85 + o CONFIG_SLAB=y or CONFIG_SLUB=y
  86 +
  87 + This option is located under "General setup" / "Choose SLAB
  88 + allocator".
  89 +
  90 + o CONFIG_FUNCTION_TRACER=n
  91 +
  92 + This option is located under "Kernel hacking" / "Tracers" / "Kernel
  93 + Function Tracer"
  94 +
  95 + When function tracing is compiled in, gcc emits a call to another
  96 + function at the beginning of every function. This means that when the
  97 + page fault handler is called, the ftrace framework will be called
  98 + before kmemcheck has had a chance to handle the fault. If ftrace then
  99 + modifies memory that was tracked by kmemcheck, the result is an
  100 + endless recursive page fault.
  101 +
  102 + o CONFIG_DEBUG_PAGEALLOC=n
  103 +
  104 + This option is located under "Kernel hacking" / "Debug page memory
  105 + allocations".
  106 +
  107 +In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also
  108 +located under "Kernel hacking". With this, you will be able to get line number
  109 +information from the kmemcheck warnings, which is extremely valuable in
  110 +debugging a problem. This option is not mandatory, however, because it slows
  111 +down the compilation process and produces a much bigger kernel image.
  112 +
  113 +Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck:
  114 +trap use of uninitialized memory"). Here follows a description of the
  115 +kmemcheck configuration variables:
  116 +
  117 + o CONFIG_KMEMCHECK
  118 +
  119 + This must be enabled in order to use kmemcheck at all...
  120 +
  121 + o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT
  122 +
  123 + This option controls the status of kmemcheck at boot-time. "Enabled"
  124 + will enable kmemcheck right from the start, "disabled" will boot the
  125 + kernel as normal (but with the kmemcheck code compiled in, so it can
  126 + be enabled at run-time after the kernel has booted), and "one-shot" is
  127 + a special mode which will turn kmemcheck off automatically after
  128 + detecting the first use of uninitialized memory.
  129 +
  130 + If you are using kmemcheck to actively debug a problem, then you
  131 + probably want to choose "enabled" here.
  132 +
  133 + The one-shot mode is mostly useful in automated test setups because it
  134 + can prevent floods of warnings and increase the chances of the machine
  135 + surviving in case something is really wrong. In other cases, the one-
  136 + shot mode could actually be counter-productive because it would turn
  137 + itself off at the very first error -- in the case of a false positive
  138 + too -- and this would get in the way of debugging the specific
  139 + problem you were interested in.
  140 +
  141 + If you would like to use your kernel as normal, but with a chance to
  142 + enable kmemcheck in case of some problem, it might be a good idea to
  143 + choose "disabled" here. When kmemcheck is disabled, most of the run-
  144 + time overhead is not incurred, and the kernel will be almost as fast
  145 + as normal.
  146 +
  147 + o CONFIG_KMEMCHECK_QUEUE_SIZE
  148 +
  149 + Select the maximum number of error reports to store in an internal
  150 + (fixed-size) buffer. Since errors can occur virtually anywhere and in
  151 + any context, we need a temporary storage area which is guaranteed not
  152 + to generate any other page faults when accessed. The queue will be
  153 + emptied as soon as a tasklet may be scheduled. If the queue is full,
  154 + new error reports will be lost.
  155 +
  156 + The default value of 64 is probably fine. If some code produces more
  157 + than 64 errors within an irqs-off section, then the code is likely to
  158 + produce many, many more, too, and these additional reports seldom give
  159 + any more information (the first report is usually the most valuable
  160 + anyway).
  161 +
  162 + This number might have to be adjusted if you are not using serial
  163 + console or similar to capture the kernel log. If you are using the
  164 + "dmesg" command to save the log, then getting a lot of kmemcheck
  165 + warnings might overflow the kernel log itself, and the earlier reports
  166 + will get lost in that way instead. Try setting this to 10 or so on
  167 + such a setup.
  168 +
  169 + o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT
  170 +
  171 + Select the number of shadow bytes to save along with each entry of the
  172 + error-report queue. These bytes indicate what parts of an allocation
  173 + are initialized, uninitialized, etc. and will be displayed when an
  174 + error is detected to help the debugging of a particular problem.
  175 +
  176 + The number entered here is actually the logarithm of the number of
  177 + bytes that will be saved. So if you pick for example 5 here, kmemcheck
  178 + will save 2^5 = 32 bytes.
  179 +
  180 + The default value should be fine for debugging most problems. It also
  181 + fits nicely within 80 columns.
  182 +
  183 + o CONFIG_KMEMCHECK_PARTIAL_OK
  184 +
  185 + This option (when enabled) works around certain GCC optimizations that
  186 + produce 32-bit reads from 16-bit variables where the upper 16 bits are
  187 + thrown away afterwards.
  188 +
  189 + The default value (enabled) is recommended. This may of course hide
  190 + some real errors, but disabling it would probably produce a lot of
  191 + false positives.
  192 +
  193 + o CONFIG_KMEMCHECK_BITOPS_OK
  194 +
  195 + This option silences warnings that would be generated for bit-field
  196 + accesses where not all the bits are initialized at the same time. This
  197 + may also hide some real bugs.
  198 +
  199 + This option is probably obsolete, or it should be replaced with
  200 + the kmemcheck-/bitfield-annotations for the code in question. The
  201 + default value is therefore fine.
  202 +
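The relationship between CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT and the width of
a report can be sketched in plain userspace C. This is only an illustration
of the arithmetic described above; the function names are hypothetical and
do not exist in the kernel.

```c
#include <stddef.h>

/* Number of shadow bytes saved per report entry: 2^shift. */
static size_t shadow_copy_bytes(unsigned int shift)
{
	return (size_t)1 << shift;
}

/*
 * Width of one dump line in characters. The hex dump prints two hex digits
 * per byte, and the shadow line prints a space plus one status letter per
 * byte, so both lines use two columns per saved byte.
 */
static size_t dump_columns(unsigned int shift)
{
	return 2 * shadow_copy_bytes(shift);
}
```

With a shift of 5, shadow_copy_bytes() gives 32 bytes and dump_columns()
gives 64 columns, which fits within 80. A shift of 6 would already need 128
columns.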
  203 +Now compile the kernel as usual.
  204 +
  205 +
  206 +3. How to use
  207 +=============
  208 +
  209 +3.1. Booting
  210 +============
  211 +
  212 +First some information about the command-line options. There is only one
  213 +option specific to kmemcheck, and this is called "kmemcheck". It can be used
  214 +to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT
  215 +option. Its possible settings are:
  216 +
  217 + o kmemcheck=0 (disabled)
  218 + o kmemcheck=1 (enabled)
  219 + o kmemcheck=2 (one-shot mode)
  220 +
  221 +If SLUB debugging has been enabled in the kernel, it may take precedence over
  222 +kmemcheck in such a way that the slab caches which are under SLUB debugging
  223 +will not be tracked by kmemcheck. In order to ensure that this doesn't happen
  224 +(even though it shouldn't by default), use SLUB's boot option "slub_debug",
  225 +like this: slub_debug=-
  226 +
  227 +In fact, this option may also be used for fine-grained control over SLUB vs.
  228 +kmemcheck. For example, if the command line includes "kmemcheck=1
  229 +slub_debug=,dentry", then SLUB debugging will be used only for the "dentry"
  230 +slab cache, and with kmemcheck tracking all the other caches. This is advanced
  231 +usage, however, and is not generally recommended.
  232 +
  233 +
  234 +3.2. Run-time enable/disable
  235 +============================
  236 +
  237 +When the kernel has booted, it is possible to enable or disable kmemcheck at
  238 +run-time. WARNING: This feature is still experimental and may cause false
  239 +positive warnings to appear. Therefore, try not to use this. If you find that
  240 +it doesn't work properly (e.g. you see an unreasonable number of warnings), I
  241 +will be happy to take bug reports.
  242 +
  243 +Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.:
  244 +
  245 + $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck
  246 +
  247 +The numbers are the same as for the kmemcheck= command-line option.
  248 +
  249 +
  250 +3.3. Debugging
  251 +==============
  252 +
  253 +A typical report will look something like this:
  254 +
  255 +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
  256 +80000000000000000000000000000000000000000088ffff0000000000000000
  257 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
  258 + ^
  259 +
  260 +Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A
  261 +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
  262 +RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002
  263 +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
  264 +RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84
  265 +RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000
  266 +R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e
  267 +R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8
  268 +FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000
  269 +CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
  270 +CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0
  271 +DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  272 +DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
  273 + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
  274 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
  275 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
  276 + [<ffffffff8100c7b5>] int_signal+0x12/0x17
  277 + [<ffffffffffffffff>] 0xffffffffffffffff
  278 +
  279 +The single most valuable piece of information in this report is the RIP (or
  280 +EIP on 32-bit) value. This will help us pinpoint exactly which instruction
  281 +caused the warning.
  282 +
  283 +If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do
  284 +is give this address to the addr2line program, like this:
  285 +
  286 + $ addr2line -e vmlinux -i ffffffff8104ede8
  287 + arch/x86/include/asm/string_64.h:12
  288 + include/asm-generic/siginfo.h:287
  289 + kernel/signal.c:380
  290 + kernel/signal.c:410
  291 +
  292 +The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must
  293 +be the vmlinux of the kernel that produced the warning in the first place! If
  294 +not, the line number information will almost certainly be wrong.
  295 +
  296 +The "-i" tells addr2line to also print the line numbers of inlined functions.
  297 +In this case, the flag was very important, because otherwise, it would only
  298 +have printed the first line, which is just a call to memcpy(), which could be
  299 +called from a thousand places in the kernel, and is therefore not very useful.
  300 +These inlined functions would not show up in the stack trace above, simply
  301 +because the kernel doesn't load the extra debugging information. This
  302 +technique can of course be used with ordinary kernel oopses as well.
  303 +
  304 +In this case, it's the caller of memcpy() that is interesting, and it can be
  305 +found in include/asm-generic/siginfo.h, line 287:
  306 +
  307 +281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
  308 +282 {
  309 +283 if (from->si_code < 0)
  310 +284 memcpy(to, from, sizeof(*to));
  311 +285 else
  312 +286 /* _sigchld is currently the largest know union member */
  313 +287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
  314 +288 }
  315 +
  316 +Since this was a read (kmemcheck usually warns about reads only, though it can
  317 +warn about writes to unallocated or freed memory as well), it was probably the
  318 +"from" argument which contained some uninitialized bytes. Following the chain
  319 +of calls, we move upwards to see where "from" was allocated or initialized,
  320 +kernel/signal.c, line 380:
  321 +
  322 +359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
  323 +360 {
  324 +...
  325 +367 list_for_each_entry(q, &list->list, list) {
  326 +368 if (q->info.si_signo == sig) {
  327 +369 if (first)
  328 +370 goto still_pending;
  329 +371 first = q;
  330 +...
  331 +377 if (first) {
  332 +378 still_pending:
  333 +379 list_del_init(&first->list);
  334 +380 copy_siginfo(info, &first->info);
  335 +381 __sigqueue_free(first);
  336 +...
  337 +392 }
  338 +393 }
  339 +
  340 +Here, it is &first->info that is being passed on to copy_siginfo(). The
  341 +variable "first" was found on a list -- passed in as the second argument to
  342 +collect_signal(). We continue our journey through the stack, to figure out
  343 +where the item on "list" was allocated or initialized. We move to line 410:
  344 +
  345 +395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
  346 +396 siginfo_t *info)
  347 +397 {
  348 +...
  349 +410 collect_signal(sig, pending, info);
  350 +...
  351 +414 }
  352 +
  353 +Now we need to follow the "pending" pointer, since that is being passed on to
  354 +collect_signal() as "list". At this point, we've run out of lines from the
  355 +"addr2line" output. Not to worry, we just paste the next addresses from the
  356 +kmemcheck stack dump, i.e.:
  357 +
  358 + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
  359 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
  360 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
  361 + [<ffffffff8100c7b5>] int_signal+0x12/0x17
  362 +
  363 + $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \
  364 + ffffffff8100b87d ffffffff8100c7b5
  365 + kernel/signal.c:446
  366 + kernel/signal.c:1806
  367 + arch/x86/kernel/signal.c:805
  368 + arch/x86/kernel/signal.c:871
  369 + arch/x86/kernel/entry_64.S:694
  370 +
  371 +Remember that since these addresses were found on the stack and not as the
  372 +RIP value, they actually point to the _next_ instruction (they are return
  373 +addresses). This becomes obvious when we look at the code for line 446:
  374 +
  375 +422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  376 +423 {
  377 +...
  378 +431 signr = __dequeue_signal(&tsk->signal->shared_pending,
  379 +432 mask, info);
  380 +433 /*
  381 +434 * itimer signal ?
  382 +435 *
  383 +436 * itimers are process shared and we restart periodic
  384 +437 * itimers in the signal delivery path to prevent DoS
  385 +438 * attacks in the high resolution timer case. This is
  386 +439 * compliant with the old way of self restarting
  387 +440 * itimers, as the SIGALRM is a legacy signal and only
  388 +441 * queued once. Changing the restart behaviour to
  389 +442 * restart the timer in the signal dequeue path is
  390 +443 * reducing the timer noise on heavy loaded !highres
  391 +444 * systems too.
  392 +445 */
  393 +446 if (unlikely(signr == SIGALRM)) {
  394 +...
  395 +489 }
  396 +
  397 +So instead of looking at 446, we should be looking at 431, which is the line
  398 +that executes just before 446. Here we see that what we are looking for is
  399 +&tsk->signal->shared_pending.
  400 +
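A small refinement that is not part of the walkthrough above: a common
convention in unwinders is to look up a return address minus one byte, so
that the lookup lands inside the call instruction rather than on the
instruction after it. The helper below is a hypothetical sketch of that
adjustment; whether the adjusted address resolves to exactly line 431
depends on the debug information, so no output is claimed here.

```c
#include <stdint.h>

/*
 * Address to hand to a symbolizer such as addr2line.
 *
 * An RIP/EIP value already points at the faulting instruction and is used
 * unchanged. A return address found on the stack points at the instruction
 * after the call, so we step back one byte to land inside the call itself.
 */
static uint64_t lookup_address(uint64_t addr, int is_return_address)
{
	return is_return_address ? addr - 1 : addr;
}
```

For the dequeue_signal frame above, the return address ffffffff8104f04e
would be looked up as ffffffff8104f04d.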
  401 +Our next task is now to figure out which function puts items on this
  402 +"shared_pending" list. A crude but efficient tool is git grep:
  403 +
  404 + $ git grep -n 'shared_pending' kernel/
  405 + ...
  406 + kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending;
  407 + kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending;
  408 + ...
  409 +
  410 +There were more results, but none of them were related to list operations,
  411 +and these were the only assignments. We inspect the line numbers more closely
  412 +and find that this is indeed where items are being added to the list:
  413 +
  414 +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
  415 +817 int group)
  416 +818 {
  417 +...
  418 +828 pending = group ? &t->signal->shared_pending : &t->pending;
  419 +...
  420 +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
  421 +852 (is_si_special(info) ||
  422 +853 info->si_code >= 0)));
  423 +854 if (q) {
  424 +855 list_add_tail(&q->list, &pending->list);
  425 +...
  426 +890 }
  427 +
  428 +and:
  429 +
  430 +1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
  431 +1310 {
  432 +....
  433 +1339 pending = group ? &t->signal->shared_pending : &t->pending;
  434 +1340 list_add_tail(&q->list, &pending->list);
  435 +....
  436 +1347 }
  437 +
  438 +In the first case, the list element we are looking for, "q", is being returned
  439 +from the function __sigqueue_alloc(), which looks like an allocation function.
  440 +Let's take a look at it:
  441 +
  442 +187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags,
  443 +188 int override_rlimit)
  444 +189 {
  445 +190 struct sigqueue *q = NULL;
  446 +191 struct user_struct *user;
  447 +192
  448 +193 /*
  449 +194 * We won't get problems with the target's UID changing under us
  450 +195 * because changing it requires RCU be used, and if t != current, the
  451 +196 * caller must be holding the RCU readlock (by way of a spinlock) and
  452 +197 * we use RCU protection here
  453 +198 */
  454 +199 user = get_uid(__task_cred(t)->user);
  455 +200 atomic_inc(&user->sigpending);
  456 +201 if (override_rlimit ||
  457 +202 atomic_read(&user->sigpending) <=
  458 +203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur)
  459 +204 q = kmem_cache_alloc(sigqueue_cachep, flags);
  460 +205 if (unlikely(q == NULL)) {
  461 +206 atomic_dec(&user->sigpending);
  462 +207 free_uid(user);
  463 +208 } else {
  464 +209 INIT_LIST_HEAD(&q->list);
  465 +210 q->flags = 0;
  466 +211 q->user = user;
  467 +212 }
  468 +213
  469 +214 return q;
  470 +215 }
  471 +
  472 +We see that this function initializes q->list, q->flags, and q->user. It seems
  473 +that now is the time to look at the definition of "struct sigqueue", e.g.:
  474 +
  475 +14 struct sigqueue {
  476 +15 struct list_head list;
  477 +16 int flags;
  478 +17 siginfo_t info;
  479 +18 struct user_struct *user;
  480 +19 };
  481 +
  482 +And, you might remember, it was a memcpy() on &first->info that caused the
  483 +warning, so this makes perfect sense. It also seems reasonable to assume that
  484 +it is the caller of __sigqueue_alloc() that has the responsibility of filling
  485 +out (initializing) this member.
  486 +
  487 +But just which fields of the struct were uninitialized? Let's look at
  488 +kmemcheck's report again:
  489 +
  490 +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
  491 +80000000000000000000000000000000000000000088ffff0000000000000000
  492 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
  493 + ^
  494 +
  495 +These first two lines are the memory dump of the memory object itself, and the
  496 +shadow bytemap, respectively. The memory object itself is in this case
  497 +&first->info. Just beware that the start of this dump is NOT the start of the
  498 +object itself! The position of the caret (^) corresponds with the address of
  499 +the read (ffff88003e4a2024).
  500 +
  501 +The shadow bytemap dump legend is as follows:
  502 +
  503 + i - initialized
  504 + u - uninitialized
  505 + a - unallocated (memory has been allocated by the slab layer, but has not
  506 + yet been handed off to anybody)
  507 + f - freed (memory has been allocated by the slab layer, but has been freed
  508 + by the previous owner)
  509 +
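To make the legend concrete, here is a minimal userspace sketch of how a
row of shadow bytes could be turned into the letter line of a report. The
enum and function names are hypothetical and chosen for illustration; they
are not kmemcheck's own identifiers.

```c
#include <stddef.h>

/* Hypothetical status values, one per tracked byte. */
enum shadow_status {
	SHADOW_UNALLOCATED,
	SHADOW_UNINITIALIZED,
	SHADOW_INITIALIZED,
	SHADOW_FREED,
};

/* Map one shadow byte to the letter used in the report legend. */
static char shadow_letter(enum shadow_status s)
{
	switch (s) {
	case SHADOW_UNALLOCATED:   return 'a';
	case SHADOW_UNINITIALIZED: return 'u';
	case SHADOW_INITIALIZED:   return 'i';
	case SHADOW_FREED:         return 'f';
	}
	return '?';
}

/*
 * Render n shadow bytes as " i i u u ..." into out, which must hold at
 * least 2 * n + 1 characters (including the terminating NUL).
 */
static void render_shadow(const enum shadow_status *shadow, size_t n,
			  char *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		out[2 * i] = ' ';
		out[2 * i + 1] = shadow_letter(shadow[i]);
	}
	out[2 * n] = '\0';
}
```

Each letter lines up under the second hex digit of the byte it describes,
which is why the shadow line and the hex dump have the same width.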
  510 +In order to figure out where (relative to the start of the object) the
  511 +uninitialized memory was located, we have to look at the disassembly. For
  512 +that, we'll need the RIP address again:
  513 +
  514 +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
  515 +
  516 + $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8:
  517 + ffffffff8104edc8: mov %r8,0x8(%r8)
  518 + ffffffff8104edcc: test %r10d,%r10d
  519 + ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168>
  520 + ffffffff8104edd5: mov %rax,%rdx
  521 + ffffffff8104edd8: mov $0xc,%ecx
  522 + ffffffff8104eddd: mov %r13,%rdi
  523 + ffffffff8104ede0: mov $0x30,%eax
  524 + ffffffff8104ede5: mov %rdx,%rsi
  525 + ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi)
  526 + ffffffff8104edea: test $0x2,%al
  527 + ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0>
  528 + ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi)
  529 + ffffffff8104edf0: test $0x1,%al
  530 + ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5>
  531 + ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi)
  532 + ffffffff8104edf5: mov %r8,%rdi
  533 + ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free>
  534 +
  535 +As expected, it's the "rep movsl" instruction from the memcpy() that causes
  536 +the warning. We know that REP MOVSL uses the register RCX to count the
  537 +number of remaining iterations. By taking a look at the register dump
  538 +again (from the kmemcheck report), we can figure out how many bytes were left
  539 +to copy:
  540 +
  541 +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
  542 +
  543 +By looking at the disassembly, we also see that %ecx is being loaded with the
  544 +value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind
  545 +that this is the number of iterations, not bytes. And since this is a "long"
  546 +operation, we need to multiply by 4 to get the number of bytes. So this means
  547 +that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes
  548 +from the start of the object.
  549 +
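The offset calculation can be written down as a one-line helper. This is
just the arithmetic from the paragraph above in userspace C; the function
name is made up for the example.

```c
#include <stddef.h>

/*
 * Byte offset (from the start of the copy) at which a REP MOVS instruction
 * was interrupted.
 *
 * initial_count:   value loaded into the count register before the copy
 * remaining_count: value of the count register in the register dump
 * width:           bytes moved per iteration (4 for movsl)
 */
static size_t rep_movs_offset(size_t initial_count, size_t remaining_count,
			      size_t width)
{
	return width * (initial_count - remaining_count);
}
```

rep_movs_offset(0xc, 0x9, 4) gives 12. The register dump appears to agree:
the disassembly copies the source pointer from RDX into RSI just before the
copy, and in the dump RSI (ffff88003e5d6024) minus RDX (ffff88003e5d6018) is
also 0xc.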
  550 +We can now try to figure out which field of the "struct siginfo" was not
  551 +initialized. This is the beginning of the struct:
  552 +
  553 +40 typedef struct siginfo {
  554 +41 int si_signo;
  555 +42 int si_errno;
  556 +43 int si_code;
  557 +44
  558 +45 union {
  559 +..
  560 +92 } _sifields;
  561 +93 } siginfo_t;
  562 +
  563 +On 64-bit, the int is 4 bytes long, so it must be the union member that has
  564 +not been initialized. We can verify this using gdb:
  565 +
  566 + $ gdb vmlinux
  567 + ...
  568 + (gdb) p &((struct siginfo *) 0)->_sifields
  569 + $1 = (union {...} *) 0x10
  570 +
  571 +Actually, it seems that the union member is located at offset 0x10 -- which
  572 +means that gcc has inserted 4 bytes of padding between the members si_code
  573 +and _sifields. We can now get a fuller picture of the memory dump:
  574 +
  575 + _----------------------------=> si_code
  576 + / _--------------------=> (padding)
  577 + | / _------------=> _sifields(._kill._pid)
  578 + | | / _----=> _sifields(._kill._uid)
  579 + | | | /
  580 +-------|-------|-------|-------|
  581 +80000000000000000000000000000000000000000088ffff0000000000000000
  582 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
  583 +
  584 +This allows us to realize another important fact: si_code contains the value
  585 +0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are
  586 +really the number 0x00000080. With a bit of research, we find that this is
  587 +actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h:
  588 +
  589 +144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */
  590 +
  591 +This macro is used in exactly one place in the x86 kernel: In send_signal()
  592 +in kernel/signal.c:
  593 +
  594 +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
  595 +817 int group)
  596 +818 {
  597 +...
  598 +828 pending = group ? &t->signal->shared_pending : &t->pending;
  599 +...
  600 +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
  601 +852 (is_si_special(info) ||
  602 +853 info->si_code >= 0)));
  603 +854 if (q) {
  604 +855 list_add_tail(&q->list, &pending->list);
  605 +856 switch ((unsigned long) info) {
  606 +...
  607 +865 case (unsigned long) SEND_SIG_PRIV:
  608 +866 q->info.si_signo = sig;
  609 +867 q->info.si_errno = 0;
  610 +868 q->info.si_code = SI_KERNEL;
  611 +869 q->info.si_pid = 0;
  612 +870 q->info.si_uid = 0;
  613 +871 break;
  614 +...
  615 +890 }
  616 +
  617 +Not only does this match with the .si_code member, it also matches the place
  618 +we found earlier when looking for where siginfo_t objects are enqueued on the
  619 +"shared_pending" list.
  620 +
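The little-endian reading of si_code used earlier (the dump bytes
"80000000" standing for the number 0x00000080) can be checked with a small
userspace helper. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Assemble a 32-bit value from four bytes in the order they appear in the
 * memory dump. On a little-endian machine the first byte is the least
 * significant one.
 */
static uint32_t le32_from_bytes(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```

Feeding it the bytes 0x80, 0x00, 0x00, 0x00 gives 0x80, which is the value
of SI_KERNEL quoted above.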
  621 +So to sum up: It seems that it is the padding introduced by the compiler
  622 +between two struct fields that is uninitialized, and this gets reported when
  623 +we do a memcpy() on the struct. This means that we have identified a false
  624 +positive warning.
  625 +
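The padding can also be demonstrated without gdb. The sketch below uses a
hypothetical stand-in for the start of "struct siginfo" (three ints followed
by a union that contains a pointer) and computes where the padding sits. It
also shows one generic way to avoid uninitialized padding, namely zeroing
the whole object before assigning fields; this is an illustration only, not
a statement about how the kernel code was or should be changed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the beginning of struct siginfo. */
struct mock_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		struct { int pid; unsigned int uid; } kill;
		struct { void *addr; } fault;	/* pointer forces pointer alignment */
	} sifields;
};

/* Offset of the first byte after si_code, where padding (if any) begins. */
static size_t padding_start(void)
{
	return offsetof(struct mock_siginfo, si_code) + sizeof(int);
}

/* Number of padding bytes the compiler placed before the union. */
static size_t padding_size(void)
{
	return offsetof(struct mock_siginfo, sifields) - padding_start();
}

/* Zero the whole object, padding included, then assign the fields. */
static void init_kernel_signal(struct mock_siginfo *info, int sig)
{
	memset(info, 0, sizeof(*info));
	info->si_signo = sig;
	info->si_code = 0x80;	/* value of SI_KERNEL as quoted above */
}
```

On a 64-bit build, padding_start() is 12 and padding_size() is 4, matching
the 0x10 offset that gdb reported. On a 32-bit build the union only needs
4-byte alignment, so there is no padding at this position.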
  626 +Normally, kmemcheck will not report uninitialized accesses in memcpy() calls
  627 +when both the source and destination addresses are tracked. (Instead, we copy
  628 +the shadow bytemap as well). In this case, the destination address clearly
  629 +was not tracked. We can dig a little deeper into the stack trace from above:
  630 +
  631 + arch/x86/kernel/signal.c:805
  632 + arch/x86/kernel/signal.c:871
  633 + arch/x86/kernel/entry_64.S:694
  634 +
  635 +And we clearly see that the destination siginfo object is located on the
  636 +stack:
  637 +
  638 +782 static void do_signal(struct pt_regs *regs)
  639 +783 {
  640 +784 struct k_sigaction ka;
  641 +785 siginfo_t info;
  642 +...
  643 +804 signr = get_signal_to_deliver(&info, &ka, regs, NULL);
  644 +...
  645 +854 }
  646 +
  647 +And this &info is what eventually gets passed to copy_siginfo() as the
  648 +destination argument.
  649 +
  650 +Now, even though we didn't find an actual error here, the example is still a
  651 +good one, because it shows how one would go about finding out what the report
  652 +was all about.
  653 +
  654 +
  655 +3.4. Annotating false positives
  656 +===============================
  657 +
  658 +There are a few different ways to make annotations in the source code that
  659 +will keep kmemcheck from checking and reporting certain allocations. Here
  660 +they are:
  661 +
  662 + o __GFP_NOTRACK_FALSE_POSITIVE
  663 +
  664 + This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore
  665 + also to other functions that end up calling one of these) to indicate
  666 + that the allocation should not be tracked because it would lead to
  667 + a false positive report. This is a "big hammer" way of silencing
  668 + kmemcheck; after all, even if the false positive pertains to
  669 + a particular field in a struct, for example, we will now lose the
  670 + ability to find (real) errors in other parts of the same struct.
  671 +
  672 + Example:
  673 +
  674 + /* No warnings will ever trigger on accessing any part of x */
  675 + x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE);
  676 +
  677 + o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and
  678 + kmemcheck_annotate_bitfield(ptr, name)
  679 +
  680 + The first two of these three macros can be used inside struct
  681 + definitions to signal, respectively, the beginning and end of a
  682 + bitfield. Additionally, this will assign the bitfield a name, which
  683 + is given as an argument to the macros.
  684 +
  685 + Having used these markers, one can later use
  686 + kmemcheck_annotate_bitfield() at the point of allocation, to indicate
  687 + which parts of the allocation are part of a bitfield.
  688 +
  689 + Example:
  690 +
  691 + struct foo {
  692 + int x;
  693 +
  694 + kmemcheck_bitfield_begin(flags);
  695 + int flag_a:1;
  696 + int flag_b:1;
  697 + kmemcheck_bitfield_end(flags);
  698 +
  699 + int y;
  700 + };
  701 +
  702 + struct foo *x = kmalloc(sizeof *x, GFP_KERNEL);
  703 +
  704 + /* No warnings will trigger on accessing the bitfield of x */
  705 + kmemcheck_annotate_bitfield(x, flags);
  706 +
  707 + Note that kmemcheck_annotate_bitfield() can be used even before the
  708 + return value of kmalloc() is checked -- in other words, passing NULL
  709 + as the first argument is legal (and will do nothing).
  710 +
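To give a feeling for how begin/end markers can delimit a bitfield, here is
a userspace mock. It is NOT the kernel's definition of these macros: the
mock_ names are invented for this example, and the zero-length arrays used
as markers are a GNU C extension (accepted by gcc and clang).

```c
#include <stddef.h>

/* Zero-size marker members placed before and after the bitfield. */
#define mock_bitfield_begin(name)	int name##_begin[0]
#define mock_bitfield_end(name)		int name##_end[0]

struct foo {
	int x;

	mock_bitfield_begin(flags);
	int flag_a:1;
	int flag_b:1;
	mock_bitfield_end(flags);

	int y;
};

/*
 * Number of bytes between the two markers, i.e. the region an annotation
 * would have to mark as initialized. A NULL pointer yields 0, mirroring
 * the rule that annotating a failed allocation does nothing.
 */
#define mock_bitfield_span(ptr, name)					\
	((ptr) ? (size_t)((char *)&(ptr)->name##_end -			\
			  (char *)&(ptr)->name##_begin) : (size_t)0)
```

For a valid "struct foo" pointer, mock_bitfield_span(p, flags) covers the
storage unit holding flag_a and flag_b; for a NULL pointer it is 0.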
  711 +
  712 +4. Reporting errors
  713 +===================
  714 +
  715 +As we have seen, kmemcheck will produce false positive reports. Therefore, it
  716 +is not very wise to blindly post kmemcheck warnings to mailing lists and
  717 +maintainers. Instead, I encourage maintainers and developers to find errors
  718 +in their own code. If you get a warning, you can try to work around it, try
  719 +to figure out if it's a real error or not, or simply ignore it. Most
  720 +developers know their own code and will quickly and efficiently determine the
  721 +root cause of a kmemcheck report. This is therefore also the most efficient
  722 +way to work with kmemcheck.
  723 +
  724 +That said, we (the kmemcheck maintainers) will always be on the lookout for
  725 +false positives that we can annotate and silence. So whatever you find,
  726 +please drop us a note privately! Kernel configs and steps to reproduce (if
  727 +available) are of course a great help too.
  728 +
  729 +Happy hacking!
  730 +
  731 +
  732 +5. Technical description
  733 +========================
  734 +
  735 +kmemcheck works by marking memory pages non-present. This means that whenever
  736 +somebody attempts to access the page, a page fault is generated. The page
  737 +fault handler notices that the page was in fact only hidden, and so it calls
  738 +on the kmemcheck code to make further investigations.
  739 +
  740 +When the investigations are completed, kmemcheck "shows" the page by marking
  741 +it present (as it would be under normal circumstances). This way, the
  742 +interrupted code can continue as usual.
  743 +
  744 +But after the instruction has been executed, we should hide the page again, so
  745 +that we can catch the next access too! Now kmemcheck makes use of a debugging
  746 +feature of the processor, namely single-stepping. When the processor has
  747 +finished the one instruction that generated the memory access, a debug
  748 +exception is raised. From here, we simply hide the page again and continue
  749 +execution, this time with the single-stepping feature turned off.
  750 +
  751 +kmemcheck requires some assistance from the memory allocator in order to work.
  752 +The memory allocator needs to
  753 +
  754 + 1. Tell kmemcheck about newly allocated pages and pages that are about to
  755 + be freed. This allows kmemcheck to set up and tear down the shadow memory
  756 + for the pages in question. The shadow memory stores the status of each
  757 + byte in the allocation proper, e.g. whether it is initialized or
  758 + uninitialized.
  759 +
  760 + 2. Tell kmemcheck which parts of memory should be marked uninitialized.
  761 + There are actually a few more states, such as "not yet allocated" and
  762 + "recently freed".
  763 +
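The shadow-memory bookkeeping described in the list above can be modelled
in a few lines of userspace C. This toy tracks only the per-byte status; it
does not model page faults or single-stepping. All names are hypothetical,
and the partial_ok parameter is only a loose imitation of the behaviour
described for CONFIG_KMEMCHECK_PARTIAL_OK.

```c
#include <stddef.h>
#include <string.h>

enum toy_status { TOY_UNINITIALIZED, TOY_INITIALIZED };

#define TOY_SIZE 64

struct toy_object {
	unsigned char data[TOY_SIZE];
	unsigned char shadow[TOY_SIZE];	/* one status byte per data byte */
};

/* Allocation: every byte of the object starts out uninitialized. */
static void toy_alloc(struct toy_object *obj)
{
	memset(obj->shadow, TOY_UNINITIALIZED, sizeof(obj->shadow));
}

/*
 * A write marks the written bytes as initialized. The caller must keep
 * offset + size within TOY_SIZE.
 */
static void toy_write(struct toy_object *obj, size_t offset,
		      const void *src, size_t size)
{
	memcpy(obj->data + offset, src, size);
	memset(obj->shadow + offset, TOY_INITIALIZED, size);
}

/*
 * Check a read of size bytes at offset. Returns 1 if the read is accepted
 * and 0 if it would produce a warning. With partial_ok set, one
 * initialized byte is enough; otherwise every byte must be initialized.
 */
static int toy_read_ok(const struct toy_object *obj, size_t offset,
		       size_t size, int partial_ok)
{
	size_t i, initialized = 0;

	for (i = 0; i < size; i++)
		if (obj->shadow[offset + i] == TOY_INITIALIZED)
			initialized++;

	return partial_ok ? initialized > 0 : initialized == size;
}
```

After writing a 16-bit value at offset 0, a 32-bit read at offset 0 is
rejected in strict mode but accepted with partial_ok set, which is the
16-bit-field situation discussed in the configuration section.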
  764 +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
  765 +memory that can take page faults because of kmemcheck.
  766 +
  767 +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
  768 +request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags.
  769 +This does not prevent the page faults from occurring; instead, it marks the
  770 +object in question as being initialized so that no warnings will ever be
  771 +produced for this object.
  772 +
  773 +Currently, the SLAB and SLUB allocators are supported by kmemcheck.