Eric Lee / linux-smarc-t335x-v3.2

+GETTING STARTED WITH KMEMCHECK
+==============================
+
+Vegard Nossum <vegardno@ifi.uio.no>
+
+
+Contents
+========
+0. Introduction
+1. Downloading
+2. Configuring and compiling
+3. How to use
+3.1. Booting
+3.2. Run-time enable/disable
+3.3. Debugging
+3.4. Annotating false positives
+4. Reporting errors
+5. Technical description
+
+
+0. Introduction
+===============
+
+kmemcheck is a debugging feature for the Linux kernel. More specifically, it
+is a dynamic checker that detects and warns about some uses of uninitialized
+memory.
+
+Userspace programmers might be familiar with Valgrind's memcheck. The main
+difference between memcheck and kmemcheck is that memcheck works for userspace
+programs only, and kmemcheck works for the kernel only. The implementations
+are of course vastly different. Because of this, kmemcheck is not as accurate
+as memcheck, but it turns out to be good enough in practice to discover real
+programmer errors that the compiler is not able to find through static
+analysis.
+
+Enabling kmemcheck on a kernel will probably slow it down to the extent that
+the machine will not be usable for normal workloads such as an interactive
+desktop. kmemcheck will also cause the kernel to use about twice as much
+memory as normal. For this reason, kmemcheck is strictly a debugging feature.
+
+
+1. Downloading
+==============
+
+kmemcheck can only be downloaded using git. If you want to write patches
+against the current code, you should use the kmemcheck development branch of
+the tip tree. You can also use the linux-next tree, which includes the latest
+version of kmemcheck.
+
+Assuming that you've already cloned the linux-2.6.git repository, all you
+have to do is add the -tip tree as a remote, like this:
+
+    $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
+
+To actually download the tree, fetch the remote:
+
+    $ git fetch tip
+
+And to check out a new local branch with the kmemcheck code:
+
+    $ git checkout -b kmemcheck tip/kmemcheck
+
+General instructions for the -tip tree can be found here:
+http://people.redhat.com/mingo/tip.git/readme.txt
+
+
+2. Configuring and compiling
+============================
+
+kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of
+configuration variables must have specific settings in order for the kmemcheck
+menu to even appear in "menuconfig". These are:
+
+ o CONFIG_CC_OPTIMIZE_FOR_SIZE=n
+
+   This option is located under "General setup" / "Optimize for size".
+
+   When optimizing for size, gcc uses certain optimizations that usually
+   lead to false positive warnings from kmemcheck. An example of this is
+   a 16-bit field in a struct, where gcc may load 32 bits, then discard
+   the upper 16 bits. kmemcheck sees only the 32-bit load, and may
+   trigger a warning for the upper 16 bits (if they're uninitialized).
+
+ o CONFIG_SLAB=y or CONFIG_SLUB=y
+
+   This option is located under "General setup" / "Choose SLAB
+   allocator".
+
+ o CONFIG_FUNCTION_TRACER=n
+
+   This option is located under "Kernel hacking" / "Tracers" / "Kernel
+   Function Tracer".
+
+   When function tracing is compiled in, gcc emits a call to another
+   function at the beginning of every function. This means that when the
+   page fault handler is called, the ftrace framework will be called
+   before kmemcheck has had a chance to handle the fault. If ftrace then
+   modifies memory that was tracked by kmemcheck, the result is an
+   endless recursive page fault.
+
+ o CONFIG_DEBUG_PAGEALLOC=n
+
+   This option is located under "Kernel hacking" / "Debug page memory
+   allocations".
+
+In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also
+located under "Kernel hacking". With this, you will be able to get line number
+information from the kmemcheck warnings, which is extremely valuable in
+debugging a problem. This option is not mandatory, however, because it slows
+down the compilation process and produces a much bigger kernel image.
+
+Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck:
+trap use of uninitialized memory"). Here follows a description of the
+kmemcheck configuration variables:
+
+ o CONFIG_KMEMCHECK
+
+   This must be enabled in order to use kmemcheck at all.
+
+ o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT
+
+   This option controls the status of kmemcheck at boot-time. "Enabled"
+   will enable kmemcheck right from the start, "disabled" will boot the
+   kernel as normal (but with the kmemcheck code compiled in, so it can
+   be enabled at run-time after the kernel has booted), and "one-shot" is
+   a special mode which will turn kmemcheck off automatically after
+   detecting the first use of uninitialized memory.
+
+   If you are using kmemcheck to actively debug a problem, then you
+   probably want to choose "enabled" here.
+
+   The one-shot mode is mostly useful in automated test setups because it
+   can prevent floods of warnings and increase the chances of the machine
+   surviving in case something is really wrong. In other cases, the
+   one-shot mode could actually be counter-productive because it would
+   turn itself off at the very first error -- in the case of a false
+   positive too -- and this would get in the way of debugging the
+   specific problem you were interested in.
+
+   If you would like to use your kernel as normal, but with a chance to
+   enable kmemcheck in case of some problem, it might be a good idea to
+   choose "disabled" here. When kmemcheck is disabled, most of the
+   run-time overhead is not incurred, and the kernel will be almost as
+   fast as normal.
+
+ o CONFIG_KMEMCHECK_QUEUE_SIZE
+
+   Select the maximum number of error reports to store in an internal
+   (fixed-size) buffer. Since errors can occur virtually anywhere and in
+   any context, we need a temporary storage area which is guaranteed not
+   to generate any other page faults when accessed. The queue will be
+   emptied as soon as a tasklet may be scheduled. If the queue is full,
+   new error reports will be lost.
+
+   The default value of 64 is probably fine. If some code produces more
+   than 64 errors within an irqs-off section, then the code is likely to
+   produce many, many more, too, and these additional reports seldom give
+   any more information (the first report is usually the most valuable
+   anyway).
+
+   This number might have to be adjusted if you are not using a serial
+   console or similar to capture the kernel log. If you are using the
+   "dmesg" command to save the log, then getting a lot of kmemcheck
+   warnings might overflow the kernel log itself, and the earlier reports
+   will get lost in that way instead. Try setting this to 10 or so on
+   such a setup.
+
+ o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT
+
+   Select the number of shadow bytes to save along with each entry of the
+   error-report queue. These bytes indicate what parts of an allocation
+   are initialized, uninitialized, etc. and will be displayed when an
+   error is detected to help the debugging of a particular problem.
+
+   The number entered here is actually the base-2 logarithm of the number
+   of bytes that will be saved. So if you pick for example 5 here,
+   kmemcheck will save 2^5 = 32 bytes.
+
+   The default value should be fine for debugging most problems. It also
+   fits nicely within 80 columns.
+
+ o CONFIG_KMEMCHECK_PARTIAL_OK
+
+   This option (when enabled) works around certain GCC optimizations that
+   produce 32-bit reads from 16-bit variables where the upper 16 bits are
+   thrown away afterwards.
+
+   The default value (enabled) is recommended. This may of course hide
+   some real errors, but disabling it would probably produce a lot of
+   false positives.
+
+ o CONFIG_KMEMCHECK_BITOPS_OK
+
+   This option silences warnings that would be generated for bit-field
+   accesses where not all the bits are initialized at the same time. This
+   may also hide some real bugs.
+
+   This option is probably obsolete, or it should be replaced with
+   the kmemcheck-/bitfield-annotations for the code in question. The
+   default value is therefore fine.
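
The 16-bit-field false positive mentioned for CONFIG_CC_OPTIMIZE_FOR_SIZE and
CONFIG_KMEMCHECK_PARTIAL_OK can be illustrated with a small userspace sketch.
This is not kmemcheck code: the shadow array, the enum names and both check
functions are made up for illustration, and the lenient policy (accept a read
if at least one byte it covers is initialized) is only an assumption about
what such a workaround might look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-byte shadow states (names are illustrative only). */
enum shadow_state { SHADOW_UNINITIALIZED = 0, SHADOW_INITIALIZED = 1 };

/* Strict policy: every byte covered by the read must be initialized. */
int read_ok_strict(const unsigned char *shadow, size_t off, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (shadow[off + i] != SHADOW_INITIALIZED)
			return 0;
	return 1;
}

/* Lenient policy (an assumption): one initialized byte is enough. */
int read_ok_partial(const unsigned char *shadow, size_t off, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (shadow[off + i] == SHADOW_INITIALIZED)
			return 1;
	return 0;
}

/*
 * Model a 4-byte slot: a 16-bit field that has been written, followed by
 * 2 bytes that were never written. Returns a bitmask:
 *   bit 0: a 16-bit read passes the strict check
 *   bit 1: a 32-bit read passes the strict check
 *   bit 2: a 32-bit read passes the lenient check
 */
int demo_partial_load(void)
{
	unsigned char shadow[4] = {
		SHADOW_INITIALIZED, SHADOW_INITIALIZED,
		SHADOW_UNINITIALIZED, SHADOW_UNINITIALIZED,
	};
	int result = 0;

	if (read_ok_strict(shadow, 0, 2))
		result |= 1;
	if (read_ok_strict(shadow, 0, 4))
		result |= 2;
	if (read_ok_partial(shadow, 0, 4))
		result |= 4;
	return result;
}
```

In this model the 16-bit read is clean, the widened 32-bit read fails the
strict check (the false positive), and the lenient check accepts it at the
cost of also accepting reads that touch genuinely uninitialized bytes.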

+
+Now compile the kernel as usual.
+
+
+3. How to use
+=============
+
+3.1. Booting
+============
+
+First some information about the command-line options. There is only one
+option specific to kmemcheck, and this is called "kmemcheck". It can be used
+to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT
+option. Its possible settings are:
+
+ o kmemcheck=0 (disabled)
+ o kmemcheck=1 (enabled)
+ o kmemcheck=2 (one-shot mode)
+
+If SLUB debugging has been enabled in the kernel, it may take precedence over
+kmemcheck in such a way that the slab caches which are under SLUB debugging
+will not be tracked by kmemcheck. In order to ensure that this doesn't happen
+(even though it shouldn't by default), use SLUB's boot option "slub_debug",
+like this: slub_debug=-
+
+In fact, this option may also be used for fine-grained control over SLUB vs.
+kmemcheck. For example, if the command line includes "kmemcheck=1
+slub_debug=,dentry", then SLUB debugging will be used only for the "dentry"
+slab cache, and kmemcheck will track all the other caches. This is advanced
+usage, however, and is not generally recommended.
+
+
+3.2. Run-time enable/disable
+============================
+
+When the kernel has booted, it is possible to enable or disable kmemcheck at
+run-time. WARNING: This feature is still experimental and may cause false
+positive warnings to appear. Therefore, try not to use this. If you find that
+it doesn't work properly (e.g. you see an unreasonable number of warnings), I
+will be happy to take bug reports.
+
+Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.:
+
+    $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck
+
+The numbers are the same as for the kmemcheck= command-line option.
+
+
+3.3. Debugging
+==============
+
+A typical report will look something like this:
+
+WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
+80000000000000000000000000000000000000000088ffff0000000000000000
+ i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
+         ^
+
+Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A
+RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
+RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002
+RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
+RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84
+RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000
+R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e
+R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8
+FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000
+CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
+CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0
+DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
+DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
+ [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
+ [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
+ [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
+ [<ffffffff8100c7b5>] int_signal+0x12/0x17
+ [<ffffffffffffffff>] 0xffffffffffffffff
+
+The single most valuable piece of information in this report is the RIP (or
+EIP on 32-bit) value. This will help us pinpoint exactly which instruction
+caused the warning.
+
+If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do
+is give this address to the addr2line program, like this:
+
+    $ addr2line -e vmlinux -i ffffffff8104ede8
+    arch/x86/include/asm/string_64.h:12
+    include/asm-generic/siginfo.h:287
+    kernel/signal.c:380
+    kernel/signal.c:410
+
+The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must
+be the vmlinux of the kernel that produced the warning in the first place! If
+not, the line number information will almost certainly be wrong.
+
+The "-i" tells addr2line to also print the line numbers of inlined functions.
+In this case, the flag was very important, because otherwise, it would only
+have printed the first line, which is just a call to memcpy(), which could be
+called from a thousand places in the kernel, and is therefore not very useful.
+These inlined functions would not show up in the stack trace above, simply
+because the kernel doesn't load the extra debugging information. This
+technique can of course be used with ordinary kernel oopses as well.
+
+In this case, it's the caller of memcpy() that is interesting, and it can be
+found in include/asm-generic/siginfo.h, line 287:
+
+281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
+282 {
+283         if (from->si_code < 0)
+284                 memcpy(to, from, sizeof(*to));
+285         else
+286                 /* _sigchld is currently the largest know union member */
+287                 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
+288 }
+
+Since this was a read (kmemcheck usually warns about reads only, though it can
+warn about writes to unallocated or freed memory as well), it was probably the
+"from" argument which contained some uninitialized bytes. Following the chain
+of calls, we move upwards to see where "from" was allocated or initialized,
+kernel/signal.c, line 380:
+
+359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+360 {
+...
+367         list_for_each_entry(q, &list->list, list) {
+368                 if (q->info.si_signo == sig) {
+369                         if (first)
+370                                 goto still_pending;
+371                         first = q;
+...
+377         if (first) {
+378 still_pending:
+379                 list_del_init(&first->list);
+380                 copy_siginfo(info, &first->info);
+381                 __sigqueue_free(first);
+...
+392         }
+393 }
+
+Here, it is &first->info that is being passed on to copy_siginfo(). The
+variable "first" was found on a list -- passed in as the second argument to
+collect_signal(). We continue our journey through the stack, to figure out
+where the item on "list" was allocated or initialized. We move to line 410:
+
+395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
+396                         siginfo_t *info)
+397 {
+...
+410                 collect_signal(sig, pending, info);
+...
+414 }
+
+Now we need to follow the "pending" pointer, since that is being passed on to
+collect_signal() as "list". At this point, we've run out of lines from the
+"addr2line" output. Not to worry, we just paste the next addresses from the
+kmemcheck stack dump, i.e.:
+
+ [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
+ [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
+ [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
+ [<ffffffff8100c7b5>] int_signal+0x12/0x17
+
+    $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \
+        ffffffff8100b87d ffffffff8100c7b5
+    kernel/signal.c:446
+    kernel/signal.c:1806
+    arch/x86/kernel/signal.c:805
+    arch/x86/kernel/signal.c:871
+    arch/x86/kernel/entry_64.S:694
+
+Remember that since these addresses were found on the stack and not as the
+RIP value, they actually point to the _next_ instruction (they are return
+addresses). This becomes obvious when we look at the code for line 446:
+
+422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+423 {
+...
+431                 signr = __dequeue_signal(&tsk->signal->shared_pending,
+432                                          mask, info);
+433                 /*
+434                  * itimer signal ?
+435                  *
+436                  * itimers are process shared and we restart periodic
+437                  * itimers in the signal delivery path to prevent DoS
+438                  * attacks in the high resolution timer case. This is
+439                  * compliant with the old way of self restarting
+440                  * itimers, as the SIGALRM is a legacy signal and only
+441                  * queued once. Changing the restart behaviour to
+442                  * restart the timer in the signal dequeue path is
+443                  * reducing the timer noise on heavy loaded !highres
+444                  * systems too.
+445                  */
+446                 if (unlikely(signr == SIGALRM)) {
+...
+489 }
+
+So instead of looking at 446, we should be looking at 431, which is the line
+that executes just before 446. Here we see that what we are looking for is
+&tsk->signal->shared_pending.
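
The off-by-one-line effect above is a general property of return addresses.
A common workaround when symbolizing a stack trace is to look up the address
one byte before the return address, so that the lookup lands inside the call
instruction itself. The helper below is a hypothetical illustration of that
idea; it is not something kmemcheck or addr2line provides.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose the address to hand to a symbolizer.
 * The faulting instruction pointer (RIP/EIP) is used unchanged. A return
 * address taken from the stack is moved back by one byte, so that it falls
 * inside the call instruction rather than on the instruction after it.
 */
uint64_t lookup_address(uint64_t addr, int is_return_address)
{
	if (is_return_address && addr != 0)
		return addr - 1;
	return addr;
}
```

For the first stack entry in the report, lookup_address(0xffffffff8104f04e, 1)
gives 0xffffffff8104f04d, which lies within the call to __dequeue_signal()
instead of at the start of the statement on line 446.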

+
+Our next task is now to figure out which function puts items on this
+"shared_pending" list. A crude but efficient tool is git grep:
+
+    $ git grep -n 'shared_pending' kernel/
+    ...
+    kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending;
+    kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending;
+    ...
+
+There were more results, but none of them were related to list operations,
+and these were the only assignments. We inspect the line numbers more closely
+and find that this is indeed where items are being added to the list:
+
+816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
+817                         int group)
+818 {
+...
+828         pending = group ? &t->signal->shared_pending : &t->pending;
+...
+851         q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
+852                                              (is_si_special(info) ||
+853                                               info->si_code >= 0)));
+854         if (q) {
+855                 list_add_tail(&q->list, &pending->list);
+...
+890 }
+
+and:
+
+1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
+1310 {
+...
+1339         pending = group ? &t->signal->shared_pending : &t->pending;
+1340         list_add_tail(&q->list, &pending->list);
+...
+1347 }
+
+In the first case, the list element we are looking for, "q", is being returned
+from the function __sigqueue_alloc(), which looks like an allocation function.
+Let's take a look at it:
+
+187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags,
+188                                          int override_rlimit)
+189 {
+190         struct sigqueue *q = NULL;
+191         struct user_struct *user;
+192
+193         /*
+194          * We won't get problems with the target's UID changing under us
+195          * because changing it requires RCU be used, and if t != current, the
+196          * caller must be holding the RCU readlock (by way of a spinlock) and
+197          * we use RCU protection here
+198          */
+199         user = get_uid(__task_cred(t)->user);
+200         atomic_inc(&user->sigpending);
+201         if (override_rlimit ||
+202             atomic_read(&user->sigpending) <=
+203                         t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur)
+204                 q = kmem_cache_alloc(sigqueue_cachep, flags);
+205         if (unlikely(q == NULL)) {
+206                 atomic_dec(&user->sigpending);
+207                 free_uid(user);
+208         } else {
+209                 INIT_LIST_HEAD(&q->list);
+210                 q->flags = 0;
+211                 q->user = user;
+212         }
+213
+214         return q;
+215 }
+
+We see that this function initializes q->list, q->flags, and q->user. It seems
+that now is the time to look at the definition of "struct sigqueue", e.g.:
+
+14 struct sigqueue {
+15         struct list_head list;
+16         int flags;
+17         siginfo_t info;
+18         struct user_struct *user;
+19 };
+
+And, you might remember, it was a memcpy() on &first->info that caused the
+warning, so this makes perfect sense. It also seems reasonable to assume that
+it is the caller of __sigqueue_alloc() that has the responsibility of filling
+out (initializing) this member.
+
+But just which fields of the struct were uninitialized? Let's look at
+kmemcheck's report again:
+
+WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
+80000000000000000000000000000000000000000088ffff0000000000000000
+ i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
+         ^
+
+The two lines following the warning are the memory dump of the memory object
+itself and the shadow bytemap, respectively. The memory object itself is in
+this case &first->info. Just beware that the start of this dump is NOT the
+start of the object itself! The position of the caret (^) corresponds with
+the address of the read (ffff88003e4a2024).
+
+The shadow bytemap dump legend is as follows:
+
+    i - initialized
+    u - uninitialized
+    a - unallocated (memory has been allocated by the slab layer, but has not
+        yet been handed off to anybody)
+    f - freed (memory has been allocated by the slab layer, but has been freed
+        by the previous owner)
+
+In order to figure out where (relative to the start of the object) the
+uninitialized memory was located, we have to look at the disassembly. For
+that, we'll need the RIP address again:
+
+RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
+
+    $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8:
+    ffffffff8104edc8:  mov    %r8,0x8(%r8)
+    ffffffff8104edcc:  test   %r10d,%r10d
+    ffffffff8104edcf:  js     ffffffff8104ee88 <__dequeue_signal+0x168>
+    ffffffff8104edd5:  mov    %rax,%rdx
+    ffffffff8104edd8:  mov    $0xc,%ecx
+    ffffffff8104eddd:  mov    %r13,%rdi
+    ffffffff8104ede0:  mov    $0x30,%eax
+    ffffffff8104ede5:  mov    %rdx,%rsi
+    ffffffff8104ede8:  rep movsl %ds:(%rsi),%es:(%rdi)
+    ffffffff8104edea:  test   $0x2,%al
+    ffffffff8104edec:  je     ffffffff8104edf0 <__dequeue_signal+0xd0>
+    ffffffff8104edee:  movsw  %ds:(%rsi),%es:(%rdi)
+    ffffffff8104edf0:  test   $0x1,%al
+    ffffffff8104edf2:  je     ffffffff8104edf5 <__dequeue_signal+0xd5>
+    ffffffff8104edf4:  movsb  %ds:(%rsi),%es:(%rdi)
+    ffffffff8104edf5:  mov    %r8,%rdi
+    ffffffff8104edf8:  callq  ffffffff8104de60 <__sigqueue_free>
+
+As expected, it's the "rep movsl" instruction from the memcpy() that causes
+the warning. We know about REP MOVSL that it uses the register RCX to count
+the number of remaining iterations. By taking a look at the register dump
+again (from the kmemcheck report), we can figure out how many iterations were
+left when the warning triggered:
+
+RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
+
+By looking at the disassembly, we also see that %ecx is being loaded with the
+value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind
+that this is the number of iterations, not bytes. And since this is a "long"
+operation, we need to multiply by 4 to get the number of bytes. So this means
+that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes
+from the start of the object.
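
The arithmetic above can be written down as a small helper. This is only a
sketch for checking the numbers by hand; the function name and its parameters
are made up for this example and do not exist in kmemcheck.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte offset into the copy at which a "rep movs" was interrupted.
 *   initial_count:   value loaded into the count register before the copy
 *   remaining_count: value of the count register in the register dump
 *   element_size:    bytes moved per iteration (4 for movsl)
 */
uint64_t rep_movs_fault_offset(uint64_t initial_count,
			       uint64_t remaining_count,
			       uint64_t element_size)
{
	return (initial_count - remaining_count) * element_size;
}
```

With the values from the report, rep_movs_fault_offset(0xc, 0x9, 4) is 12.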

+
+We can now try to figure out which field of the "struct siginfo" was not
+initialized. This is the beginning of the struct:
+
+40 typedef struct siginfo {
+41         int si_signo;
+42         int si_errno;
+43         int si_code;
+44
+45         union {
+..
+92         } _sifields;
+93 } siginfo_t;
+
+On 64-bit, the int is 4 bytes long, so the three int members end at offset
+12, and it must be the union member that has not been initialized. We can
+verify this using gdb:
+
+    $ gdb vmlinux
+    ...
+    (gdb) p &((struct siginfo *) 0)->_sifields
+    $1 = (union {...} *) 0x10
+
+Actually, it seems that the union member is located at offset 0x10 -- which
+means that gcc has inserted 4 bytes of padding between the members si_code
+and _sifields. We can now get a fuller picture of the memory dump:
+
+        _----------------------------=> si_code
+       /        _--------------------=> (padding)
+       |       /        _------------=> _sifields(._kill._pid)
+       |       |       /        _----=> _sifields(._kill._uid)
+       |       |       |       /
+-------|-------|-------|-------|
+80000000000000000000000000000000000000000088ffff0000000000000000
+ i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
+
+This allows us to realize another important fact: si_code contains the value
+0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are
+really the number 0x00000080. With a bit of research, we find that this is
+actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h:
+
+144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */
+
+This macro is used in exactly one place in the x86 kernel: in send_signal()
+in kernel/signal.c:
+
+816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
+817                         int group)
+818 {
+...
+828         pending = group ? &t->signal->shared_pending : &t->pending;
+...
+851         q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
+852                                              (is_si_special(info) ||
+853                                               info->si_code >= 0)));
+854         if (q) {
+855                 list_add_tail(&q->list, &pending->list);
+856                 switch ((unsigned long) info) {
+...
+865                 case (unsigned long) SEND_SIG_PRIV:
+866                         q->info.si_signo = sig;
+867                         q->info.si_errno = 0;
+868                         q->info.si_code = SI_KERNEL;
+869                         q->info.si_pid = 0;
+870                         q->info.si_uid = 0;
+871                         break;
+...
+890 }
+
+Not only does this match with the .si_code member, it also matches the place
+we found earlier when looking for where siginfo_t objects are enqueued on the
+"shared_pending" list.
+
+So to sum up: It seems that it is the padding introduced by the compiler
+between two struct fields that is uninitialized, and this gets reported when
+we do a memcpy() on the struct. This means that we have identified a false
+positive warning.
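
The padding can also be observed in userspace with offsetof(). The struct
below is a mock that only mimics the shape described above (three ints
followed by a union that contains a pointer); it is NOT the kernel's
struct siginfo, and all names in it are made up for this sketch. The exact
numbers depend on the ABI of the machine it is compiled on.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mock layout, not the kernel's struct siginfo: three ints followed by a
 * union holding a pointer, which gives the union pointer alignment.
 */
struct mock_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		struct { int pid; int uid; } kill;
		void *ptr;
	} sifields;
};

/* Offset of the union from the start of the struct. */
size_t mock_sifields_offset(void)
{
	return offsetof(struct mock_siginfo, sifields);
}

/* Bytes of compiler-inserted padding between si_code and the union. */
size_t mock_padding_bytes(void)
{
	size_t end_of_si_code =
		offsetof(struct mock_siginfo, si_code) + sizeof(int);

	return mock_sifields_offset() - end_of_si_code;
}
```

On a typical 64-bit x86 build, mock_sifields_offset() is 16 (0x10) and
mock_padding_bytes() is 4, the same gap that gdb showed. Assigning the
members one by one never writes those 4 bytes, while a memcpy() of the whole
struct reads them.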

+
+Normally, kmemcheck will not report uninitialized accesses in memcpy() calls
+when both the source and destination addresses are tracked. (Instead, we copy
+the shadow bytemap as well). In this case, the destination address clearly
+was not tracked. We can dig a little deeper into the stack trace from above:
+
+    arch/x86/kernel/signal.c:805
+    arch/x86/kernel/signal.c:871
+    arch/x86/kernel/entry_64.S:694
+
+And we clearly see that the destination siginfo object is located on the
+stack:
+
+782 static void do_signal(struct pt_regs *regs)
+783 {
+784         struct k_sigaction ka;
+785         siginfo_t info;
+...
+804         signr = get_signal_to_deliver(&info, &ka, regs, NULL);
+...
+854 }
+
+And this &info is what eventually gets passed to copy_siginfo() as the
+destination argument.
+
+Now, even though we didn't find an actual error here, the example is still a
+good one, because it shows how one would go about finding out what the report
+was all about.
+
+
+3.4. Annotating false positives
+===============================
+
+There are a few different ways to make annotations in the source code that
+will keep kmemcheck from checking and reporting certain allocations. Here
+they are:
+
+ o __GFP_NOTRACK_FALSE_POSITIVE
+
+   This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore
+   also to other functions that end up calling one of these) to indicate
+   that the allocation should not be tracked because it would lead to
+   a false positive report. This is a "big hammer" way of silencing
+   kmemcheck; after all, even if the false positive pertains to a
+   particular field in a struct, for example, we will now lose the
+   ability to find (real) errors in other parts of the same struct.
+
+   Example:
+
+       /* No warnings will ever trigger on accessing any part of x */
+       x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE);
+
+ o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and
+   kmemcheck_annotate_bitfield(ptr, name)
+
+   The first two of these three macros can be used inside struct
+   definitions to signal, respectively, the beginning and end of a
+   bitfield. Additionally, this will assign the bitfield a name, which
+   is given as an argument to the macros.
+
+   Having used these markers, one can later use
+   kmemcheck_annotate_bitfield() at the point of allocation, to indicate
+   which parts of the allocation are part of a bitfield.
+
+   Example:
+
+       struct foo {
+               int x;
+
+               kmemcheck_bitfield_begin(flags);
+               int flag_a:1;
+               int flag_b:1;
+               kmemcheck_bitfield_end(flags);
+
+               int y;
+       };
+
+       struct foo *x = kmalloc(sizeof *x, GFP_KERNEL);
+
+       /* No warnings will trigger on accessing the bitfield of x */
+       kmemcheck_annotate_bitfield(x, flags);
+
+   Note that kmemcheck_annotate_bitfield() can be used even before the
+   return value of kmalloc() is checked -- in other words, passing NULL
+   as the first argument is legal (and will do nothing).
+
+
+4. Reporting errors
+===================
+
+As we have seen, kmemcheck will produce false positive reports. Therefore, it
+is not very wise to blindly post kmemcheck warnings to mailing lists and
+maintainers. Instead, I encourage maintainers and developers to find errors
+in their own code. If you get a warning, you can try to work around it, try
+to figure out if it's a real error or not, or simply ignore it. Most
+developers know their own code and will quickly and efficiently determine the
+root cause of a kmemcheck report. This is therefore also the most efficient
+way to work with kmemcheck.
+
+That said, we (the kmemcheck maintainers) will always be on the lookout for
+false positives that we can annotate and silence. So whatever you find,
+please drop us a note privately! Kernel configs and steps to reproduce (if
+available) are of course a great help too.
+
+Happy hacking!
+
+
+5. Technical description
+========================
+
+kmemcheck works by marking memory pages non-present. This means that whenever
+somebody attempts to access the page, a page fault is generated. The page
+fault handler notices that the page was in fact only hidden, and so it calls
+on the kmemcheck code to make further investigations.
+
+When the investigations are completed, kmemcheck "shows" the page by marking
+it present (as it would be under normal circumstances). This way, the
+interrupted code can continue as usual.
+
+But after the instruction has been executed, we should hide the page again, so
+that we can catch the next access too! For this, kmemcheck makes use of a
+debugging feature of the processor, namely single-stepping. When the processor
+has finished the one instruction that generated the memory access, a debug
+exception is raised. From here, we simply hide the page again and continue
+execution, this time with the single-stepping feature turned off.
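
The hide/show cycle described above can be modelled as a tiny state machine.
This is a toy userspace model, not kernel code: the struct, its fields and
the three functions are invented for illustration, and real page table
manipulation and exception handling are far more involved.

```c
#include <assert.h>

/* Toy model of one tracked page (all names are illustrative only). */
struct toy_page {
	int present;      /* 1 = visible to the CPU, 0 = hidden */
	int single_step;  /* 1 = debug exception expected after next insn */
	int faults;       /* page faults taken so far */
};

/* Page fault path: inspect the access, show the page, arm single-step. */
void toy_page_fault(struct toy_page *p)
{
	p->faults++;
	p->present = 1;
	p->single_step = 1;
}

/* Debug exception path: hide the page again, stop single-stepping. */
void toy_debug_exception(struct toy_page *p)
{
	p->present = 0;
	p->single_step = 0;
}

/* One memory access performed by the interrupted code. */
void toy_access(struct toy_page *p)
{
	if (!p->present)
		toy_page_fault(p);       /* hidden page: fault first */
	/* ...the instruction now executes against the visible page... */
	if (p->single_step)
		toy_debug_exception(p);  /* trap after exactly one insn */
}
```

Every call to toy_access() on a hidden page takes exactly one fault and leaves
the page hidden again, which is why each individual access can be checked.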

+
+kmemcheck requires some assistance from the memory allocator in order to work.
+The memory allocator needs to:
+
+ 1. Tell kmemcheck about newly allocated pages and pages that are about to
+    be freed. This allows kmemcheck to set up and tear down the shadow memory
+    for the pages in question. The shadow memory stores the status of each
+    byte in the allocation proper, e.g. whether it is initialized or
+    uninitialized.
+
+ 2. Tell kmemcheck which parts of memory should be marked uninitialized.
+    There are actually a few more states, such as "not yet allocated" and
+    "recently freed".
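
The per-byte shadow states can be sketched in userspace as follows. The
state characters are borrowed from the report legend (i, u, a, f); everything
else (type names, the fixed object size, the functions) is hypothetical and
only meant to show how a write changes the shadow and how a read is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative shadow states, one per byte of tracked memory. */
enum toy_shadow {
	TOY_UNALLOCATED   = 'a',
	TOY_UNINITIALIZED = 'u',
	TOY_INITIALIZED   = 'i',
	TOY_FREED         = 'f',
};

#define TOY_SIZE 8

struct toy_object {
	unsigned char data[TOY_SIZE];
	char shadow[TOY_SIZE + 1];  /* legend characters, NUL-terminated */
};

/* Set every shadow byte of the object to one state. */
void toy_mark(struct toy_object *o, enum toy_shadow state)
{
	memset(o->shadow, state, TOY_SIZE);
	o->shadow[TOY_SIZE] = '\0';
}

/* A write initializes exactly the bytes it touches (off + len <= TOY_SIZE). */
void toy_write(struct toy_object *o, size_t off, const void *src, size_t len)
{
	memcpy(o->data + off, src, len);
	memset(o->shadow + off, TOY_INITIALIZED, len);
}

/* A read is clean only if every byte it covers is initialized. */
int toy_read_ok(const struct toy_object *o, size_t off, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (o->shadow[off + i] != TOY_INITIALIZED)
			return 0;
	return 1;
}
```

An allocator hook would call toy_mark() with TOY_UNINITIALIZED when handing
out the object and with TOY_FREED when it is released; after a 4-byte write
at offset 0 the shadow string reads "iiiiuuuu".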

+
+If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
+memory that can take page faults because of kmemcheck.
+
+If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
+request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags.
+This does not prevent the page faults from occurring, however, but marks the
+object in question as being initialized so that no warnings will ever be
+produced for this object.
+
+Currently, the SLAB and SLUB allocators are supported by kmemcheck.

kmemcheck: add the kmemcheck documentation