Blame view

Documentation/BUG-HUNTING 8.13 KB
43019a56a   Ian McDonald   Documentation: Up...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
  Table of contents
  =================
  
  Last updated: 20 December 2005
  
  Contents
  ========
  
  - Introduction
  - Devices not appearing
  - Finding patch that caused a bug
  -- Finding using git-bisect
  -- Finding it the old way
  - Fixing the bug
  
  Introduction
  ============
  
  Always try the latest kernel from kernel.org and build from source. If you are
  not confident in doing that please report the bug to your distribution vendor
  instead of to a kernel developer.
  
  Finding bugs is not always easy. Have a go though. If you can't find it don't
  give up. Report as much as you have found to the relevant maintainer. See
  MAINTAINERS for who that is for the subsystem you have worked on.
  
  Before you submit a bug report read REPORTING-BUGS.
  
  Devices not appearing
  =====================
  
  Often this is caused by udev. Check that first before blaming it on the
  kernel.
  
  Finding patch that caused a bug
  ===============================
  
  
  
  Finding using git-bisect
  ------------------------
  
  Using the provided tools with git makes finding bugs easy provided the bug is
  reproducible.
  
  Steps to do it:
  - start using git for the kernel source
  - read the man page for git-bisect
  - have fun
  
  Finding it the old way
  ----------------------
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
53
  [Sat Mar  2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]
d81919c9c   Clemens Koller   Documentation/BUG...
54
  This is how to track down a bug if you know nothing about kernel hacking.
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
55
56
57
58
59
60
61
62
63
64
65
66
  It's a brute force approach but it works pretty well.
  
  You need:
  
          . A reproducible bug - it has to happen predictably (sorry)
          . All the kernel tar files from a revision that worked to the
            revision that doesn't
  
  You will then do:
  
          . Rebuild a revision that you believe works, install, and verify that.
          . Do a binary search over the kernels to figure out which one
d81919c9c   Clemens Koller   Documentation/BUG...
67
            introduced the bug.  I.e., suppose 1.3.28 didn't have the bug, but
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
68
69
70
71
            you know that 1.3.69 does.  Pick a kernel in the middle and build
            that, like 1.3.50.  Build & test; if it works, pick the mid point
            between .50 and .69, else the mid point between .28 and .50.
          . You'll narrow it down to the kernel that introduced the bug.  You
d81919c9c   Clemens Koller   Documentation/BUG...
72
            can probably do better than this but it gets tricky.
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
73
74
75
76
77
78
79
80
81
  
          . Narrow it down to a subdirectory
  
            - Copy kernel that works into "test".  Let's say that 3.62 works,
              but 3.63 doesn't.  So you diff -r those two kernels and come
              up with a list of directories that changed.  For each of those
              directories:
  
                  Copy the non-working directory next to the working directory
d81919c9c   Clemens Koller   Documentation/BUG...
82
                  as "dir.63".
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
83
                  One directory at time, try moving the working directory to
d81919c9c   Clemens Koller   Documentation/BUG...
84
                  "dir.62" and mv dir.63 dir"time, try
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
85
86
87
88
89
90
  
                          mv dir dir.62
                          mv dir.63 dir
                          find dir -name '*.[oa]' -print | xargs rm -f
  
                  And then rebuild and retest.  Assuming that all related
d81919c9c   Clemens Koller   Documentation/BUG...
91
92
                  changes were contained in the sub directory, this should
                  isolate the change to a directory.
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
93
94
  
                  Problems: changes in header files may have occurred; I've
d81919c9c   Clemens Koller   Documentation/BUG...
95
                  found in my case that they were self explanatory - you may
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
96
97
98
99
100
                  or may not want to give up when that happens.
  
          . Narrow it down to a file
  
            - You can apply the same technique to each file in the directory,
d81919c9c   Clemens Koller   Documentation/BUG...
101
              hoping that the changes in that file are self contained.
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
          . Narrow it down to a routine
  
            - You can take the old file and the new file and manually create
              a merged file that has
  
                  #ifdef VER62
                  routine()
                  {
                          ...
                  }
                  #else
                  routine()
                  {
                          ...
                  }
                  #endif
  
              And then walk through that file, one routine at a time and
              prefix it with
  
                  #define VER62
                  /* both routines here */
                  #undef VER62
  
              Then recompile, retest, move the ifdefs until you find the one
              that makes the difference.
  
  Finally, you take all the info that you have, kernel revisions, bug
d81919c9c   Clemens Koller   Documentation/BUG...
130
  description, the extent to which you have narrowed it down, and pass
1da177e4c   Linus Torvalds   Linux-2.6.12-rc2
131
132
133
134
135
136
137
138
139
140
141
  that off to whomever you believe is the maintainer of that section.
  A post to linux.dev.kernel isn't such a bad idea if you've done some
  work to narrow it down.
  
  If you get it down to a routine, you'll probably get a fix in 24 hours.
  
  My apologies to Linus and the other kernel hackers for describing this
  brute force approach, it's hardly what a kernel hacker would do.  However,
  it does work and it lets non-hackers help fix bugs.  And it is cool
  because Linux snapshots will let you do this - something that you can't
  do with vendor supplied releases.
43019a56a   Ian McDonald   Documentation: Up...
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
  Fixing the bug
  ==============
  
  Nobody is going to tell you how to fix bugs. Seriously. You need to work it
  out. But below are some hints on how to use the tools.
  
  To debug a kernel, use objdump and look for the hex offset from the crash
  output to find the valid line of code/assembler. Without debug symbols, you
  will see the assembler code for the routine shown, but if your kernel has
  debug symbols the C code will also be available. (Debug symbols can be enabled
  in the kernel hacking menu of the menu configuration.) For example:
  
      objdump -r -S -l --disassemble net/dccp/ipv4.o
  
  NB.: you need to be at the top level of the kernel tree for this to pick up
  your C files.
  
  If you don't have access to the code you can also debug on some crash dumps
  e.g. crash dump output as shown by Dave Miller.
  
  >    EIP is at ip_queue_xmit+0x14/0x4c0
  >     ...
  >    Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
  >    00 00 55 57  56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
  >    <8b> 83 3c 01 00 00 89 44  24 14 8b 45 28 85 c0 89 44 24 18 0f 85
  >
  >    Put the bytes into a "foo.s" file like this:
  >
  >           .text
  >           .globl foo
  >    foo:
  >           .byte  .... /* bytes from Code: part of OOPS dump */
  >
  >    Compile it with "gcc -c -o foo.o foo.s" then look at the output of
  >    "objdump --disassemble foo.o".
  >
  >    Output:
  >
  >    ip_queue_xmit:
  >        push       %ebp
  >        push       %edi
  >        push       %esi
  >        push       %ebx
  >        sub        $0xbc, %esp
  >        mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
  >        mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
  >        mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt
926b28984   Pekka Enberg   Documentation: Ho...
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
  In addition, you can use GDB to figure out the exact file and line
  number of the OOPS from the vmlinux file. If you have
  CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the
  OOPS:
  
   EIP:    0060:[<c021e50e>]    Not tainted VLI
  
  And use GDB to translate that to human-readable form:
  
    gdb vmlinux
    (gdb) l *0xc021e50e
  
  If you don't have CONFIG_DEBUG_INFO enabled, you use the function
  offset from the OOPS:
  
   EIP is at vt_ioctl+0xda8/0x1482
  
  And recompile the kernel with CONFIG_DEBUG_INFO enabled:
  
    make vmlinux
    gdb vmlinux
    (gdb) p vt_ioctl
    (gdb) l *(0x<address of vt_ioctl> + 0xda8)
dcc85cb61   Richard Kennedy   Documentation: ad...
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
  or, as one command
    (gdb) l *(vt_ioctl + 0xda8)
  
  If you have a call trace, such as :-
  >Call Trace:
  > [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
  > [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
  > [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
  > ...
  this shows the problem in the :jbd: module. You can load that module in gdb
  and list the relevant code.
    gdb fs/jbd/jbd.ko
    (gdb) p log_wait_commit
    (gdb) l *(0x<address> + 0xa3)
  or
    (gdb) l *(log_wait_commit + 0xa3)
926b28984   Pekka Enberg   Documentation: Ho...
228

43019a56a   Ian McDonald   Documentation: Up...
229
230
231
232
233
234
235
236
237
238
239
240
  Another very useful option of the Kernel Hacking section in menuconfig is
  Debug memory allocations. This will help you see whether data has been
  initialised and not set before use etc. To see the values that get assigned
  with this look at mm/slab.c and search for POISON_INUSE. When using this an
  Oops will often show the poisoned data instead of zero which is the default.
  
  Once you have worked out a fix please submit it upstream. After all open
  source is about sharing what you do and don't you want to be recognised for
  your genius?
  
  Please do read Documentation/SubmittingPatches though to help your code get
  accepted.