Commit 2940b26bec9fe5bf183c994678e62b55d35717e6
Committed by David S. Miller
1 parent b9c32fb271
packet: doc: update timestamping part
Bring the timestamping section in sync with the implementation.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Showing 1 changed file with 35 additions and 6 deletions
Documentation/networking/packet_mmap.txt
1 | -------------------------------------------------------------------------------- | 1 | -------------------------------------------------------------------------------- |
2 | + ABSTRACT | 2 | + ABSTRACT |
3 | -------------------------------------------------------------------------------- | 3 | -------------------------------------------------------------------------------- |
4 | 4 | ||
5 | This file documents the mmap() facility available with the PACKET | 5 | This file documents the mmap() facility available with the PACKET |
6 | socket interface on 2.4/2.6/3.x kernels. This type of socket is used to | 6 | socket interface on 2.4/2.6/3.x kernels. This type of socket is used to |
7 | i) capture network traffic with utilities like tcpdump, ii) transmit network | 7 | i) capture network traffic with utilities like tcpdump, ii) transmit network |
8 | traffic, or for any other purpose that needs raw access to a network interface. | 8 | traffic, or for any other purpose that needs raw access to a network interface. |
9 | 9 | ||
10 | You can find the latest version of this document at: | 10 | You can find the latest version of this document at: |
11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap | 11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap |
12 | 12 | ||
13 | Howto can be found at: | 13 | Howto can be found at: |
14 | http://wiki.gnu-log.net (packet_mmap) | 14 | http://wiki.gnu-log.net (packet_mmap) |
15 | 15 | ||
16 | Please send your comments to | 16 | Please send your comments to |
17 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> | 17 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> |
18 | Johann Baudy <johann.baudy@gnu-log.net> | 18 | Johann Baudy <johann.baudy@gnu-log.net> |
19 | 19 | ||
20 | ------------------------------------------------------------------------------- | 20 | ------------------------------------------------------------------------------- |
21 | + Why use PACKET_MMAP | 21 | + Why use PACKET_MMAP |
22 | -------------------------------------------------------------------------------- | 22 | -------------------------------------------------------------------------------- |
23 | 23 | ||
24 | In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very | 24 | In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very |
25 | inefficient. It uses very limited buffers and requires one system call to | 25 | inefficient. It uses very limited buffers and requires one system call to |
26 | capture each packet; it requires two if you want to get the packet's timestamp | 26 | capture each packet; it requires two if you want to get the packet's timestamp |
27 | (like libpcap always does). | 27 | (like libpcap always does). |
28 | 28 | ||
29 | On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a size | 29 | On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a size |
30 | configurable circular buffer mapped in user space that can be used to either | 30 | configurable circular buffer mapped in user space that can be used to either |
31 | send or receive packets. This way, reading packets just means waiting for them; | 31 | send or receive packets. This way, reading packets just means waiting for them; |
32 | most of the time there is no need to issue a single system call. Concerning | 32 | most of the time there is no need to issue a single system call. Concerning |
33 | transmission, multiple packets can be sent through one system call to get the | 33 | transmission, multiple packets can be sent through one system call to get the |
34 | highest bandwidth. Using a shared buffer between the kernel and the user | 34 | highest bandwidth. Using a shared buffer between the kernel and the user |
35 | also has the benefit of minimizing packet copies. | 35 | also has the benefit of minimizing packet copies. |
36 | 36 | ||
37 | It's fine to use PACKET_MMAP to improve the performance of the capture and | 37 | It's fine to use PACKET_MMAP to improve the performance of the capture and |
38 | transmission process, but it isn't everything. At least, if you are capturing | 38 | transmission process, but it isn't everything. At least, if you are capturing |
39 | at high speeds (this is relative to the cpu speed), you should check if the | 39 | at high speeds (this is relative to the cpu speed), you should check if the |
40 | device driver of your network interface card supports some sort of interrupt | 40 | device driver of your network interface card supports some sort of interrupt |
41 | load mitigation or (even better) if it supports NAPI; also make sure it is | 41 | load mitigation or (even better) if it supports NAPI; also make sure it is |
42 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and | 42 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and |
43 | supported by devices of your network. CPU IRQ pinning of your network interface | 43 | supported by devices of your network. CPU IRQ pinning of your network interface |
44 | card can also be an advantage. | 44 | card can also be an advantage. |
45 | 45 | ||
46 | -------------------------------------------------------------------------------- | 46 | -------------------------------------------------------------------------------- |
47 | + How to use mmap() to improve capture process | 47 | + How to use mmap() to improve capture process |
48 | -------------------------------------------------------------------------------- | 48 | -------------------------------------------------------------------------------- |
49 | 49 | ||
50 | From the user standpoint, you should use the higher level libpcap library, which | 50 | From the user standpoint, you should use the higher level libpcap library, which |
51 | is a de facto standard, portable across nearly all operating systems | 51 | is a de facto standard, portable across nearly all operating systems |
52 | including Win32. | 52 | including Win32. |
53 | 53 | ||
54 | That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include | 54 | That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include |
55 | support for PACKET_MMAP, and neither, probably, does the libpcap included in your distribution. | 55 | support for PACKET_MMAP, and neither, probably, does the libpcap included in your distribution. |
56 | 56 | ||
57 | I'm aware of two implementations of PACKET_MMAP in libpcap: | 57 | I'm aware of two implementations of PACKET_MMAP in libpcap: |
58 | 58 | ||
59 | http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2) | 59 | http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2) |
60 | http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap) | 60 | http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap) |
61 | 61 | ||
62 | The rest of this document is intended for people who want to understand | 62 | The rest of this document is intended for people who want to understand |
63 | the low level details or want to improve libpcap by including PACKET_MMAP | 63 | the low level details or want to improve libpcap by including PACKET_MMAP |
64 | support. | 64 | support. |
65 | 65 | ||
66 | -------------------------------------------------------------------------------- | 66 | -------------------------------------------------------------------------------- |
67 | + How to use mmap() directly to improve capture process | 67 | + How to use mmap() directly to improve capture process |
68 | -------------------------------------------------------------------------------- | 68 | -------------------------------------------------------------------------------- |
69 | 69 | ||
70 | From the system call standpoint, the use of PACKET_MMAP involves | 70 | From the system call standpoint, the use of PACKET_MMAP involves |
71 | the following process: | 71 | the following process: |
72 | 72 | ||
73 | 73 | ||
74 | [setup] socket() -------> creation of the capture socket | 74 | [setup] socket() -------> creation of the capture socket |
75 | setsockopt() ---> allocation of the circular buffer (ring) | 75 | setsockopt() ---> allocation of the circular buffer (ring) |
76 | option: PACKET_RX_RING | 76 | option: PACKET_RX_RING |
77 | mmap() ---------> mapping of the allocated buffer to the | 77 | mmap() ---------> mapping of the allocated buffer to the |
78 | user process | 78 | user process |
79 | 79 | ||
80 | [capture] poll() ---------> to wait for incoming packets | 80 | [capture] poll() ---------> to wait for incoming packets |
81 | 81 | ||
82 | [shutdown] close() --------> destruction of the capture socket and | 82 | [shutdown] close() --------> destruction of the capture socket and |
83 | deallocation of all associated | 83 | deallocation of all associated |
84 | resources. | 84 | resources. |
85 | 85 | ||
86 | 86 | ||
87 | Socket creation and destruction is straightforward, and is done | 87 | Socket creation and destruction is straightforward, and is done |
88 | the same way with or without PACKET_MMAP: | 88 | the same way with or without PACKET_MMAP: |
89 | 89 | ||
90 | int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); | 90 | int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); |
91 | 91 | ||
92 | where mode is SOCK_RAW for the raw interface where link level | 92 | where mode is SOCK_RAW for the raw interface where link level |
93 | information can be captured or SOCK_DGRAM for the cooked | 93 | information can be captured or SOCK_DGRAM for the cooked |
94 | interface where link level information capture is not | 94 | interface where link level information capture is not |
95 | supported and a link level pseudo-header is provided | 95 | supported and a link level pseudo-header is provided |
96 | by the kernel. | 96 | by the kernel. |
97 | 97 | ||
98 | The destruction of the socket and all associated resources | 98 | The destruction of the socket and all associated resources |
99 | is done by a simple call to close(fd). | 99 | is done by a simple call to close(fd). |
100 | 100 | ||
101 | Next I will describe the PACKET_MMAP settings and their constraints, | 101 | Next I will describe the PACKET_MMAP settings and their constraints, |
102 | as well as the mapping of the circular buffer in the user process and | 102 | as well as the mapping of the circular buffer in the user process and |
103 | the use of this buffer. | 103 | the use of this buffer. |
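The capture flow above (setup, capture, shutdown) can be sketched roughly as follows. This is only a minimal sketch and is not taken from the kernel sources: the function name capture_ring_demo and the ring geometry (4 blocks of one page each, 2048 byte frames) are assumptions chosen for illustration, and opening a packet socket normally requires CAP_NET_RAW, so the call may simply fail for an unprivileged user.

```c
#include <arpa/inet.h>        /* htons() */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <poll.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: walks through setup, one poll() and shutdown.
 * Returns 0 on success, -1 on any failure. */
int capture_ring_demo(void)
{
    struct tpacket_req req;
    struct pollfd pfd;
    size_t ring_len;
    void *ring;
    int fd;

    /* [setup] creation of the capture socket */
    fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    /* [setup] allocation of the ring (geometry is an assumption) */
    memset(&req, 0, sizeof(req));
    req.tp_block_size = (unsigned int)sysconf(_SC_PAGESIZE);
    req.tp_block_nr   = 4;
    req.tp_frame_size = 2048;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) *
                        req.tp_block_nr;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        close(fd);
        return -1;
    }

    /* [setup] mapping of the allocated buffer to the user process */
    ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* [capture] poll once without blocking; a real reader would loop */
    pfd.fd = fd;
    pfd.events = POLLIN | POLLERR;
    pfd.revents = 0;
    (void)poll(&pfd, 1, 0);

    /* [shutdown] unmap the ring, then close() releases all resources */
    munmap(ring, ring_len);
    close(fd);
    return 0;
}
```

A real capture program would keep the mapping, walk the frames in the ring after each poll() and hand them back to the kernel; that part is described later in this document.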
104 | 104 | ||
105 | -------------------------------------------------------------------------------- | 105 | -------------------------------------------------------------------------------- |
106 | + How to use mmap() directly to improve transmission process | 106 | + How to use mmap() directly to improve transmission process |
107 | -------------------------------------------------------------------------------- | 107 | -------------------------------------------------------------------------------- |
108 | Transmission process is similar to capture as shown below. | 108 | Transmission process is similar to capture as shown below. |
109 | 109 | ||
110 | [setup] socket() -------> creation of the transmission socket | 110 | [setup] socket() -------> creation of the transmission socket |
111 | setsockopt() ---> allocation of the circular buffer (ring) | 111 | setsockopt() ---> allocation of the circular buffer (ring) |
112 | option: PACKET_TX_RING | 112 | option: PACKET_TX_RING |
113 | bind() ---------> bind transmission socket with a network interface | 113 | bind() ---------> bind transmission socket with a network interface |
114 | mmap() ---------> mapping of the allocated buffer to the | 114 | mmap() ---------> mapping of the allocated buffer to the |
115 | user process | 115 | user process |
116 | 116 | ||
117 | [transmission] poll() ---------> wait for free packets (optional) | 117 | [transmission] poll() ---------> wait for free packets (optional) |
118 | send() ---------> send all packets that are set as ready in | 118 | send() ---------> send all packets that are set as ready in |
119 | the ring | 119 | the ring |
120 | The flag MSG_DONTWAIT can be used to return | 120 | The flag MSG_DONTWAIT can be used to return |
121 | before end of transfer. | 121 | before end of transfer. |
122 | 122 | ||
123 | [shutdown] close() --------> destruction of the transmission socket and | 123 | [shutdown] close() --------> destruction of the transmission socket and |
124 | deallocation of all associated resources. | 124 | deallocation of all associated resources. |
125 | 125 | ||
126 | Binding the socket to your network interface is mandatory (with zero copy) to | 126 | Binding the socket to your network interface is mandatory (with zero copy) to |
127 | know the header size of frames used in the circular buffer. | 127 | know the header size of frames used in the circular buffer. |
128 | 128 | ||
129 | As with capture, each frame contains two parts: | 129 | As with capture, each frame contains two parts: |
130 | 130 | ||
131 | -------------------- | 131 | -------------------- |
132 | | struct tpacket_hdr | Header. It contains the status of | 132 | | struct tpacket_hdr | Header. It contains the status of |
133 | | | this frame | 133 | | | this frame |
134 | |--------------------| | 134 | |--------------------| |
135 | | data buffer | | 135 | | data buffer | |
136 | . . Data that will be sent over the network interface. | 136 | . . Data that will be sent over the network interface. |
137 | . . | 137 | . . |
138 | -------------------- | 138 | -------------------- |
139 | 139 | ||
140 | bind() associates the socket to your network interface thanks to | 140 | bind() associates the socket to your network interface thanks to |
141 | sll_ifindex parameter of struct sockaddr_ll. | 141 | sll_ifindex parameter of struct sockaddr_ll. |
142 | 142 | ||
143 | Initialization example: | 143 | Initialization example: |
144 | 144 | ||
145 | struct sockaddr_ll my_addr; | 145 | struct sockaddr_ll my_addr; |
146 | struct ifreq s_ifr; | 146 | struct ifreq s_ifr; |
147 | ... | 147 | ... |
148 | 148 | ||
149 | strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); | 149 | strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); |
150 | 150 | ||
151 | /* get interface index of eth0 */ | 151 | /* get interface index of eth0 */ |
152 | ioctl(this->socket, SIOCGIFINDEX, &s_ifr); | 152 | ioctl(this->socket, SIOCGIFINDEX, &s_ifr); |
153 | 153 | ||
154 | /* fill sockaddr_ll struct to prepare binding */ | 154 | /* fill sockaddr_ll struct to prepare binding */ |
155 | my_addr.sll_family = AF_PACKET; | 155 | my_addr.sll_family = AF_PACKET; |
156 | my_addr.sll_protocol = htons(ETH_P_ALL); | 156 | my_addr.sll_protocol = htons(ETH_P_ALL); |
157 | my_addr.sll_ifindex = s_ifr.ifr_ifindex; | 157 | my_addr.sll_ifindex = s_ifr.ifr_ifindex; |
158 | 158 | ||
159 | /* bind socket to eth0 */ | 159 | /* bind socket to eth0 */ |
160 | bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); | 160 | bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); |
161 | 161 | ||
162 | A complete tutorial is available at: http://wiki.gnu-log.net/ | 162 | A complete tutorial is available at: http://wiki.gnu-log.net/ |
163 | 163 | ||
164 | By default, the user should put data at : | 164 | By default, the user should put data at : |
165 | frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) | 165 | frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) |
166 | 166 | ||
167 | So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), | 167 | So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), |
168 | the beginning of the user data will be at : | 168 | the beginning of the user data will be at : |
169 | frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | 169 | frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) |
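The two expressions above describe the same offset, because TPACKET_HDRLEN is defined in linux/if_packet.h as the aligned size of struct tpacket_hdr plus sizeof(struct sockaddr_ll). A small sketch of the computation follows; the helper names tx_data_offset and tx_data_ptr are hypothetical and only serve the illustration.

```c
#include <linux/if_packet.h>  /* TPACKET_HDRLEN, struct sockaddr_ll */
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: default offset of the user data in a TX frame */
size_t tx_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Hypothetical helper: where the user data starts inside a frame */
void *tx_data_ptr(void *frame_base)
{
    return (char *)frame_base + tx_data_offset();
}
```

With the default layout (no PACKET_TX_HAS_OFF), a sender would copy the packet to tx_data_ptr(frame) for each frame it fills.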
170 | 170 | ||
171 | If you wish to put user data at a custom offset from the beginning of | 171 | If you wish to put user data at a custom offset from the beginning of |
172 | the frame (for payload alignment with SOCK_RAW mode for instance) you | 172 | the frame (for payload alignment with SOCK_RAW mode for instance) you |
173 | can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order | 173 | can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order |
174 | to make this work it must be enabled previously with setsockopt() | 174 | to make this work it must be enabled previously with setsockopt() |
175 | and the PACKET_TX_HAS_OFF option. | 175 | and the PACKET_TX_HAS_OFF option. |
176 | 176 | ||
177 | -------------------------------------------------------------------------------- | 177 | -------------------------------------------------------------------------------- |
178 | + PACKET_MMAP settings | 178 | + PACKET_MMAP settings |
179 | -------------------------------------------------------------------------------- | 179 | -------------------------------------------------------------------------------- |
180 | 180 | ||
181 | Setting up PACKET_MMAP from user level code is done with a call like | 181 | Setting up PACKET_MMAP from user level code is done with a call like |
182 | 182 | ||
183 | - Capture process | 183 | - Capture process |
184 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) | 184 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) |
185 | - Transmission process | 185 | - Transmission process |
186 | setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) | 186 | setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) |
187 | 187 | ||
188 | The most significant argument in the previous call is the req parameter; | 188 | The most significant argument in the previous call is the req parameter; |
189 | this parameter must have the following structure: | 189 | this parameter must have the following structure: |
190 | 190 | ||
191 | struct tpacket_req | 191 | struct tpacket_req |
192 | { | 192 | { |
193 | unsigned int tp_block_size; /* Minimal size of contiguous block */ | 193 | unsigned int tp_block_size; /* Minimal size of contiguous block */ |
194 | unsigned int tp_block_nr; /* Number of blocks */ | 194 | unsigned int tp_block_nr; /* Number of blocks */ |
195 | unsigned int tp_frame_size; /* Size of frame */ | 195 | unsigned int tp_frame_size; /* Size of frame */ |
196 | unsigned int tp_frame_nr; /* Total number of frames */ | 196 | unsigned int tp_frame_nr; /* Total number of frames */ |
197 | }; | 197 | }; |
198 | 198 | ||
199 | This structure is defined in /usr/include/linux/if_packet.h and establishes a | 199 | This structure is defined in /usr/include/linux/if_packet.h and establishes a |
200 | circular buffer (ring) of unswappable memory. | 200 | circular buffer (ring) of unswappable memory. |
201 | Being mapped in the capture process allows reading the captured frames and | 201 | Being mapped in the capture process allows reading the captured frames and |
202 | related meta-information like timestamps without requiring a system call. | 202 | related meta-information like timestamps without requiring a system call. |
203 | 203 | ||
204 | Frames are grouped in blocks. Each block is a physically contiguous | 204 | Frames are grouped in blocks. Each block is a physically contiguous |
205 | region of memory and holds tp_block_size/tp_frame_size frames. The total number | 205 | region of memory and holds tp_block_size/tp_frame_size frames. The total number |
206 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because | 206 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because |
207 | 207 | ||
208 | frames_per_block = tp_block_size/tp_frame_size | 208 | frames_per_block = tp_block_size/tp_frame_size |
209 | 209 | ||
210 | indeed, packet_set_ring checks that the following condition is true | 210 | indeed, packet_set_ring checks that the following condition is true |
211 | 211 | ||
212 | frames_per_block * tp_block_nr == tp_frame_nr | 212 | frames_per_block * tp_block_nr == tp_frame_nr |
213 | 213 | ||
214 | Let's see an example, with the following values: | 214 | Let's see an example, with the following values: |
215 | 215 | ||
216 | tp_block_size= 4096 | 216 | tp_block_size= 4096 |
217 | tp_frame_size= 2048 | 217 | tp_frame_size= 2048 |
218 | tp_block_nr = 4 | 218 | tp_block_nr = 4 |
219 | tp_frame_nr = 8 | 219 | tp_frame_nr = 8 |
220 | 220 | ||
221 | we will get the following buffer structure: | 221 | we will get the following buffer structure: |
222 | 222 | ||
223 | block #1 block #2 | 223 | block #1 block #2 |
224 | +---------+---------+ +---------+---------+ | 224 | +---------+---------+ +---------+---------+ |
225 | | frame 1 | frame 2 | | frame 3 | frame 4 | | 225 | | frame 1 | frame 2 | | frame 3 | frame 4 | |
226 | +---------+---------+ +---------+---------+ | 226 | +---------+---------+ +---------+---------+ |
227 | 227 | ||
228 | block #3 block #4 | 228 | block #3 block #4 |
229 | +---------+---------+ +---------+---------+ | 229 | +---------+---------+ +---------+---------+ |
230 | | frame 5 | frame 6 | | frame 7 | frame 8 | | 230 | | frame 5 | frame 6 | | frame 7 | frame 8 | |
231 | +---------+---------+ +---------+---------+ | 231 | +---------+---------+ +---------+---------+ |
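Since tp_frame_nr is redundant, it can be derived from the other three fields instead of being typed in by hand. The sketch below does that; ring_req_init is a hypothetical helper name, and it only checks that a frame fits in a block. The kernel applies further checks (for example on page size multiples and alignment) that are not reproduced here.

```c
#include <linux/if_packet.h>  /* struct tpacket_req */
#include <string.h>

/* Hypothetical helper: fill req so that
 *   frames_per_block * tp_block_nr == tp_frame_nr
 * holds by construction.
 * Returns 0 on success, -1 if a frame cannot fit in a block. */
int ring_req_init(struct tpacket_req *req, unsigned int block_size,
                  unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = frames_per_block * block_nr;
    return 0;
}
```

Calling ring_req_init(&req, 4096, 4, 2048) reproduces the example values: 2 frames per block and tp_frame_nr = 8.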
232 | 232 | ||
233 | A frame can be of any size with the only condition that it fits in a block. A block | 233 | A frame can be of any size with the only condition that it fits in a block. A block |
234 | can only hold an integer number of frames, or in other words, a frame cannot | 234 | can only hold an integer number of frames, or in other words, a frame cannot |
235 | span two blocks, so there are some details you have to take into | 235 | span two blocks, so there are some details you have to take into |
236 | account when choosing the frame_size. See "Mapping and use of the circular | 236 | account when choosing the frame_size. See "Mapping and use of the circular |
237 | buffer (ring)". | 237 | buffer (ring)". |
238 | 238 | ||
239 | -------------------------------------------------------------------------------- | 239 | -------------------------------------------------------------------------------- |
240 | + PACKET_MMAP setting constraints | 240 | + PACKET_MMAP setting constraints |
241 | -------------------------------------------------------------------------------- | 241 | -------------------------------------------------------------------------------- |
242 | 242 | ||
243 | In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), | 243 | In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), |
244 | the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or | 244 | the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or |
245 | 16384 in a 64 bit architecture. For information on these kernel versions | 245 | 16384 in a 64 bit architecture. For information on these kernel versions |
246 | see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt | 246 | see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt |
247 | 247 | ||
248 | Block size limit | 248 | Block size limit |
249 | ------------------ | 249 | ------------------ |
250 | 250 | ||
251 | As stated earlier, each block is a contiguous physical region of memory. These | 251 | As stated earlier, each block is a contiguous physical region of memory. These |
252 | memory regions are allocated with calls to the __get_free_pages() function. As | 252 | memory regions are allocated with calls to the __get_free_pages() function. As |
253 | the name indicates, this function allocates pages of memory, and the second | 253 | the name indicates, this function allocates pages of memory, and the second |
254 | argument is "order" or a power of two number of pages, that is | 254 | argument is "order" or a power of two number of pages, that is |
255 | (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, | 255 | (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, |
256 | order=2 ==> 16384 bytes, etc. The maximum size of a | 256 | order=2 ==> 16384 bytes, etc. The maximum size of a |
257 | region allocated by __get_free_pages is determined by the MAX_ORDER macro. More | 257 | region allocated by __get_free_pages is determined by the MAX_ORDER macro. More |
258 | precisely the limit can be calculated as: | 258 | precisely the limit can be calculated as: |
259 | 259 | ||
260 | PAGE_SIZE << MAX_ORDER | 260 | PAGE_SIZE << MAX_ORDER |
261 | 261 | ||
262 | In an i386 architecture PAGE_SIZE is 4096 bytes | 262 | In an i386 architecture PAGE_SIZE is 4096 bytes |
263 | In a 2.4/i386 kernel MAX_ORDER is 10 | 263 | In a 2.4/i386 kernel MAX_ORDER is 10 |
264 | In a 2.6/i386 kernel MAX_ORDER is 11 | 264 | In a 2.6/i386 kernel MAX_ORDER is 11 |
265 | 265 | ||
266 | So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel | 266 | So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel |
267 | respectively, with an i386 architecture. | 267 | respectively, with an i386 architecture. |
268 | 268 | ||
269 | User space programs can include /usr/include/sys/user.h and | 269 | User space programs can include /usr/include/sys/user.h and |
270 | /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations. | 270 | /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations. |
271 | 271 | ||
272 | The pagesize can also be determined dynamically with the getpagesize (2) | 272 | The pagesize can also be determined dynamically with the getpagesize (2) |
273 | system call. | 273 | system call. |
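The block size limit can be computed directly from the formula above. In this sketch the helper names are hypothetical, sysconf(_SC_PAGESIZE) is used in place of getpagesize() to obtain the page size at run time, and max_order must be supplied by the caller since I don't know of a portable way to query it from user space.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: PAGE_SIZE << MAX_ORDER for the given values */
size_t max_block_size(size_t page_size, unsigned int max_order)
{
    return page_size << max_order;
}

/* Same limit, using the page size of the running system.
 * max_order is an input; it is not discovered here. */
size_t max_block_size_here(unsigned int max_order)
{
    return max_block_size((size_t)sysconf(_SC_PAGESIZE), max_order);
}
```

For the i386 values quoted above, max_block_size(4096, 10) gives 4194304 bytes (4MB) and max_block_size(4096, 11) gives 8388608 bytes (8MB).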
274 | 274 | ||
275 | Block number limit | 275 | Block number limit |
276 | -------------------- | 276 | -------------------- |
277 | 277 | ||
278 | To understand the constraints of PACKET_MMAP, we have to see the structure | 278 | To understand the constraints of PACKET_MMAP, we have to see the structure |
279 | used to hold the pointers to each block. | 279 | used to hold the pointers to each block. |
280 | 280 | ||
281 | Currently, this structure is a vector called pg_vec, dynamically allocated | 281 | Currently, this structure is a vector called pg_vec, dynamically allocated |
282 | with kmalloc; its size limits the number of blocks that can be allocated. | 282 | with kmalloc; its size limits the number of blocks that can be allocated. |
283 | 283 | ||
284 | +---+---+---+---+ | 284 | +---+---+---+---+ |
285 | | x | x | x | x | | 285 | | x | x | x | x | |
286 | +---+---+---+---+ | 286 | +---+---+---+---+ |
287 | | | | | | 287 | | | | | |
288 | | | | v | 288 | | | | v |
289 | | | v block #4 | 289 | | | v block #4 |
290 | | v block #3 | 290 | | v block #3 |
291 | v block #2 | 291 | v block #2 |
292 | block #1 | 292 | block #1 |
293 | 293 | ||
294 | kmalloc allocates any number of bytes of physically contiguous memory from | 294 | kmalloc allocates any number of bytes of physically contiguous memory from |
295 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | 295 | a pool of pre-determined sizes. This pool of memory is maintained by the slab |
296 | allocator, which is ultimately responsible for doing the allocation and | 296 | allocator, which is ultimately responsible for doing the allocation and |
297 | hence imposes the maximum memory that kmalloc can allocate. | 297 | hence imposes the maximum memory that kmalloc can allocate. |
298 | 298 | ||
299 | In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The | 299 | In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The |
300 | predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" | 300 | predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" |
301 | entries of /proc/slabinfo | 301 | entries of /proc/slabinfo |
302 | 302 | ||
303 | In a 32 bit architecture, pointers are 4 bytes long, so the total number of | 303 | In a 32 bit architecture, pointers are 4 bytes long, so the total number of |
304 | pointers to blocks is | 304 | pointers to blocks is |
305 | 305 | ||
306 | 131072/4 = 32768 blocks | 306 | 131072/4 = 32768 blocks |
307 | 307 | ||
308 | PACKET_MMAP buffer size calculator | 308 | PACKET_MMAP buffer size calculator |
309 | ------------------------------------ | 309 | ------------------------------------ |
310 | 310 | ||
311 | Definitions: | 311 | Definitions: |
312 | 312 | ||
313 | <size-max> : is the maximum size allocable with kmalloc (see /proc/slabinfo) | 313 | <size-max> : is the maximum size allocable with kmalloc (see /proc/slabinfo) |
314 | <pointer size>: depends on the architecture -- sizeof(void *) | 314 | <pointer size>: depends on the architecture -- sizeof(void *) |
315 | <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) | 315 | <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) |
316 | <max-order> : is the value defined with MAX_ORDER | 316 | <max-order> : is the value defined with MAX_ORDER |
317 | <frame size> : it's an upper bound of frame's capture size (more on this later) | 317 | <frame size> : it's an upper bound of frame's capture size (more on this later) |
318 | 318 | ||
319 | from these definitions we will derive | 319 | from these definitions we will derive |
320 | 320 | ||
321 | <block number> = <size-max>/<pointer size> | 321 | <block number> = <size-max>/<pointer size> |
322 | <block size> = <page size> << <max-order> | 322 | <block size> = <page size> << <max-order> |
323 | 323 | ||
324 | so, the max buffer size is | 324 | so, the max buffer size is |
325 | 325 | ||
326 | <block number> * <block size> | 326 | <block number> * <block size> |
327 | 327 | ||
328 | and the number of frames is | 328 | and the number of frames is |
329 | 329 | ||
330 | <block number> * <block size> / <frame size> | 330 | <block number> * <block size> / <frame size> |
331 | 331 | ||
332 | Suppose the following parameters, which apply for 2.6 kernel and an | 332 | Suppose the following parameters, which apply for 2.6 kernel and an |
333 | i386 architecture: | 333 | i386 architecture: |
334 | 334 | ||
335 | <size-max> = 131072 bytes | 335 | <size-max> = 131072 bytes |
336 | <pointer size> = 4 bytes | 336 | <pointer size> = 4 bytes |
337 | <page size> = 4096 bytes | 337 | <page size> = 4096 bytes |
338 | <max-order> = 11 | 338 | <max-order> = 11 |
339 | 339 | ||
340 | and a value for <frame size> of 2048 bytes. These parameters will yield | 340 | and a value for <frame size> of 2048 bytes. These parameters will yield |
341 | 341 | ||
342 | <block number> = 131072/4 = 32768 blocks | 342 | <block number> = 131072/4 = 32768 blocks |
343 | <block size> = 4096 << 11 = 8 MiB. | 343 | <block size> = 4096 << 11 = 8 MiB. |
344 | 344 | ||
345 | and hence the buffer will have a 262144 MiB size. So it can hold | 345 | and hence the buffer will have a 262144 MiB size. So it can hold |
346 | 262144 MiB / 2048 bytes = 134217728 frames | 346 | 262144 MiB / 2048 bytes = 134217728 frames |
347 | 347 | ||
348 | Actually, this buffer size is not possible with an i386 architecture. | 348 | Actually, this buffer size is not possible with an i386 architecture. |
349 | Remember that the memory is allocated in kernel space, in the case of | 349 | Remember that the memory is allocated in kernel space, in the case of |
350 | an i386 kernel's memory size is limited to 1GiB. | 350 | an i386 kernel's memory size is limited to 1GiB. |
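The arithmetic above can be reproduced with a few lines of C. This is only a
sketch: the struct and function names are hypothetical and do not exist in
the kernel, and the result is the theoretical limit given by the formulas,
not the amount of memory a given machine can actually provide.

```c
#include <stdint.h>

/* Hypothetical helper types and names, for illustration only. */
struct ring_limits {
	uint64_t block_nr;	/* <size-max> / <pointer size> */
	uint64_t block_size;	/* <pagesize> << <max-order>, in bytes */
	uint64_t buffer_size;	/* block_nr * block_size, in bytes */
	uint64_t frame_nr;	/* buffer_size / <frame size> */
};

/* Apply the formulas from the text; 64-bit math avoids overflow. */
static struct ring_limits compute_ring_limits(uint64_t size_max,
					      uint64_t pointer_size,
					      uint64_t page_size,
					      unsigned int max_order,
					      uint64_t frame_size)
{
	struct ring_limits l;

	l.block_nr = size_max / pointer_size;
	l.block_size = page_size << max_order;
	l.buffer_size = l.block_nr * l.block_size;
	l.frame_nr = l.buffer_size / frame_size;
	return l;
}
```

Calling compute_ring_limits(131072, 4, 4096, 11, 2048) gives the i386 figures
used in the example above.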
351 | 351 | ||
Memory allocations are not freed until the socket is closed. They are made
with GFP_KERNEL priority, which basically means that the allocation can wait
and swap out other processes' memory in order to obtain the necessary
memory, so normally the limits can be reached.

 Other constraints
-------------------

If you check the source code you will see that what is drawn here as a frame
is not only the link-level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link-level
frame's meta information, such as its timestamp. So what is drawn here as a
frame is really the following (from include/linux/if_packet.h):

/*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following conditions are checked in packet_set_ring:

    tp_block_size must be a multiple of PAGE_SIZE (1)
    tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
    tp_frame_size must be a multiple of TPACKET_ALIGNMENT
    tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
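The conditions listed above can be checked in user space before calling
setsockopt(). The following is a minimal sketch, not the kernel's own code:
ring_req_is_valid() and RING_ALIGNMENT are hypothetical names, and the page
size and header length are passed in by the caller (for example from
sysconf(_SC_PAGESIZE) and TPACKET_HDRLEN).

```c
#include <stdbool.h>

#define RING_ALIGNMENT 16u	/* assumed to mirror TPACKET_ALIGNMENT */

/* Hypothetical pre-check mirroring the four conditions in the text. */
static bool ring_req_is_valid(unsigned int block_size, unsigned int block_nr,
			      unsigned int frame_size, unsigned int frame_nr,
			      unsigned int page_size, unsigned int hdrlen)
{
	unsigned int frames_per_block;

	if (block_size == 0 || page_size == 0 || block_size % page_size)
		return false;	/* not a multiple of PAGE_SIZE */
	if (frame_size <= hdrlen)
		return false;	/* no room for the header plus data */
	if (frame_size % RING_ALIGNMENT)
		return false;	/* not a multiple of TPACKET_ALIGNMENT */

	frames_per_block = block_size / frame_size;
	if (frames_per_block == 0)
		return false;	/* a frame cannot be larger than a block */

	/* 64-bit product so that the comparison cannot overflow */
	return (unsigned long long)frames_per_block * block_nr == frame_nr;
}
```

A request that fails this check can be corrected before the kernel sees it.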
389 | 389 | ||
--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------

The mapping of the buffer into the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed:

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be contiguously
spaced tp_frame_size bytes apart. If not, there will be a gap after every
tp_block_size/tp_frame_size frames, because a frame cannot span two
blocks.
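Because of that gap, the address of a frame is best computed from its block
and its position within the block. A minimal sketch, where frame_offset() is
a hypothetical helper name:

```c
#include <stddef.h>

/*
 * Hypothetical helper: byte offset of frame 'idx' from the start of the
 * mapped ring. Frames never span blocks, so any tail of a block that is
 * too small for a whole frame is skipped.
 */
static size_t frame_offset(size_t idx, size_t block_size, size_t frame_size)
{
	size_t frames_per_block = block_size / frame_size;

	return (idx / frames_per_block) * block_size +
	       (idx % frames_per_block) * frame_size;
}
```

With a block size of 4096 and a frame size of 1536, each block holds two
frames, so frame 2 starts at offset 4096 rather than 3072.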
406 | 406 | ||
At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready to be used
by the kernel. If not, it holds a frame the user can read, and the
following flags apply:

+++ Capture process:
     from include/linux/if_packet.h

     #define TP_STATUS_COPY          2
     #define TP_STATUS_LOSING        4
     #define TP_STATUS_CSUMNOTREADY  8

TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
                        checksum will be done in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

For convenience there are also the following defines:

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read the user must zero the status field, so the kernel
can use that frame buffer again.
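The ownership hand-over described above can be sketched for a TPACKET_V1
receive ring as follows. The rx_frame_*() helpers are hypothetical names,
and the use of the GCC builtin __sync_synchronize() as a memory barrier is
an assumption of this sketch, not something the text prescribes.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical: non-zero once the kernel has handed the frame to us. */
static int rx_frame_ready(const struct tpacket_hdr *hdr)
{
	unsigned long status = *(const volatile unsigned long *)&hdr->tp_status;

	__sync_synchronize();	/* read status before the frame contents */
	return (status & TP_STATUS_USER) != 0;
}

/* Hypothetical: start of the captured data (tp_snaplen bytes long). */
static const uint8_t *rx_frame_data(const struct tpacket_hdr *hdr)
{
	return (const uint8_t *)hdr + hdr->tp_mac;
}

/* Hypothetical: zero the status so the kernel can reuse the frame. */
static void rx_frame_release(struct tpacket_hdr *hdr)
{
	__sync_synchronize();	/* finish all reads before releasing */
	*(volatile unsigned long *)&hdr->tp_status = TP_STATUS_KERNEL;
}
```

A reader would call rx_frame_ready(), process the bytes returned by
rx_frame_data(), and only then call rx_frame_release(), since the kernel
may overwrite the frame as soon as its status is zero again.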
451 | 451 | ||
The user can use poll (any other variant should apply too) to check if new
packets are in the ring:

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
incur a race condition.

+++ Transmission process:
These defines are also used for transmission:

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills the data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is
ready to transmit, it calls send(). Then all buffers with status equal to
TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel sets
the status of each frame being sent to TP_STATUS_SENDING until the end of the
transfer. At the end of each transfer, the buffer status returns to
TP_STATUS_AVAILABLE.

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available
(i.e. while status == TP_STATUS_SENDING):

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
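Filling a transmit frame could look roughly like the sketch below for a
TPACKET_V1 ring. tx_queue_frame() is a hypothetical helper. The payload
offset TPACKET_HDRLEN - sizeof(struct sockaddr_ll) is an assumption of this
sketch that the text above does not state, so it should be verified against
the kernel in use.

```c
#include <string.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: copy 'len' bytes into the frame at 'hdr' and mark
 * it for transmission on the next send().
 * ASSUMPTION: the payload starts TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
 * bytes after the start of the frame.
 * Returns 0 on success, -1 if the frame is busy or the data does not fit.
 */
static int tx_queue_frame(struct tpacket_hdr *hdr, unsigned int frame_size,
			  const void *data, unsigned int len)
{
	const unsigned int off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return -1;	/* still queued or being sent */
	if (frame_size < off || len > frame_size - off)
		return -1;	/* payload larger than the frame */

	memcpy((uint8_t *)hdr + off, data, len);
	hdr->tp_len = len;
	__sync_synchronize();	/* publish the data before the status */
	hdr->tp_status = TP_STATUS_SEND_REQUEST;
	return 0;
}
```

After queueing one or more frames this way, a single send() call on the
socket asks the kernel to transmit them.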
496 | 496 | ||
-------------------------------------------------------------------------------
+ What TPACKET versions are available and when to use them?
-------------------------------------------------------------------------------

    int val = tpacket_version;
    socklen_t len = sizeof(val);
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
    getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.

TPACKET_V1:
	- Default if not otherwise specified by setsockopt(2)
	- RX_RING, TX_RING available
	- VLAN metadata information available for packets
	  (TP_STATUS_VLAN_VALID)

TPACKET_V1 --> TPACKET_V2:
	- Made 64 bit clean due to unsigned long usage in TPACKET_V1
	  structures, thus this also works on 64 bit kernel with 32 bit
	  userspace and the like
	- Timestamp resolution in nanoseconds instead of microseconds
	- RX_RING, TX_RING available
	- How to switch to TPACKET_V2:
		1. Replace struct tpacket_hdr by struct tpacket2_hdr
		2. Query header len and save
		3. Set protocol version to 2, set up ring as usual
		4. For getting the sockaddr_ll,
		   use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
		   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
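Steps 2 to 4 of the switch to TPACKET_V2 can be sketched as below.
tpacket_v2_enable() and frame_sockaddr() are hypothetical helper names. The
sketch assumes 'fd' is an AF_PACKET socket on which no ring has been set up
yet, and the fallback definition of SOL_PACKET is only there for C libraries
that do not provide it.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
# define SOL_PACKET 263
#endif

/*
 * Hypothetical helper: switch 'fd' to TPACKET_V2. Returns the header
 * length reported by the kernel for that version, or -1 on error.
 */
static int tpacket_v2_enable(int fd)
{
	int val = TPACKET_V2;
	int hdrlen;
	socklen_t len = sizeof(val);

	/* Step 2: PACKET_HDRLEN takes the version in, returns the length */
	if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
		return -1;
	hdrlen = val;

	/* Step 3: select the protocol version */
	val = TPACKET_V2;
	if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)) < 0)
		return -1;

	return hdrlen;
}

/* Step 4: the sockaddr_ll follows the header, rounded up by TPACKET_ALIGN */
static struct sockaddr_ll *frame_sockaddr(void *hdr, int hdrlen)
{
	return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```

The ring itself is then set up as usual, and the saved header length is
passed to frame_sockaddr() for every frame that is read.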
526 | 526 | ||
TPACKET_V2 --> TPACKET_V3:
	- Flexible buffer implementation:
		1. Blocks can be configured with non-static frame-size
		2. Read/poll is at a block-level (as opposed to packet-level)
		3. Added poll timeout to avoid indefinite user-space wait
		   on idle links
		4. Added user-configurable knobs:
			4.1 block::timeout
			4.2 tpkt_hdr::sk_rxhash
	- RX Hash data available in user space
	- Currently only RX_RING available

-------------------------------------------------------------------------------
+ AF_PACKET fanout mode
-------------------------------------------------------------------------------

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.):
548 | 548 | ||
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

#include <unistd.h>

#include <arpa/inet.h>	/* htons() */

#include <linux/if_ether.h>
#include <linux/if_packet.h>

#include <net/if.h>

static const char *device_name;
static int fanout_type;
static int fanout_id;

#ifndef PACKET_FANOUT
# define PACKET_FANOUT			18
# define PACKET_FANOUT_HASH		0
# define PACKET_FANOUT_LB		1
#endif

/* Returns a bound socket that joined the fanout group, or -1 on error. */
static int setup_socket(void)
{
	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	struct sockaddr_ll ll;
	struct ifreq ifr;
	int fanout_arg;

	if (fd < 0) {
		perror("socket");
		return -1;
	}

	memset(&ifr, 0, sizeof(ifr));
	/* ifr is zeroed, so the copied name stays NUL-terminated */
	strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
	err = ioctl(fd, SIOCGIFINDEX, &ifr);
	if (err < 0) {
		perror("SIOCGIFINDEX");
		goto fail;
	}

	memset(&ll, 0, sizeof(ll));
	ll.sll_family = AF_PACKET;
	ll.sll_ifindex = ifr.ifr_ifindex;
	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	if (err < 0) {
		perror("bind");
		goto fail;
	}

	/* Group id in the lower 16 bits, fanout type in the upper 16 bits */
	fanout_arg = (fanout_id | (fanout_type << 16));
	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			 &fanout_arg, sizeof(fanout_arg));
	if (err) {
		perror("setsockopt");
		goto fail;
	}

	return fd;

fail:
	close(fd);
	return -1;
}

static void fanout_thread(void)
{
	int fd = setup_socket();
	int limit = 10000;

	if (fd < 0)
		exit(EXIT_FAILURE);

	while (limit-- > 0) {
		char buf[1600];
		int err;

		err = read(fd, buf, sizeof(buf));
		if (err < 0) {
			perror("read");
			exit(EXIT_FAILURE);
		}
		if ((limit % 10) == 0)
			fprintf(stdout, "(%d) \n", getpid());
	}

	fprintf(stdout, "%d: Received 10000 packets\n", getpid());

	close(fd);
	exit(0);
}

int main(int argc, char **argp)
{
	int fd, err;
	int i;

	if (argc != 3) {
		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
		return EXIT_FAILURE;
	}

	if (!strcmp(argp[2], "hash"))
		fanout_type = PACKET_FANOUT_HASH;
	else if (!strcmp(argp[2], "lb"))
		fanout_type = PACKET_FANOUT_LB;
	else {
		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
		exit(EXIT_FAILURE);
	}

	device_name = argp[1];
	fanout_id = getpid() & 0xffff;

	for (i = 0; i < 4; i++) {
		pid_t pid = fork();

		switch (pid) {
		case 0:
			fanout_thread();	/* never returns, the child exits */

		case -1:
			perror("fork");
			exit(EXIT_FAILURE);
		}
	}

	for (i = 0; i < 4; i++) {
		int status;

		wait(&status);
	}

	return 0;
}
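The value passed to PACKET_FANOUT in the example above packs two fields into
one integer: the group id in the lower 16 bits and the fanout type in the
upper 16 bits. A small sketch of that packing, with hypothetical helper
names:

```c
#include <stdint.h>

/* Hypothetical: build the option value from a group id and a fanout type. */
static int fanout_arg_pack(uint16_t group_id, uint16_t type)
{
	return (int)(group_id | ((uint32_t)type << 16));
}

/* Hypothetical: recover the group id (lower 16 bits). */
static uint16_t fanout_arg_id(int arg)
{
	return (uint16_t)((uint32_t)arg & 0xffff);
}

/* Hypothetical: recover the fanout type (upper 16 bits). */
static uint16_t fanout_arg_type(int arg)
{
	return (uint16_t)(((uint32_t)arg >> 16) & 0xffff);
}
```

This is why the example masks getpid() with 0xffff: only 16 bits are
available for the group id.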
686 | 686 | ||
-------------------------------------------------------------------------------
+ AF_PACKET TPACKET_V3 example
-------------------------------------------------------------------------------

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its
predecessor.

It is said that TPACKET_V3 brings the following benefits:
 *) ~15 - 20% reduction in CPU-usage
 *) ~20% increase in packet capture rate
 *) ~2x increase in packet density
 *) Port aggregation analysis
 *) Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
706 | 706 | ||
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <poll.h>
#include <unistd.h>
#include <signal.h>
#include <inttypes.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

#define BLOCK_SIZE		(1 << 22)
#define FRAME_SIZE		2048

#define NUM_BLOCKS		64
#define NUM_FRAMES		((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)

#define BLOCK_RETIRE_TOV_IN_MS	64
#define BLOCK_PRIV_AREA_SZ	13

#define ALIGN_8(x)		(((x) + 8 - 1) & ~(8 - 1))

#define BLOCK_STATUS(x)		((x)->h1.block_status)
#define BLOCK_NUM_PKTS(x)	((x)->h1.num_pkts)
#define BLOCK_O2FP(x)		((x)->h1.offset_to_first_pkt)
#define BLOCK_LEN(x)		((x)->h1.blk_len)
#define BLOCK_SNUM(x)		((x)->h1.seq_num)
#define BLOCK_O2PRIV(x)		((x)->offset_to_priv)
#define BLOCK_PRIV(x)		((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
#define BLOCK_HDR_LEN		(ALIGN_8(sizeof(struct block_desc)))
#define BLOCK_PLUS_PRIV(sz_pri)	(BLOCK_HDR_LEN + ALIGN_8((sz_pri)))

#ifndef likely
# define likely(x)		__builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
# define unlikely(x)		__builtin_expect(!!(x), 0)
#endif

struct block_desc {
	uint32_t version;
	uint32_t offset_to_priv;
	struct tpacket_hdr_v1 h1;
};

struct ring {
	struct iovec *rd;
	uint8_t *map;
	struct tpacket_req3 req;
};

static unsigned long packets_total = 0, bytes_total = 0;
/* volatile: written from the signal handler, read elsewhere */
static volatile sig_atomic_t sigint = 0;

void sighandler(int num)
{
	sigint = 1;
}

static int setup_socket(struct ring *ring, char *netdev)
{
	int err, i, fd, v = TPACKET_V3;
	struct sockaddr_ll ll;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket");
		exit(1);
	}

	/* The version must be selected before the ring is requested */
	err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
	if (err < 0) {
		perror("setsockopt");
		exit(1);
	}

	memset(&ring->req, 0, sizeof(ring->req));
	ring->req.tp_block_size = BLOCK_SIZE;
	ring->req.tp_frame_size = FRAME_SIZE;
	ring->req.tp_block_nr = NUM_BLOCKS;
	ring->req.tp_frame_nr = NUM_FRAMES;
	ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
	ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
	ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;

	err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
			 sizeof(ring->req));
	if (err < 0) {
		perror("setsockopt");
		exit(1);
	}

	/* One mapping covers all blocks of the ring */
	ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
			 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
			 fd, 0);
	if (ring->map == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	/* One iovec per block, pointing into the mapping */
	ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
	assert(ring->rd);
	for (i = 0; i < ring->req.tp_block_nr; ++i) {
		ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
		ring->rd[i].iov_len = ring->req.tp_block_size;
	}
820 | 820 | ||
821 | memset(&ll, 0, sizeof(ll)); | 821 | memset(&ll, 0, sizeof(ll)); |
822 | ll.sll_family = PF_PACKET; | 822 | ll.sll_family = PF_PACKET; |
823 | ll.sll_protocol = htons(ETH_P_ALL); | 823 | ll.sll_protocol = htons(ETH_P_ALL); |
824 | ll.sll_ifindex = if_nametoindex(netdev); | 824 | ll.sll_ifindex = if_nametoindex(netdev); |
825 | ll.sll_hatype = 0; | 825 | ll.sll_hatype = 0; |
826 | ll.sll_pkttype = 0; | 826 | ll.sll_pkttype = 0; |
827 | ll.sll_halen = 0; | 827 | ll.sll_halen = 0; |
828 | 828 | ||
829 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); | 829 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); |
830 | if (err < 0) { | 830 | if (err < 0) { |
831 | perror("bind"); | 831 | perror("bind"); |
832 | exit(1); | 832 | exit(1); |
833 | } | 833 | } |
834 | 834 | ||
835 | return fd; | 835 | return fd; |
836 | } | 836 | } |
837 | 837 | ||
838 | #ifdef __checked | 838 | #ifdef __checked |
839 | static uint64_t prev_block_seq_num = 0; | 839 | static uint64_t prev_block_seq_num = 0; |
840 | 840 | ||
841 | void assert_block_seq_num(struct block_desc *pbd) | 841 | void assert_block_seq_num(struct block_desc *pbd) |
842 | { | 842 | { |
843 | if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) { | 843 | if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) { |
844 | printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != " | 844 | printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != " |
845 | "actual seq:%"PRIu64"\n", prev_block_seq_num, | 845 | "actual seq:%"PRIu64"\n", prev_block_seq_num, |
846 | prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd)); | 846 | prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd)); |
847 | exit(1); | 847 | exit(1); |
848 | } | 848 | } |
849 | 849 | ||
850 | prev_block_seq_num = BLOCK_SNUM(pbd); | 850 | prev_block_seq_num = BLOCK_SNUM(pbd); |
851 | } | 851 | } |
852 | 852 | ||
853 | static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) | 853 | static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) |
854 | { | 854 | { |
855 | if (BLOCK_NUM_PKTS(pbd)) { | 855 | if (BLOCK_NUM_PKTS(pbd)) { |
856 | if (unlikely(bytes != BLOCK_LEN(pbd))) { | 856 | if (unlikely(bytes != BLOCK_LEN(pbd))) { |
857 | printf("block:%u with %u packets, expected len:%u != actual len:%u\n", | 857 | printf("block:%u with %u packets, expected len:%u != actual len:%u\n", |
858 | block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd)); | 858 | block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd)); |
859 | exit(1); | 859 | exit(1); |
860 | } | 860 | } |
861 | } else { | 861 | } else { |
862 | if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) { | 862 | if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) { |
863 | printf("block:%u, expected len:%lu != actual len:%u\n", | 863 | printf("block:%u, expected len:%lu != actual len:%u\n", |
864 | block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd)); | 864 | block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd)); |
865 | exit(1); | 865 | exit(1); |
866 | } | 866 | } |
867 | } | 867 | } |
868 | } | 868 | } |
869 | 869 | ||
870 | static void assert_block_header(struct block_desc *pbd, const int block_num) | 870 | static void assert_block_header(struct block_desc *pbd, const int block_num) |
871 | { | 871 | { |
872 | uint32_t block_status = BLOCK_STATUS(pbd); | 872 | uint32_t block_status = BLOCK_STATUS(pbd); |
873 | 873 | ||
874 | if (unlikely((block_status & TP_STATUS_USER) == 0)) { | 874 | if (unlikely((block_status & TP_STATUS_USER) == 0)) { |
875 | printf("block:%u, not in TP_STATUS_USER\n", block_num); | 875 | printf("block:%u, not in TP_STATUS_USER\n", block_num); |
876 | exit(1); | 876 | exit(1); |
877 | } | 877 | } |
878 | 878 | ||
879 | assert_block_seq_num(pbd); | 879 | assert_block_seq_num(pbd); |
880 | } | 880 | } |
881 | #else | 881 | #else |
882 | static inline void assert_block_header(struct block_desc *pbd, const int block_num) | 882 | static inline void assert_block_header(struct block_desc *pbd, const int block_num) |
883 | { | 883 | { |
884 | } | 884 | } |
885 | static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) | 885 | static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) |
886 | { | 886 | { |
887 | } | 887 | } |
888 | #endif | 888 | #endif |
889 | 889 | ||
890 | static void display(struct tpacket3_hdr *ppd) | 890 | static void display(struct tpacket3_hdr *ppd) |
891 | { | 891 | { |
892 | struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); | 892 | struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); |
893 | struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); | 893 | struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); |
894 | 894 | ||
895 | if (eth->h_proto == htons(ETH_P_IP)) { | 895 | if (eth->h_proto == htons(ETH_P_IP)) { |
896 | struct sockaddr_in ss, sd; | 896 | struct sockaddr_in ss, sd; |
897 | char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; | 897 | char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; |
898 | 898 | ||
899 | memset(&ss, 0, sizeof(ss)); | 899 | memset(&ss, 0, sizeof(ss)); |
900 | ss.sin_family = PF_INET; | 900 | ss.sin_family = PF_INET; |
901 | ss.sin_addr.s_addr = ip->saddr; | 901 | ss.sin_addr.s_addr = ip->saddr; |
902 | getnameinfo((struct sockaddr *) &ss, sizeof(ss), | 902 | getnameinfo((struct sockaddr *) &ss, sizeof(ss), |
903 | sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); | 903 | sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); |
904 | 904 | ||
905 | memset(&sd, 0, sizeof(sd)); | 905 | memset(&sd, 0, sizeof(sd)); |
906 | sd.sin_family = PF_INET; | 906 | sd.sin_family = PF_INET; |
907 | sd.sin_addr.s_addr = ip->daddr; | 907 | sd.sin_addr.s_addr = ip->daddr; |
908 | getnameinfo((struct sockaddr *) &sd, sizeof(sd), | 908 | getnameinfo((struct sockaddr *) &sd, sizeof(sd), |
909 | dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); | 909 | dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); |
910 | 910 | ||
911 | printf("%s -> %s, ", sbuff, dbuff); | 911 | printf("%s -> %s, ", sbuff, dbuff); |
912 | } | 912 | } |
913 | 913 | ||
914 | printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); | 914 | printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); |
915 | } | 915 | } |
916 | 916 | ||
917 | static void walk_block(struct block_desc *pbd, const int block_num) | 917 | static void walk_block(struct block_desc *pbd, const int block_num) |
918 | { | 918 | { |
919 | int num_pkts = BLOCK_NUM_PKTS(pbd), i; | 919 | int num_pkts = BLOCK_NUM_PKTS(pbd), i; |
920 | unsigned long bytes = 0; | 920 | unsigned long bytes = 0; |
921 | unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ); | 921 | unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ); |
922 | struct tpacket3_hdr *ppd; | 922 | struct tpacket3_hdr *ppd; |
923 | 923 | ||
924 | assert_block_header(pbd, block_num); | 924 | assert_block_header(pbd, block_num); |
925 | 925 | ||
926 | ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd)); | 926 | ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd)); |
927 | for (i = 0; i < num_pkts; ++i) { | 927 | for (i = 0; i < num_pkts; ++i) { |
928 | bytes += ppd->tp_snaplen; | 928 | bytes += ppd->tp_snaplen; |
929 | if (ppd->tp_next_offset) | 929 | if (ppd->tp_next_offset) |
930 | bytes_with_padding += ppd->tp_next_offset; | 930 | bytes_with_padding += ppd->tp_next_offset; |
931 | else | 931 | else |
932 | bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac); | 932 | bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac); |
933 | 933 | ||
934 | display(ppd); | 934 | display(ppd); |
935 | 935 | ||
936 | ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset); | 936 | ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset); |
937 | __sync_synchronize(); | 937 | __sync_synchronize(); |
938 | } | 938 | } |
939 | 939 | ||
940 | assert_block_len(pbd, bytes_with_padding, block_num); | 940 | assert_block_len(pbd, bytes_with_padding, block_num); |
941 | 941 | ||
942 | packets_total += num_pkts; | 942 | packets_total += num_pkts; |
943 | bytes_total += bytes; | 943 | bytes_total += bytes; |
944 | } | 944 | } |
945 | 945 | ||
946 | void flush_block(struct block_desc *pbd) | 946 | void flush_block(struct block_desc *pbd) |
947 | { | 947 | { |
948 | BLOCK_STATUS(pbd) = TP_STATUS_KERNEL; | 948 | BLOCK_STATUS(pbd) = TP_STATUS_KERNEL; |
949 | __sync_synchronize(); | 949 | __sync_synchronize(); |
950 | } | 950 | } |
951 | 951 | ||
952 | static void teardown_socket(struct ring *ring, int fd) | 952 | static void teardown_socket(struct ring *ring, int fd) |
953 | { | 953 | { |
954 | munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); | 954 | munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); |
955 | free(ring->rd); | 955 | free(ring->rd); |
956 | close(fd); | 956 | close(fd); |
957 | } | 957 | } |
958 | 958 | ||
959 | int main(int argc, char **argp) | 959 | int main(int argc, char **argp) |
960 | { | 960 | { |
961 | int fd, err; | 961 | int fd, err; |
962 | socklen_t len; | 962 | socklen_t len; |
963 | struct ring ring; | 963 | struct ring ring; |
964 | struct pollfd pfd; | 964 | struct pollfd pfd; |
965 | unsigned int block_num = 0; | 965 | unsigned int block_num = 0; |
966 | struct block_desc *pbd; | 966 | struct block_desc *pbd; |
967 | struct tpacket_stats_v3 stats; | 967 | struct tpacket_stats_v3 stats; |
968 | 968 | ||
969 | if (argc != 2) { | 969 | if (argc != 2) { |
970 | fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); | 970 | fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); |
971 | return EXIT_FAILURE; | 971 | return EXIT_FAILURE; |
972 | } | 972 | } |
973 | 973 | ||
974 | signal(SIGINT, sighandler); | 974 | signal(SIGINT, sighandler); |
975 | 975 | ||
976 | memset(&ring, 0, sizeof(ring)); | 976 | memset(&ring, 0, sizeof(ring)); |
977 | fd = setup_socket(&ring, argp[argc - 1]); | 977 | fd = setup_socket(&ring, argp[argc - 1]); |
978 | assert(fd > 0); | 978 | assert(fd > 0); |
979 | 979 | ||
980 | memset(&pfd, 0, sizeof(pfd)); | 980 | memset(&pfd, 0, sizeof(pfd)); |
981 | pfd.fd = fd; | 981 | pfd.fd = fd; |
982 | pfd.events = POLLIN | POLLERR; | 982 | pfd.events = POLLIN | POLLERR; |
983 | pfd.revents = 0; | 983 | pfd.revents = 0; |
984 | 984 | ||
985 | while (likely(!sigint)) { | 985 | while (likely(!sigint)) { |
986 | pbd = (struct block_desc *) ring.rd[block_num].iov_base; | 986 | pbd = (struct block_desc *) ring.rd[block_num].iov_base; |
987 | retry_block: | 987 | retry_block: |
988 | if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) { | 988 | if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) { |
989 | poll(&pfd, 1, -1); | 989 | poll(&pfd, 1, -1); |
990 | goto retry_block; | 990 | goto retry_block; |
991 | } | 991 | } |
992 | 992 | ||
993 | walk_block(pbd, block_num); | 993 | walk_block(pbd, block_num); |
994 | flush_block(pbd); | 994 | flush_block(pbd); |
995 | block_num = (block_num + 1) % NUM_BLOCKS; | 995 | block_num = (block_num + 1) % NUM_BLOCKS; |
996 | } | 996 | } |
997 | 997 | ||
998 | len = sizeof(stats); | 998 | len = sizeof(stats); |
999 | err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); | 999 | err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); |
1000 | if (err < 0) { | 1000 | if (err < 0) { |
1001 | perror("getsockopt"); | 1001 | perror("getsockopt"); |
1002 | exit(1); | 1002 | exit(1); |
1003 | } | 1003 | } |
1004 | 1004 | ||
1005 | fflush(stdout); | 1005 | fflush(stdout); |
1006 | printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", | 1006 | printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", |
1007 | stats.tp_packets, bytes_total, stats.tp_drops, | 1007 | stats.tp_packets, bytes_total, stats.tp_drops, |
1008 | stats.tp_freeze_q_cnt); | 1008 | stats.tp_freeze_q_cnt); |
1009 | 1009 | ||
1010 | teardown_socket(&ring, fd); | 1010 | teardown_socket(&ring, fd); |
1011 | return 0; | 1011 | return 0; |
1012 | } | 1012 | } |
1013 | 1013 | ||
1014 | ------------------------------------------------------------------------------- | 1014 | ------------------------------------------------------------------------------- |
1015 | + PACKET_TIMESTAMP | 1015 | + PACKET_TIMESTAMP |
1016 | ------------------------------------------------------------------------------- | 1016 | ------------------------------------------------------------------------------- |
1017 | 1017 | ||
1018 | The PACKET_TIMESTAMP setting determines the source of the timestamp in | 1018 | The PACKET_TIMESTAMP setting determines the source of the timestamp in |
1019 | the packet meta information. If your NIC is capable of timestamping | 1019 | the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your |
1020 | packets in hardware, you can request those hardware timestamps to used. | 1020 | NIC is capable of timestamping packets in hardware, you can request those |
1021 | Note: you may need to enable the generation of hardware timestamps with | 1021 | hardware timestamps to be used. Note: you may need to enable the generation |
1022 | SIOCSHWTSTAMP. | 1022 | of hardware timestamps with SIOCSHWTSTAMP (see related information from |
1023 | Documentation/networking/timestamping.txt). | ||
1023 | 1024 | ||
1024 | PACKET_TIMESTAMP accepts the same integer bit field as | 1025 | PACKET_TIMESTAMP accepts the same integer bit field as |
1025 | SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE | 1026 | SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE |
1026 | and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by | 1027 | and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by |
1027 | PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over | 1028 | PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over |
1028 | SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. | 1029 | SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. |
1029 | 1030 | ||
1030 | int req = 0; | 1031 | int req = 0; |
1031 | req |= SOF_TIMESTAMPING_SYS_HARDWARE; | 1032 | req |= SOF_TIMESTAMPING_SYS_HARDWARE; |
1032 | setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)); | 1033 | setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)); |
1033 | 1034 | ||
1034 | If PACKET_TIMESTAMP is not set, a software timestamp generated inside | 1035 | For the mmap(2)ed ring buffers, such timestamps are stored in the |
1035 | the networking stack is used (the behavior before this setting was added). | 1036 | tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine |
1037 | what kind of timestamp has been reported, the tp_status field is bitwise OR'ed | ||
1038 | with the following possible bits ... | ||
1039 | |||
1040 | TP_STATUS_TS_SYS_HARDWARE | ||
1041 | TP_STATUS_TS_RAW_HARDWARE | ||
1042 | TP_STATUS_TS_SOFTWARE | ||
1043 | |||
1044 | ... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the | ||
1045 | RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set), | ||
1046 | then this means that a software fallback was invoked *within* PF_PACKET's | ||
1047 | processing code (less precise). | ||
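The RX_RING rule described above could be sketched as follows. This is an illustration only: the enum and the function rx_timestamp_source() are hypothetical names, and the order in which the bits are tested is an assumption that only matters if more than one bit were ever set.

```c
#include <stdint.h>
#include <linux/if_packet.h>	/* TP_STATUS_TS_* */

/* Hypothetical classification of where an RX frame's timestamp came from. */
enum ts_source {
	TS_SRC_FALLBACK,	/* no bit set: software fallback in PF_PACKET */
	TS_SRC_SOFTWARE,	/* TP_STATUS_TS_SOFTWARE */
	TS_SRC_SYS_HW,		/* TP_STATUS_TS_SYS_HARDWARE */
	TS_SRC_RAW_HW		/* TP_STATUS_TS_RAW_HARDWARE */
};

/* Map the tp_status word of a received frame to its timestamp source. */
static enum ts_source rx_timestamp_source(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_SRC_RAW_HW;
	if (tp_status & TP_STATUS_TS_SYS_HARDWARE)
		return TS_SRC_SYS_HW;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SRC_SOFTWARE;

	/* None of the three bits: less precise fallback timestamp. */
	return TS_SRC_FALLBACK;
}
```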
1048 | |||
1049 | Getting timestamps for the TX_RING works as follows: i) fill the ring frames, | ||
1050 | ii) call sendto(), e.g. in blocking mode, iii) wait for the status of the | ||
1051 | relevant frames to be updated, i.e. for the frames to be handed back to the | ||
1052 | application, iv) walk the frames to pick up the individual hw/sw timestamps. | ||
1053 | |||
1054 | Only (!) if transmit timestamping is enabled are these bits OR'ed into | ||
1055 | TP_STATUS_AVAILABLE, so your application must account for that: in a first | ||
1056 | step, check !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) to | ||
1057 | see if the frame belongs to the application, and in a second step extract | ||
1058 | the type of timestamp from tp_status! | ||
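The two-step check for TX_RING frames might look like the sketch below. It assumes a TPACKET_V2 ring whose frames are frame_size bytes apart; collect_tx_timestamps() and TS_BITS are hypothetical names, and a real application would additionally need the memory barriers that this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/if_packet.h>	/* struct tpacket2_hdr, TP_STATUS_* */

#define TS_BITS	(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_SYS_HARDWARE | \
		 TP_STATUS_TS_RAW_HARDWARE)

/*
 * Hypothetical helper: walk frame_nr TX frames starting at ring, print the
 * timestamp of every frame that is back in user space and carries one, and
 * return how many timestamps were found.
 */
static unsigned int collect_tx_timestamps(uint8_t *ring, unsigned int frame_nr,
					  unsigned int frame_size)
{
	unsigned int i, found = 0;

	for (i = 0; i < frame_nr; i++) {
		struct tpacket2_hdr *hdr =
			(struct tpacket2_hdr *) (ring + (size_t) i * frame_size);
		uint32_t status = hdr->tp_status;

		/* Step 1: skip frames that are still owned by the kernel. */
		if (status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
			continue;

		/* Step 2: without a timestamp bit, tp_sec/tp_nsec are invalid. */
		if (!(status & TS_BITS))
			continue;

		printf("frame %u: %u.%09u (%s)\n", i, hdr->tp_sec, hdr->tp_nsec,
		       (status & TP_STATUS_TS_SOFTWARE) ? "software" : "hardware");
		found++;
	}

	return found;
}
```

Frames that only report TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT are skipped in the second step, since their timestamp members do not hold a valid value.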
1059 | |||
1060 | If you don't care about timestamps and thus leave them disabled, checking for | ||
1061 | TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the | ||
1062 | TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec | ||
1063 | members do not contain a valid value. For TX_RINGs, by default no timestamp | ||
1064 | is generated! | ||
1036 | 1065 | ||
1037 | See include/linux/net_tstamp.h and Documentation/networking/timestamping | 1066 | See include/linux/net_tstamp.h and Documentation/networking/timestamping |
1038 | for more information on hardware timestamps. | 1067 | for more information on hardware timestamps. |
1039 | 1068 | ||
1040 | ------------------------------------------------------------------------------- | 1069 | ------------------------------------------------------------------------------- |
1041 | + Miscellaneous bits | 1070 | + Miscellaneous bits |
1042 | ------------------------------------------------------------------------------- | 1071 | ------------------------------------------------------------------------------- |
1043 | 1072 | ||
1044 | - Packet sockets work well together with Linux socket filters, thus you also | 1073 | - Packet sockets work well together with Linux socket filters, thus you also |
1045 | might want to have a look at Documentation/networking/filter.txt | 1074 | might want to have a look at Documentation/networking/filter.txt |
1046 | 1075 | ||
1047 | -------------------------------------------------------------------------------- | 1076 | -------------------------------------------------------------------------------- |
1048 | + THANKS | 1077 | + THANKS |
1049 | -------------------------------------------------------------------------------- | 1078 | -------------------------------------------------------------------------------- |
1050 | 1079 | ||
1051 | Jesse Brandeburg, for fixing my grammatical/spelling errors | 1080 | Jesse Brandeburg, for fixing my grammatical/spelling errors |
1052 | 1081 | ||
1053 | 1082 |