  .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
  
  ==================
  Kernel TLS offload
  ==================
  
  Kernel TLS operation
  ====================
  
  The Linux kernel provides TLS connection offload infrastructure. Once a TCP
  connection is in the ``ESTABLISHED`` state, user space can enable the TLS Upper
  Layer Protocol (ULP) and install the cryptographic connection state.
  For details regarding the user-facing interface refer to the TLS
  documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.
  
  ``ktls`` can operate in three modes:
  
   * Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
     In the most basic cases only crypto operations synchronous with the CPU
     can be used, but depending on the calling context the CPU may utilize
     asynchronous crypto accelerators. The use of accelerators introduces extra
     latency on socket reads (decryption only starts when a read syscall
     is made) and additional I/O load on the system.
   * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
     on a packet by packet basis, provided the packets arrive in order.
     This mode integrates best with the kernel stack and is described in detail
     in the remaining part of this document
     (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
   * Full TCP NIC offload mode (``TLS_HW_RECORD``) - a mode of operation where
     the NIC driver and firmware replace the kernel networking stack
     with their own TCP handling. It is not usable in production environments
     that make use of the Linux networking stack, for example its firewalling
     abilities, QoS or packet scheduling (``ethtool`` flag ``tls-hw-record``).
  
  The operation mode is selected automatically based on device configuration.
  Offload opt-in or opt-out on a per-connection basis is not currently supported.
  
  TX
  --
  
  At a high level, user write requests are turned into a scatter list. The TLS ULP
  intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
  mode) and then hands the modified scatter list to the TCP layer. From this
  point on the TCP stack proceeds as normal.
  
  In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
  Instead, packets reach the device driver, which marks the packets
  for crypto offload based on the socket the packet is attached to,
  and sends them to the device for encryption and transmission.
  
  RX
  --
  
  On the receive side, if the device handled decryption and authentication
  successfully, the driver will set the decrypted bit in the associated
  :c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
  are handled normally. ``ktls`` is informed when data is queued to the socket
  and the ``strparser`` mechanism is used to delineate the records. Upon a read
  request, records are retrieved from the socket and passed to the decryption
  routine. If the device decrypted all the segments of the record, decryption
  is skipped; otherwise the software path handles decryption.
  
  .. kernel-figure::  tls-offload-layers.svg
     :alt:	TLS offload layers
     :align:	center
     :figwidth:	28em
  
     Layers of Kernel TLS stack
  
  Device configuration
  ====================
  
  During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
  ``NETIF_F_HW_TLS_TX`` features and installs its
  :c:type:`struct tlsdev_ops <tlsdev_ops>`
  pointer in the :c:member:`tlsdev_ops` member of the
  :c:type:`struct net_device <net_device>`.
  
  When TLS cryptographic connection state is installed on a ``ktls`` socket
  (note that it is done twice, once for the RX and once for the TX direction,
  and the two are completely independent), the kernel checks if the underlying
  network device is offload-capable and attempts the offload. If the offload
  fails, the connection is handled entirely in software using the same mechanism
  as if the offload had never been tried.
  
  The offload request is performed via the :c:member:`tls_dev_add` callback of
  :c:type:`struct tlsdev_ops <tlsdev_ops>`:
  
  .. code-block:: c
  
  	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
  			   enum tls_offload_ctx_dir direction,
  			   struct tls_crypto_info *crypto_info,
  			   u32 start_offload_tcp_sn);
  
  ``direction`` indicates whether the cryptographic information is for
  the received or transmitted packets. The driver uses the ``sk`` parameter
  to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
  Cryptographic information in ``crypto_info`` includes the key, IV and salt
  as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
  which TCP sequence number corresponds to the beginning of the record with
  the sequence number from ``crypto_info``. The driver can add its state
  at the end of kernel structures (see :c:member:`driver_state` members
  in ``include/net/tls.h``) to avoid additional allocations and pointer
  dereferences.
  
  TX
  --
  
  After TX state is installed, the stack guarantees that the first segment
  of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
  number, simplifying TCP sequence number matching.
  
  TX offload being fully initialized does not imply that all segments passing
  through the driver and which belong to the offloaded socket will be after
  the expected sequence number and will have kernel record information.
  In particular, already encrypted data may have been queued to the socket
  before installing the connection state in the kernel.
  
  RX
  --
  
  In the RX direction the local networking stack has little control over the
  segmentation, so the initial record's TCP sequence number may be anywhere
  inside the segment.
  
  Normal operation
  ================
  
  At a minimum the device maintains the following state for each connection, in
  each direction:
  
   * crypto secrets (key, iv, salt)
   * crypto processing state (partial blocks, partial authentication tag, etc.)
   * record metadata (sequence number, processing offset and length)
   * expected TCP sequence number
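  
  As an illustration only, the state listed above could be grouped into
  a per-connection, per-direction context such as the one below. All type,
  field and function names are hypothetical, the buffer sizes assume
  AES-GCM-128, and a real device keeps this state in a hardware-specific form.
  
  .. code-block:: c
  
    #include <stdint.h>
    #include <string.h>
    
    /* Hypothetical per-connection, per-direction context (AES-GCM-128 sizes). */
    struct demo_tls_ctx {
        /* crypto secrets */
        uint8_t  key[16];
        uint8_t  iv[8];
        uint8_t  salt[4];
        /* crypto processing state carried from segment to segment */
        uint8_t  partial_block[16];
        uint8_t  partial_block_len;
        uint8_t  partial_tag[16];
        uint8_t  partial_tag_len;
        /* record metadata */
        uint64_t rec_seq;     /* TLS record sequence number */
        uint32_t rec_offset;  /* bytes of the current record already processed */
        uint32_t rec_len;     /* total length of the current record */
        /* TCP sequence number expected on the next in-order segment */
        uint32_t expected_tcp_sn;
    };
    
    /* Fill the context from the values handed over when offload is requested. */
    static void demo_tls_ctx_init(struct demo_tls_ctx *ctx,
                                  const uint8_t key[16], const uint8_t iv[8],
                                  const uint8_t salt[4], uint64_t rec_seq,
                                  uint32_t start_offload_tcp_sn)
    {
        memset(ctx, 0, sizeof(*ctx));
        memcpy(ctx->key, key, 16);
        memcpy(ctx->iv, iv, 8);
        memcpy(ctx->salt, salt, 4);
        ctx->rec_seq = rec_seq;
        ctx->expected_tcp_sn = start_offload_tcp_sn;
    }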
  
  There are no guarantees on record length or record segmentation. In particular
  segments may start at any point of a record and contain any number of records.
  Assuming segments are received in order, the device should be able to perform
  crypto operations and authentication regardless of segmentation. For this
  to be possible the device has to keep a small amount of segment-to-segment
  state. This includes at least:
  
   * partial headers (if a segment carried only a part of the TLS header)
   * partial data block
   * partial authentication tag (all data has been seen but part of the
     authentication tag has to be written or read from the subsequent segment)
  
  Record reassembly is not necessary for TLS offload. If the packets arrive
  in order the device should be able to handle them separately and make
  forward progress.
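  
  The record delineation part of this can be sketched in a few lines of C.
  The parser below is a simplified model rather than device code: it keeps
  only a partial header and the number of record bytes still outstanding,
  and all names are hypothetical. It assumes the standard 5-byte TLS record
  header (type, two version bytes, big-endian length) and in-order input.
  
  .. code-block:: c
  
    #include <stddef.h>
    #include <stdint.h>
    
    #define TLS_HDR_LEN 5  /* type (1) + version (2) + length (2, big endian) */
    
    /* Hypothetical segment-to-segment state, record delineation only. */
    struct demo_rec_parser {
        uint8_t  hdr[TLS_HDR_LEN]; /* partial header collected so far */
        uint8_t  hdr_have;         /* number of valid bytes in hdr[] */
        uint32_t body_left;        /* record bytes (payload + tag) still expected */
        uint64_t records_done;     /* records seen completely */
    };
    
    /* Feed one in-order TCP payload; segment boundaries may fall anywhere. */
    static void demo_rec_feed(struct demo_rec_parser *p,
                              const uint8_t *data, size_t len)
    {
        size_t i = 0;
    
        while (i < len) {
            if (p->body_left == 0 && p->hdr_have < TLS_HDR_LEN) {
                /* Still collecting a header, possibly across segments. */
                p->hdr[p->hdr_have++] = data[i++];
                if (p->hdr_have == TLS_HDR_LEN) {
                    p->body_left = ((uint32_t)p->hdr[3] << 8) | p->hdr[4];
                    if (p->body_left == 0) {
                        p->hdr_have = 0;
                        p->records_done++;
                    }
                }
                continue;
            }
            /* Consume as much of the record body as this segment holds. */
            size_t take = len - i;
            if (take > p->body_left)
                take = p->body_left;
            i += take;
            p->body_left -= (uint32_t)take;
            if (p->body_left == 0) {
                p->hdr_have = 0;
                p->records_done++;
            }
        }
    }
  
  Feeding the same byte stream in differently sized pieces leaves the parser
  in the same final state, which is the property the text above asks for.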
  
  TX
  --
  
  The kernel stack performs record framing, reserving space for the
  authentication tag and populating all other TLS header and trailer fields.
  
  Both the device and the driver maintain expected TCP sequence numbers
  due to the possibility of retransmissions and the lack of software fallback
  once the packet reaches the device.
  For segments passed in order, the driver marks the packets with
  a connection identifier (note that a 5-tuple lookup is insufficient to identify
  packets requiring HW offload, see the :ref:`5tuple_problems` section)
  and hands them to the device. The device identifies the packet as requiring
  TLS handling and confirms the sequence number matches its expectation.
  The device performs encryption and authentication of the record data.
  It replaces the authentication tag and TCP checksum with correct values.
  
  RX
  --
  
  Before a packet is DMAed to the host (but after the NIC's embedded switching
  and packet transformation functions) the device validates the Layer 4
  checksum and performs a 5-tuple lookup to find any TLS connection the packet
  may belong to (technically a 4-tuple lookup is sufficient - IP addresses
  and TCP port numbers, as the protocol is always TCP). If a connection is
  matched, the device confirms that the TCP sequence number is the expected one
  and proceeds to TLS handling (record delineation, decryption, authentication
  for each record in the packet). The device leaves the record framing
  unmodified; the stack takes care of record decapsulation. The device indicates
  successful handling of TLS offload in the per-packet context (descriptor)
  passed to the host.
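  
  The checks performed before TLS handling can be modelled as follows.
  This is a minimal sketch under stated assumptions: it is IPv4-only, uses
  a linear table search where a real device would use something faster, and
  every name in it is hypothetical. A sequence number mismatch simply yields
  "do not offload" here; resynchronization is described in a later section.
  
  .. code-block:: c
  
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    
    /* Hypothetical IPv4 connection key; a 4-tuple suffices, protocol is TCP. */
    struct demo_flow_key {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
    };
    
    struct demo_rx_conn {
        struct demo_flow_key key;
        uint32_t expected_tcp_sn;
        bool in_use;
    };
    
    static struct demo_rx_conn *demo_rx_lookup(struct demo_rx_conn *tbl, size_t n,
                                               const struct demo_flow_key *k)
    {
        for (size_t i = 0; i < n; i++) {
            if (tbl[i].in_use &&
                tbl[i].key.saddr == k->saddr && tbl[i].key.daddr == k->daddr &&
                tbl[i].key.sport == k->sport && tbl[i].key.dport == k->dport)
                return &tbl[i];
        }
        return NULL;
    }
    
    /*
     * Returns true if the packet may be handed to TLS processing. On false
     * the packet is passed to the host exactly as it was received.
     */
    static bool demo_rx_may_offload(struct demo_rx_conn *tbl, size_t n,
                                    const struct demo_flow_key *k,
                                    bool l4_csum_ok, uint32_t tcp_sn,
                                    uint32_t payload_len)
    {
        struct demo_rx_conn *c;
    
        if (!l4_csum_ok)
            return false;
        c = demo_rx_lookup(tbl, n, k);
        if (!c || c->expected_tcp_sn != tcp_sn)
            return false;
        c->expected_tcp_sn = tcp_sn + payload_len; /* wraps modulo 2^32 */
        return true;
    }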
  
  Upon reception of a TLS offloaded packet, the driver sets
  the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
  corresponding to the segment. The networking stack makes sure decrypted
  and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
  and takes care of partial decryption.
  
  Resync handling
  ===============
  
  In the presence of packet drops or network packet reordering, the device may lose
  synchronization with the TLS stream, and require a resync with the kernel's
  TCP stack.
  
  Note that resync is only attempted for connections which were successfully
  added to the device table and are in ``TLS_HW`` mode. For example,
  if the table was full when cryptographic state was installed in the kernel,
  such a connection will never get offloaded. Therefore the resync request
  does not carry any cryptographic connection state.
  
  TX
  --
  
  Segments transmitted from an offloaded socket can get out of sync
  in similar ways to the receive side: retransmissions and local drops
  are possible, though network reorders are not. There are currently
  two mechanisms for dealing with out of order segments.
  
  Crypto state rebuilding
  ~~~~~~~~~~~~~~~~~~~~~~~
  
  Whenever an out of order segment is transmitted the driver provides
  the device with enough information to perform cryptographic operations.
  Most likely this means that the part of the record preceding the current
  segment has to be passed to the device as part of the packet context,
  together with its TCP sequence number and TLS record number. The device
  can then initialize its crypto state, process and discard the preceding
  data (to be able to insert the authentication tag) and move on to handling
  the actual packet.
  
  In this mode, depending on the implementation, the driver can either ask
  for a continuation with the crypto state and the new sequence number
  (next expected segment is the one after the out of order one), or continue
  with the previous stream state - assuming that the out of order segment
  was just a retransmission. The former is simpler and does not require
  retransmission detection, therefore it is the recommended method until
  such time as it is proven inefficient.
  
  Next record sync
  ~~~~~~~~~~~~~~~~
  
  Whenever an out of order segment is detected the driver requests
  that the ``ktls`` software fallback code encrypt it. If the segment's
  sequence number is lower than expected the driver assumes retransmission
  and doesn't change device state. If the segment is in the future, it
  may imply a local drop; the driver then asks the stack to sync the device
  to the next record state and falls back to software.
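  
  The three cases can be expressed as a small classification helper.
  The sketch below is not taken from any driver; the enum and function names
  are hypothetical, and the comparison uses the usual signed-difference trick
  so that it stays correct when the 32-bit TCP sequence number wraps.
  
  .. code-block:: c
  
    #include <stdint.h>
    
    enum demo_tx_action {
        DEMO_TX_OFFLOAD,         /* in order: let the device encrypt */
        DEMO_TX_FALLBACK,        /* assumed retransmission: software encrypts */
        DEMO_TX_FALLBACK_RESYNC, /* gap: software encrypts, resync is requested */
    };
    
    /* Wraparound-safe "a is before b" for 32-bit TCP sequence numbers. */
    static int demo_seq_before(uint32_t a, uint32_t b)
    {
        return (int32_t)(a - b) < 0;
    }
    
    static enum demo_tx_action demo_tx_classify(uint32_t seg_seq,
                                                uint32_t expected_seq)
    {
        if (seg_seq == expected_seq)
            return DEMO_TX_OFFLOAD;
        if (demo_seq_before(seg_seq, expected_seq))
            return DEMO_TX_FALLBACK;
        return DEMO_TX_FALLBACK_RESYNC;
    }
  
  The helper only classifies a segment; advancing the expected sequence number
  for in-order segments is left to the caller.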
  
  The resync request is indicated with:
  
  .. code-block:: c
  
    void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq);
  
  Until resync is complete the driver should not access its expected TCP
  sequence number (as it will be updated from a different context).
  The following helper should be used to test if resync is complete:
  
  .. code-block:: c
  
    bool tls_offload_tx_resync_pending(struct sock *sk);
  
  Next time ``ktls`` pushes a record it will first send its TCP sequence number
  and TLS record number to the driver. The stack will also make sure that
  the new record will start on a segment boundary (like it does when
  the connection is initially added).
  
  RX
  --
  
  A small number of RX reorder events may not require a full resynchronization.
  In particular the device should not lose synchronization
  when the record boundary can be recovered:
  
  .. kernel-figure::  tls-offload-reorder-good.svg
     :alt:	reorder of non-header segment
     :align:	center
  
     Reorder of non-header segment
  
  Green segments are successfully decrypted, blue ones are passed
  as received on wire, red stripes mark start of new records.
  
  In the above case segment 1 is received and decrypted successfully.
  Segment 2 was dropped so 3 arrives out of order. The device knows
  the next record starts inside 3, based on the record length in segment 1.
  Segment 3 is passed untouched, because due to the lack of data from segment 2
  the remainder of the previous record inside segment 3 cannot be handled.
  The device can, however, collect the authentication algorithm's state
  and partial block from the new record in segment 3 and when 4 and 5
  arrive continue decryption. Finally when 2 arrives it's completely outside
  of the expected window of the device so it's passed as is without special
  handling. ``ktls`` software fallback handles the decryption of the record
  spanning segments 1, 2 and 3. The device did not get out of sync,
  even though two segments did not get decrypted.
  
  Kernel synchronization may be necessary if the lost segment contained
  a record header and arrived after the next record header has already passed:
  
  .. kernel-figure::  tls-offload-reorder-bad.svg
     :alt:	reorder of header segment
     :align:	center
  
     Reorder of segment with a TLS header
  
  In this example segment 2 gets dropped, and it contains a record header.
  The device can only detect that segment 4 also contains a TLS header
  if it knows the length of the previous record from segment 2. In this case
  the device will lose synchronization with the stream.
  
  Stream scan resynchronization
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  When the device gets out of sync and the stream reaches TCP sequence
  numbers more than a max size record past the expected TCP sequence number,
  the device starts scanning for a known header pattern. For example
  for TLS 1.2 and TLS 1.3 consecutive bytes of value ``0x03 0x03`` occur
  in the SSL/TLS version field of the header. Once the pattern is matched
  the device continues attempting to parse headers at expected locations
  (based on the length fields at guessed locations).
  Whenever the expected location does not contain a valid header the scan
  is restarted.
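  
  A software model of such a scan might look like the sketch below. It works
  on a flat buffer rather than on a live packet stream, treats a header as
  plausible when the version bytes are ``0x03 0x03`` and the length is
  non-zero and within the TLS 1.2 ciphertext limit, and accepts a location
  once ``chain`` headers follow one another. The function names, the
  ``chain`` parameter and the plausibility rule are assumptions made for
  illustration.
  
  .. code-block:: c
  
    #include <stddef.h>
    #include <stdint.h>
    
    #define TLS_HDR_LEN  5
    #define DEMO_MAX_LEN (16384 + 2048) /* TLS 1.2 limit; TLS 1.3 is lower */
    
    static int demo_hdr_plausible(const uint8_t *p)
    {
        uint32_t len = ((uint32_t)p[3] << 8) | p[4];
    
        return p[1] == 0x03 && p[2] == 0x03 && len > 0 && len <= DEMO_MAX_LEN;
    }
    
    /*
     * Find the first offset at which `chain` plausible headers follow one
     * another at the distances given by their length fields. Returns -1 if
     * no such offset exists in the buffer.
     */
    static long demo_scan(const uint8_t *buf, size_t len, unsigned int chain)
    {
        for (size_t start = 0; start + TLS_HDR_LEN <= len; start++) {
            size_t off = start;
            unsigned int ok = 0;
    
            while (ok < chain && off + TLS_HDR_LEN <= len &&
                   demo_hdr_plausible(buf + off)) {
                off += TLS_HDR_LEN +
                       (((size_t)buf[off + 3] << 8) | buf[off + 4]);
                ok++;
            }
            if (ok == chain)
                return (long)start;
            /* Expected location held no valid header: restart the scan. */
        }
        return -1;
    }
  
  A match found this way is still only a guess until the kernel confirms it.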
  
  When the header is matched the device sends a confirmation request
  to the kernel, asking if the guessed location is correct (if a TLS record
  really starts there), and which record sequence number the given header had.
  The kernel confirms the guessed location was correct and tells the device
  the record sequence number. Meanwhile, the device has been parsing
  and counting all records since the just-confirmed one, so it adds the number
  of records it has seen to the record number provided by the kernel.
  At this point the device is in sync and can resume decryption at the next
  segment boundary.
  
  In a pathological case the device may latch onto a sequence of matching
  headers and never hear back from the kernel (there is no negative
  confirmation from the kernel). The implementation may choose to periodically
  restart the scan. Given how unlikely a falsely-matching stream is, however,
  a periodic restart is not deemed necessary.
  
  Special care has to be taken if the confirmation request is passed
  asynchronously to the packet stream, as the record may get processed
  by the kernel before the confirmation request.
  
  Stack-driven resynchronization
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  The driver may also request the stack to perform resynchronization
  whenever it sees the records are no longer getting decrypted.
  If the connection is configured in this mode the stack automatically
  schedules resynchronization after it has received two completely encrypted
  records.
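  
  The retry schedule used in this mode (an attempt after 2 fully encrypted
  records, then after 4 more, 8 more and so on, capped at every 128 records)
  can be modelled with a counter and a doubling threshold. The sketch below
  only models that schedule and is not the kernel implementation; the names
  are hypothetical, and resetting the schedule as soon as a record is no
  longer fully encrypted is an assumption.
  
  .. code-block:: c
  
    #include <stdbool.h>
    #include <stdint.h>
    
    #define DEMO_BACKOFF_MIN 2
    #define DEMO_BACKOFF_MAX 128
    
    struct demo_resync_state {
        uint32_t encrypted_run; /* fully encrypted records since last attempt */
        uint32_t threshold;     /* records to wait before the next attempt */
    };
    
    static void demo_resync_init(struct demo_resync_state *s)
    {
        s->encrypted_run = 0;
        s->threshold = DEMO_BACKOFF_MIN;
    }
    
    /* Call once per received record; true means a resync should be tried. */
    static bool demo_resync_on_record(struct demo_resync_state *s,
                                      bool fully_encrypted)
    {
        if (!fully_encrypted) {
            /* Assumed: the device works again, restart from the shortest delay. */
            demo_resync_init(s);
            return false;
        }
        if (++s->encrypted_run < s->threshold)
            return false;
        s->encrypted_run = 0;
        if (s->threshold < DEMO_BACKOFF_MAX)
            s->threshold *= 2;
        return true;
    }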
  
  The stack waits for the socket to drain and informs the device about
  the next expected record number and its TCP sequence number. If the
  records continue to be received fully encrypted, the stack retries the
  synchronization with an exponential back off (first after 2 encrypted
  records, then after 4 records, after 8, after 16... up until every
  128 records).
  
  Error handling
  ==============
  
  TX
  --
  
  Packets may be redirected or rerouted by the stack to a different
  device than the selected TLS offload device. The stack will handle
  such a condition using the :c:func:`sk_validate_xmit_skb` helper
  (TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
  Offload maintains information about all records until the data is
  fully acknowledged, so if skbs reach the wrong device they can be handled
  by software fallback.
  
  Any device TLS offload handling error on the transmission side must result
  in the packet being dropped. For example, if a packet got out of order
  due to a bug in the stack or the device, reached the device and can't
  be encrypted, such a packet must be dropped.
  
  RX
  --
  
  If the device encounters any problems with TLS offload on the receive
  side it should pass the packet to the host's networking stack as it was
  received on the wire.
  
  For example, authentication failure for any record in the segment should
  result in passing the unmodified packet to the software fallback. This means
  packets should not be modified "in place". Splitting segments to handle partial
  decryption is not advised. In other words, either all records in the packet
  have been handled successfully and authenticated, or the packet has to be passed
  to the host's stack as it was on the wire (recovering the original packet in
  the driver, if the device provides a precise error, is sufficient).
  
  The Linux networking stack does not provide a way of reporting per-packet
  decryption and authentication errors; packets with errors must simply not
  have the :c:member:`decrypted` mark set.
  
  A packet should also not be handled by the TLS offload if it contains
  incorrect checksums.
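  
  Taken together, the RX rules above reduce to an all-or-nothing decision
  per packet. The following sketch shows one way a driver could derive it
  from a per-packet result; the structure, its fields and the function name
  are hypothetical and do not correspond to any real descriptor layout.
  
  .. code-block:: c
  
    #include <stdbool.h>
    #include <stddef.h>
    
    /* Hypothetical per-packet result reported by the device. */
    struct demo_rx_result {
        bool   csum_ok;      /* checksums in the packet were correct */
        size_t records_seen; /* TLS records (or parts of records) in the packet */
        size_t records_ok;   /* of those, decrypted and authenticated */
    };
    
    /*
     * Decide whether the decrypted mark may be set. Any failure means the
     * packet must reach the stack as it was on the wire, without the mark.
     */
    static bool demo_rx_mark_decrypted(const struct demo_rx_result *r)
    {
        if (!r->csum_ok)
            return false;
        if (r->records_seen == 0)
            return false;
        return r->records_ok == r->records_seen;
    }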
  
  Performance metrics
  ===================
  
  TLS offload can be characterized by the following basic metrics:
  
   * max connection count
   * connection installation rate
   * connection installation latency
   * total cryptographic performance
  
  Note that each TCP connection requires a TLS session in both directions;
  the performance may be reported treating each direction separately.
  
  Max connection count
  --------------------
  
  The number of connections a device can support can be exposed via
  the ``devlink resource`` API.
  
  Total cryptographic performance
  -------------------------------
  
  Offload performance may depend on segment and record size.
  
  Overload of the cryptographic subsystem of the device should not have
  significant performance impact on non-offloaded streams.
  
  Statistics
  ==========
  
  The following minimum set of TLS-related statistics should be reported
  by the driver:
  
   * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
     which were part of a TLS stream.
   * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
     which were successfully decrypted.
   * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
     for encryption of their TLS payload.
   * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
     passed to the device for encryption.
   * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
     encryption.
   * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
     but did not arrive in the expected order.
   * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
     a TLS stream and arrived out-of-order, but skipped the HW offload routine
     and went to the regular transmit flow as they were retransmissions of the
     connection handshake.
   * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
     a TLS stream and were dropped because they arrived out of order and the
     associated record could not be found.
   * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
     stream and were dropped because they contained both data that had been
     encrypted by software and data that expects hardware crypto offload.
  
  Notable corner cases, exceptions and additional requirements
  ============================================================
  
  .. _5tuple_problems:
  
  5-tuple matching limitations
  ----------------------------
  
  The device can only recognize received packets based on the 5-tuple
  of the socket. The current ``ktls`` implementation will not offload sockets
  routed through software interfaces such as those used for tunneling
  or virtual networking. However, many packet transformations performed
  by the networking stack (most notably any BPF logic) do not require
  any intermediate software device, therefore a 5-tuple match may
  consistently miss at the device level. In such cases the device
  should still be able to perform TX offload (encryption) and should
  fall back cleanly to software decryption (RX).
  
  Out of order
  ------------
  
  Introducing extra processing in NICs should not cause packets to be
  transmitted or received out of order. For example, pure ACK packets
  should not be reordered with respect to data segments.
  
  Ingress reorder
  ---------------
  
  A device is permitted to perform packet reordering for consecutive
  TCP segments (i.e. placing packets in the correct order) but any form
  of additional buffering is disallowed.
  
  Coexistence with standard networking offload features
  -----------------------------------------------------
  
  Offloaded ``ktls`` sockets should support standard TCP stack features
  transparently. Enabling device TLS offload should not cause any difference
  in packets as seen on the wire.
  
  Transport layer transparency
  ----------------------------
  
  The device should not modify any packet headers for the purpose
  of simplifying TLS offload.
  
  The device should not depend on any packet headers beyond what is strictly
  necessary for TLS offload.
  
  Segment drops
  -------------
  
  Dropping packets is acceptable only in the event of catastrophic
  system errors and should never be used as an error handling mechanism
  in cases arising from normal operation. In other words, reliance
  on TCP retransmissions to handle corner cases is not acceptable.
  
  TLS device features
  -------------------
  
  Drivers should ignore changes to the TLS device feature flags.
  These flags will be acted upon accordingly by the core ``ktls`` code.
  TLS device feature flags only control adding of new TLS connection
  offloads; old connections will remain active after flags are cleared.