22 Mar, 2020

1 commit

  • sockmap performs lockless writes to sk->sk_prot on the following paths:

    tcp_bpf_{recvmsg|sendmsg} / sock_map_unref
      sk_psock_put
        sk_psock_drop
          sk_psock_restore_proto
            WRITE_ONCE(sk->sk_prot, proto)

    To prevent load/store tearing [1], and to make tooling aware of intentional
    shared access [2], we need to annotate other sites that access sk_prot with
    READ_ONCE/WRITE_ONCE macros.

    Change done with Coccinelle with following semantic patch:

    @@
    expression E;
    identifier I;
    struct sock *sk;
    identifier sk_prot =~ "^sk_prot$";
    @@
    (
    E =
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    |
    -sk->sk_prot = E
    +WRITE_ONCE(sk->sk_prot, E)
    |
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    ->I
    )
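
    For illustration, the third pattern turns a plain dereference such as
    (hypothetical call site, not a hunk from the patch):

        sk->sk_prot->unhash(sk);

    into:

        READ_ONCE(sk->sk_prot)->unhash(sk);

    while plain assignments to sk->sk_prot become WRITE_ONCE() stores.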

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     

20 Feb, 2020

1 commit

  • Current code doesn't check whether the TCP sequence number starts at
    (or after) the 1st record's start sequence number. It only checks
    that the sequence number is before the 1st record's end sequence
    number. This problem will always be a possibility in the re-transmit
    case: if the record a requested sequence number belongs to has
    already been deleted, tls_get_record will start looking through the
    list and, per the existing check, only verify that the sequence
    number is before the end seq of the 1st record. That is always true,
    so the 1st record is always returned when NULL should in fact be
    returned.
    As part of the fix, walk the records only if the sequence number
    lies within the list, else return NULL.
    One more check is added: the driver looks for the start marker
    record to handle TCP packets which are before the tls offload start
    sequence number, hence return the 1st record if it is the tls start
    marker and the sequence number is before the 1st record's starting
    sequence number.
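
    A rough sketch of the resulting lookup logic (simplified; helper
    names follow net/tls, this is not the exact hunk):

        info = list_first_entry_or_null(&context->records_list,
                                        struct tls_record_info, list);
        if (!info)
                return NULL;

        /* Packets before the offload start sequence are matched by the
         * start marker record.  Otherwise the requested seq must lie
         * within [start of 1st record, end of last record]. */
        if (!tls_record_is_start_marker(info)) {
                last = list_last_entry(&context->records_list,
                                       struct tls_record_info, list);
                if (!between(seq, tls_record_start_seq(info),
                             last->end_seq))
                        return NULL;
        }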

    Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
    Signed-off-by: Rohit Maheshwari
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Rohit Maheshwari
     

20 Dec, 2019

1 commit


07 Dec, 2019

1 commit


10 Nov, 2019

1 commit


07 Nov, 2019

2 commits

  • TLS TX needs to release and re-acquire the socket lock if send buffer
    fills up.

    TLS SW TX path currently depends on only allowing one thread to enter
    the function by the abuse of sk_write_pending. If another writer is
    already waiting for memory no new ones are allowed in.

    This has two problems:
    - writers don't wake other threads up when they leave the kernel;
      meaning that this scheme works for a single extra thread (second
      application thread or delayed work), because memory becoming
      available will send a wake up request, but as Mallesham and
      Pooja report, with a larger number of threads it leads to threads
      being put to sleep indefinitely;
    - the delayed work does not get _scheduled_ but it may _run_ when
      other writers are present, leading to crashes as writers don't
      expect state to change under their feet (same records get pushed
      and freed multiple times); it's hard to reliably bail from the
      work, however, because the mere presence of a writer does not
      guarantee that the writer will push pending records before exiting.

    Ensuring wakeups always happen will make the code basically open
    code a mutex. Just use a mutex.

    The TLS HW TX path does not have any locking (not even the
    sk_write_pending hack), yet it uses a per-socket sg_tx_data
    array to push records.
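
    A minimal sketch of the resulting serialization (tx_lock stands for
    the per-context mutex described above; call sites are simplified):

        struct tls_context *tls_ctx = tls_get_ctx(sk);

        mutex_lock(&tls_ctx->tx_lock);
        lock_sock(sk);
        /* ... existing sendmsg/sendpage body, which may drop and
         * re-acquire the socket lock while waiting for memory ... */
        release_sock(sk);
        mutex_unlock(&tls_ctx->tx_lock);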

    Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
    Reported-by: Mallesham Jatharakonda
    Reported-by: Pooja Trivedi
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • sk_write_pending being not zero does not guarantee that partial
    record will be pushed. If the thread waiting for memory times out
    the pending record may get stuck.

    In case of tls_device there is no path where a partial record is
    set and a writer is present in the first place. Partial record is
    set only in tls_push_sg(), and tls_push_sg() will return an
    error immediately. All tls_device callers of tls_push_sg()
    will return (and not wait for memory) if it failed.

    Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

07 Oct, 2019

3 commits


06 Oct, 2019

3 commits

  • Add a statistic for number of RX resyncs sent down to the NIC.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add a tracepoint to the TLS offload's fast path. This tracepoint
    can be used to track the decrypted and encrypted status of received
    records. Records decrypted by the device should have decrypted set
    to 1; records which have neither decrypted nor encrypted set are
    partially decrypted, require re-encryption and are therefore the
    most expensive to deal with.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add tracing of device-related interaction to aid performance
    analysis, especially around resync:

    tls:tls_device_offload_set
    tls:tls_device_rx_resync_send
    tls:tls_device_rx_resync_nh_schedule
    tls:tls_device_rx_resync_nh_delay
    tls:tls_device_tx_resync_req
    tls:tls_device_tx_resync_send

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

08 Sep, 2019

4 commits

  • Unlike normal TCP code TLS has to touch the cache lines
    it copies into to fill header info. On memory-heavy workloads
    having non temporal stores and normal accesses targeting
    the same cache line leads to significant overhead.

    Measured 3% overhead running 3600 round robin connections
    with additional memory heavy workload.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • For TLS device offload the tag/message authentication code is
    filled in by the device. The kernel merely reserves space for
    it. Because the device overwrites it, the contents of the tag
    do not matter. Current code tries to save space by reusing the
    header as the tag. This, however, leads to an additional frag
    being created and defeats buffer coalescing (which trickles
    all the way down to the drivers).

    Remove this optimization, and try to allocate the space for
    the tag in the usual way, leaving the memory uninitialized.
    If memory allocation fails, rewind the record pointer so that
    the already copied user data is used as the tag.

    Note that the optimization was actually buggy: the tag
    for TLS 1.2 is 16 bytes, but the header is just 13, so the reuse
    may have looked past the end of the page.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • All modifications to the TLS record list happen under the socket
    lock. Since records form an ordered queue, readers are only
    concerned about elements being removed; additions can happen
    concurrently.

    Use RCU primitives to ensure the correct access types
    (READ_ONCE/WRITE_ONCE).
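
    A minimal sketch of the access pattern this describes (names are
    illustrative, not the exact tls_device code):

        /* Writer, under the socket lock: publish a new record at the
         * tail of the queue. */
        list_add_tail_rcu(&rec->list, &ctx->records_list);

        /* Reader walking the queue while new records may be appended
         * concurrently. */
        rcu_read_lock();
        list_for_each_entry_rcu(rec, &ctx->records_list, list) {
                /* inspect rec */
        }
        rcu_read_unlock();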

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • It's generally more cache friendly to walk arrays in order,
    especially those which are likely not in cache.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

05 Sep, 2019

3 commits

  • If the retransmit record hint falls into the cleanup window we will
    free it by simply walking the list. No need to duplicate the code.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Hurley
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • On setsockopt path we need to hold device_offload_lock from
    the moment we check netdev is up until the context is fully
    ready to be added to the tls_device_list.

    No need to hold it around the get_netdev_for_sock().
    Change the code and remove the confusing comment.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Hurley
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Reusing parts of the error path for the normal exit will make the
    next commit harder to read; untangle the two.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Hurley
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

01 Sep, 2019

1 commit

  • We need to make sure context does not get freed while diag
    code is interrogating it. Free struct tls_context with
    kfree_rcu().

    We add the __rcu annotation directly in icsk, and cast it
    away in the datapath accessor. Presumably all ULPs will
    do a similar thing.
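
    The datapath accessor ends up looking roughly like this (close to,
    but not necessarily verbatim, the net/tls helper):

        static inline struct tls_context *tls_get_ctx(const struct sock *sk)
        {
                const struct inet_connection_sock *icsk = inet_csk(sk);

                /* icsk_ulp_data is now __rcu-annotated for the diag
                 * path; the datapath owns the socket, so the annotation
                 * is cast away here. */
                return (__force void *)icsk->icsk_ulp_data;
        }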

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

20 Aug, 2019

1 commit


09 Aug, 2019

1 commit

  • sk_validate_xmit_skb() and drivers depend on the sk member of
    struct sk_buff to identify segments requiring encryption.
    Any operation which removes or does not preserve the original TLS
    socket such as skb_orphan() or skb_clone() will cause clear text
    leaks.

    Make the TCP socket underlying an offloaded TLS connection
    mark all skbs as decrypted if TLS TX is in offload mode.
    Then in sk_validate_xmit_skb() catch skbs which have no socket
    (or a socket with no validation) but have the decrypted flag set.
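
    The check in sk_validate_xmit_skb() then looks roughly like this
    (simplified sketch of the pattern, not the exact hunk):

        struct sock *sk = skb->sk;

        if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
                skb = sk->sk_validate_xmit_skb(sk, dev, skb);
        #ifdef CONFIG_TLS_DEVICE
        } else if (unlikely(skb->decrypted)) {
                pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
                kfree_skb(skb);
                skb = NULL;
        #endif
        }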

    Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
    sk->sk_validate_xmit_skb are slightly interchangeable right now,
    they all imply TLS offload. The new checks are guarded by
    CONFIG_TLS_DEVICE because that's the option guarding the
    sk_buff->decrypted member.

    Second, smaller issue with orphaning is that it breaks
    the guarantee that packets will be delivered to device
    queues in-order. All TLS offload drivers depend on that
    scheduling property. This means skb_orphan_partial()'s
    trick of preserving partial socket references will cause
    issues in the drivers. We need a full orphan, and as a
    result netem delay/throttling will cause all TLS offload
    skbs to be dropped.

    Reusing the sk_buff->decrypted flag also protects from
    leaking clear text when an incoming, decrypted skb is redirected
    (e.g. by TC).

    See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
    through ULP") for justification why the internal flag is safe.
    The only location which could leak the flag in is tcp_bpf_sendmsg(),
    which is taken care of by clearing the previously unused bit.

    v2:
    - remove superfluous decrypted mark copy (Willem);
    - remove the stale doc entry (Boris);
    - rely entirely on EOR marking to prevent coalescing (Boris);
    - use an internal sendpages flag instead of marking the socket
    (Boris).
    v3 (Willem):
    - reorganize the can_skb_orphan_partial() condition;
    - fix the flag leak-in through tcp_bpf_sendmsg.

    Signed-off-by: Jakub Kicinski
    Acked-by: Willem de Bruijn
    Reviewed-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

31 Jul, 2019

1 commit


23 Jul, 2019

1 commit


09 Jul, 2019

3 commits

  • Turns out TLS_TX in HW offload mode does not initialize tls_prot_info.
    Since commit 9cd81988cce1 ("net/tls: use version from prot") we actually
    use this field on the datapath. Luckily we always compare it to TLS 1.3,
    and assume 1.2 otherwise. So since zero is not equal to 1.3, everything
    worked fine.

    Fixes: 9cd81988cce1 ("net/tls: use version from prot")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Introduce a return code for the tls_dev_resync callback.

    When the driver TX resync fails, the kernel can retry the resync
    until it succeeds. This prevents drivers from attempting to offload
    TLS packets if the connection is known to be out of sync.

    We don't worry about the RX resync since it will be retried naturally
    as more encrypted records get received.

    Signed-off-by: Dirk van der Merwe
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Dirk van der Merwe
     
  • Two cases of overlapping changes, nothing fancy.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Jul, 2019

2 commits

  • Commit 86029d10af18 ("tls: zero the crypto information from tls_context
    before freeing") added memzero_explicit() calls to clear the key material
    before freeing struct tls_context, but it missed that tls_device.c has its
    own way of freeing this structure. Replace the missing free.

    Fixes: 86029d10af18 ("tls: zero the crypto information from tls_context before freeing")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Neither drivers nor the tls offload code currently supports TLS
    version 1.3. Check the TLS version when installing connection
    state. TLS 1.3 will just fall back to the kernel crypto for now.
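
    The guard is conceptually the following (a sketch; TLS_1_2_VERSION
    is the uapi constant, exact placement in the offload setup path is
    simplified):

        /* Offload only handles TLS 1.2; anything else falls back to
         * the SW (kernel crypto) implementation. */
        if (crypto_info->version != TLS_1_2_VERSION)
                return -EOPNOTSUPP;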

    Fixes: 130b392c6cd6 ("net: tls: Add tls 1.3 support")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

12 Jun, 2019

6 commits

  • TLS offload drivers keep track of TCP seq numbers to make sure
    the packets are fed into the HW in order.

    When packets get dropped on the way through the stack, the driver
    will get out of sync and have to use fallback encryption, but unless
    TCP seq number is resynced it will never match the packets correctly
    (or even worse - use incorrect record sequence number after TCP seq
    wraps).

    Existing drivers (mlx5) feed the entire record on every out-of-order
    event, allowing FW/HW to always be in sync.

    This patch adds an alternative, more akin to the RX resync. When the
    driver sees a frame which is past its expected sequence number, the
    stream must have gotten out of order (if the sequence number is
    smaller than expected it's likely a retransmission, which doesn't
    require resync). The driver will ask the stack to perform TX sync
    before it submits the next full record, and fall back to software
    crypto until the stack has performed the sync.
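
    A rough driver-side sketch of that flow (helper and variable names
    are assumptions based on the description, not taken verbatim from a
    driver):

        /* Frame is past the TCP seq the HW expects: the stream got out
         * of order.  Ask the stack to resync before the next full
         * record and encrypt this skb in software in the meantime. */
        if (unlikely(after(ntohl(tcp_hdr(skb)->seq), expected_tcp_seq))) {
                tls_offload_tx_resync_request(skb->sk);
                skb = tls_encrypt_skb(skb);
        }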

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Currently only RX direction is ever resynced, however, TX may
    also get out of sequence if packets get dropped on the way to
    the driver. Rename the resync callback and add a direction
    parameter.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • TLS offload device may lose sync with the TCP stream if packets
    arrive out of order. Drivers can currently request a resync at
    a specific TCP sequence number. When a record is found starting
    at that sequence number kernel will inform the device of the
    corresponding record number.

    This requires the device to constantly scan the stream for a
    known pattern (constant bytes of the header) after sync is lost.

    This patch adds an alternative approach which is entirely under
    the control of the kernel. Kernel tracks records it had to fully
    decrypt, even though TLS socket is in TLS_HW mode. If multiple
    records did not have any decrypted parts - it's a pretty strong
    indication that the device is out of sync.

    We choose the min number of fully encrypted records to be 2,
    which should hopefully be more than will get retransmitted at
    a time.

    After kernel decides the device is out of sync it schedules a
    resync request. If the TCP socket is empty the resync gets
    performed immediately. If socket is not empty we leave the
    record parser to resync when next record comes.

    Before resync in message parser we peek at the TCP socket and
    don't attempt the sync if the socket already has some of the
    next record queued.

    On resync failure (encrypted data continues to flow in) we
    retry with exponential backoff, up to once every 128 records
    (with a 16k record that's at most once every 2M of data).
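
    A minimal sketch of that retry cadence (field names are illustrative,
    not the actual struct members):

        /* Called for each fully encrypted record seen while out of
         * sync: double the interval between attempts, capped at one
         * attempt every 128 records. */
        if (++ctx->records_since_attempt < ctx->resync_backoff)
                return;
        ctx->records_since_attempt = 0;
        ctx->resync_backoff = min_t(u32, ctx->resync_backoff * 2, 128);
        /* ... issue the resync request towards the device ... */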

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • handle_device_resync() doesn't describe the function very well.
    The function checks if resync should be issued upon parsing of
    a new record.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • TLS offload code casts the record number to a u64. The buffer
    should be aligned to 8 bytes, but it's actually a __be64, and
    the rest of the TLS code treats it as a big int. Make the
    offload callbacks take a byte array; drivers can make the
    choice to do the ugly cast if they want to.

    Prepare for copying the record number onto the stack by
    defining a constant for max size of the byte array.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • We subtract "TLS_HEADER_SIZE - 1" from req_seq, then if they
    match we add the same constant to seq. Just add it to seq,
    and we don't have to touch req_seq.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

08 Jun, 2019

1 commit


05 Jun, 2019

3 commits