22 Mar, 2020

3 commits

  • sockmap performs lockless writes to sk->sk_prot on the following paths:

    tcp_bpf_{recvmsg|sendmsg} / sock_map_unref
      sk_psock_put
        sk_psock_drop
          sk_psock_restore_proto
            WRITE_ONCE(sk->sk_prot, proto)

    To prevent load/store tearing [1], and to make tooling aware of intentional
    shared access [2], we need to annotate other sites that access sk_prot with
    READ_ONCE/WRITE_ONCE macros.

    The change was done with Coccinelle, using the following semantic patch:

    @@
    expression E;
    identifier I;
    struct sock *sk;
    identifier sk_prot =~ "^sk_prot$";
    @@
    (
    E =
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    |
    -sk->sk_prot = E
    +WRITE_ONCE(sk->sk_prot, E)
    |
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    ->I
    )
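
    For reference, a simplified sketch of the two annotations (the real
    definitions live in <linux/compiler.h> and handle more cases): the
    volatile cast forces the compiler to emit exactly one full-width load
    or store, preventing tearing and repeated or speculative re-loads of
    sk->sk_prot.

    #define READ_ONCE(x)       (*(const volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, val) \
            do { *(volatile typeof(x) *)&(x) = (val); } while (0)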

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     
  • Apart from being a "tremendous" win when it comes to generated machine
    code (see bloat-o-meter output for x86-64 below), this mainly prepares
    the ground for annotating access to sk_prot with READ_ONCE, so that we
    don't pepper the code with access annotations and needlessly repeat loads.

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
    Function                 old     new   delta
    tls_init                 851     805     -46
    Total: Before=21063, After=21017, chg -0.22%
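
    A minimal sketch of the code shape this enables (context and field
    use are illustrative, not the exact diff):

    /* Snapshot sk->sk_prot once; every later use, including the
     * eventual READ_ONCE() annotation, touches only the local. */
    struct proto *prot = READ_ONCE(sk->sk_prot);

    ctx->sk_proto = prot;            /* hypothetical field use */
    build_protos(tls_prots, prot);   /* hypothetical helper call */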

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     
  • The helper that builds kTLS proto ops doesn't need to, and should not,
    modify the base proto ops. Annotate the parameter as read-only.
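
    A sketch of the idea (helper and table names hypothetical): with the
    base passed as const, any accidental write to it becomes a compile
    error, and the helper can only copy from it.

    static void build_proto_ops(struct proto_ops *tls_ops,
                                const struct proto_ops *base)
    {
            *tls_ops = *base;  /* copy the base, never modify it */
            /* ...install kTLS-specific overrides on the copy... */
    }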

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     

22 Feb, 2020

2 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-02-21

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 25 non-merge commits during the last 4 day(s) which contain
    a total of 33 files changed, 2433 insertions(+), 161 deletions(-).

    The main changes are:

    1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
    with reuseport BPF programs, from Jakub Sitnicki.

    2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
    to specify the tracepoint/function dynamically, from Eelco Chaudron.

    3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
    guided optimizations, from Daniel Xu.

    4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.

    5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
    BTF defined maps, from Andrii Nakryiko.

    6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as
    it has been observed that the former fails in envs with low memlock,
    from Yonghong Song.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The sk_msg and ULP frameworks override the protocol callbacks pointer
    in sk->sk_prot, while TCP accesses it locklessly when cloning the
    listening socket, that is, with neither sk_lock nor sk_callback_lock
    held.

    Once we enable use of listening sockets with sockmap (and hence
    sk_msg), there will be shared access to sk->sk_prot if the socket is
    being cloned while another CPU inserts it into or deletes it from the
    sockmap:

    Read side:

    tcp_v4_rcv
      sk = __inet_lookup_skb(...)
      tcp_check_req(sk)
        inet_csk(sk)->icsk_af_ops->syn_recv_sock
          tcp_v4_syn_recv_sock
            tcp_create_openreq_child
              inet_csk_clone_lock
                sk_clone_lock
                  READ_ONCE(sk->sk_prot)

    Write side:

    sock_map_ops->map_update_elem
      sock_map_update_elem
        sock_map_update_common
          sock_map_link_no_progs
            tcp_bpf_init
              tcp_bpf_update_sk_prot
                sk_psock_update_proto
                  WRITE_ONCE(sk->sk_prot, ops)

    sock_map_ops->map_delete_elem
      sock_map_delete_elem
        __sock_map_delete
          sock_map_unref
            sk_psock_put
              sk_psock_drop
                sk_psock_restore_proto
                  tcp_update_ulp
                    WRITE_ONCE(sk->sk_prot, proto)

    Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
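
    A minimal sketch of the annotated reader side (sk_clone_lock(), body
    abridged): the proto pointer is loaded exactly once, so a concurrent
    WRITE_ONCE() from sockmap cannot be observed torn or re-loaded
    mid-clone.

    struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
    {
            struct proto *prot = READ_ONCE(sk->sk_prot);
            struct sock *newsk;

            newsk = sk_prot_alloc(prot, priority, sk->sk_family);
            /* ...rest of the clone uses prot, not sk->sk_prot... */
    }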

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com

    Jakub Sitnicki
     

20 Feb, 2020

1 commit

  • Current code doesn't check whether the TCP sequence number starts from
    (or after) the 1st record's start sequence number; it only checks that
    the sequence number is before the 1st record's end sequence number.
    This problem will always be a possibility in the retransmit case. If a
    record which belongs to a requested sequence number has already been
    deleted, tls_get_record will start looking into the list, and per the
    check it will only verify that the sequence number is before the end
    sequence of the 1st record, which is always true, so it will always
    return the 1st record when it should in fact return NULL.

    As part of the fix, look at each record only if the sequence number
    lies within the list; otherwise return NULL.

    One more check is added: the driver looks for the start-marker record
    to handle TCP packets which are before the TLS offload start sequence
    number, so return the 1st record if it is the TLS start marker and the
    sequence number is before the 1st record's starting sequence number.
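
    A hedged sketch of the corrected lookup (helper names as in
    include/net/tls.h; list handling simplified):

    static struct tls_record_info *
    find_record(struct list_head *records, u32 seq)
    {
            struct tls_record_info *info, *first;

            first = list_first_entry_or_null(records,
                                             struct tls_record_info, list);
            if (!first)
                    return NULL;
            /* Start-marker record also covers seqs before offload start. */
            if (before(seq, tls_record_start_seq(first)))
                    return tls_record_is_start_marker(first) ? first : NULL;

            list_for_each_entry(info, records, list)
                    if (before(seq, info->end_seq) &&
                        !before(seq, tls_record_start_seq(info)))
                            return info;
            return NULL;    /* seq does not lie within the list */
    }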

    Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
    Signed-off-by: Rohit Maheshwari
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Rohit Maheshwari
     

16 Jan, 2020

5 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-01-15

    The following pull-request contains BPF updates for your *net* tree.

    We've added 12 non-merge commits during the last 9 day(s) which contain
    a total of 13 files changed, 95 insertions(+), 43 deletions(-).

    The main changes are:

    1) Fix refcount leak for TCP time wait and request sockets for socket lookup
    related BPF helpers, from Lorenz Bauer.

    2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.

    3) Batch of several sockmap and related TLS fixes found while operating
    more complex BPF programs with Cilium and OpenSSL, from John Fastabend.

    4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
    to avoid purging data upon teardown, from Lingpeng Chen.

    5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
    dump a BPF map's value with BTF, from Martin KaFai Lau.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When the user returns SK_DROP, we need to reset the number of copied
    bytes to indicate to the user that the bytes were dropped and not
    sent. If we don't reset the copied arg, sendmsg will return as if
    those bytes were copied, giving the user a positive return value.

    This works as expected today, except in the case where the user also
    pops bytes. In the pop case the sg.size is reduced, but we don't
    correctly account for this when the copied count is reset. The popped
    bytes are not accounted for, and we return a small positive value,
    potentially confusing the user.

    The reason this happens is a typo: we do the wrong comparison when
    accounting for popped bytes. In this fix, notice the if/else is not
    needed, and that we have a similar problem if we push data, except
    there it's not visible to the user, because if delta is larger than
    sg.size we return a negative value, so it appears as an error
    regardless.
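
    A sketch of the fixed accounting in the drop path (variable names
    from the sendmsg verdict handler, control flow hedged):

    case __SK_DROP:
    default:
            sk_msg_free_partial(sk, msg, send);
            /* No if/else needed: always subtract both the dropped
             * bytes and the pop/push delta from the reported count. */
            *copied -= (send + delta);
            err = -EACCES;  /* hypothetical error propagation */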

    Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com

    John Fastabend
     
  • It's possible, through a set of push, pop, and apply helper calls, to
    construct a skmsg (which is just a ring of scatterlist elements) with
    the start value larger than the end value. For example,

             end       start
              |          |
              v          v
    |_0_|_1_| ... |_n_|_n+1_|

    where end points at 1 and start points at n, so that the set of valid
    elements is {n, n+1, 0, 1}.

    Currently, because we don't build the correct chain, only {n, n+1}
    will be sent. This adds a check and an sg_chain call to correctly
    submit the above to the crypto and TLS send path.
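
    A hedged sketch of the added chaining (bounds illustrative): when the
    ring has wrapped (end < start), link the tail of the scatterlist
    array back to its head so the crypto layer walks all valid elements.

    if (msg->sg.end < msg->sg.start)
            sg_chain(&msg->sg.data[msg->sg.start],
                     MAX_MSG_FRAGS - msg->sg.start + 1,
                     msg->sg.data);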

    Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com

    John Fastabend
     
  • It is possible, using the push helper, to build a plaintext buffer
    that is larger than the allocated encrypt buffer. When this record is
    pushed to the crypto layers, it can result in a NULL pointer
    dereference, because the crypto API expects the encrypt buffer to be
    large enough to fit the plaintext buffer. Kernel splat below.

    To resolve this, catch the cases where this can happen and split the
    buffer into two records to send individually. Unfortunately, there is
    still one case to handle, where the split creates a zero-sized buffer.
    In this case we merge the buffers and unmark the split. This happens
    when apply is zero and the user pushed data beyond the encrypt buffer.
    It fixes the original case as well, because the split allocated an
    encrypt buffer larger than the plaintext buffer and the merge simply
    moves the pointers around, so we now have a reference to the new
    (larger) encrypt buffer.

    Perhaps it's not ideal, but it seems the best solution for a fixes
    branch; it avoids separately handling two cases, (a) an apply that
    needs a split and (b) the non-apply case. These are edge cases anyway,
    so optimizing them seems unnecessary unless someone wants to do so
    later in next branches.

    [ 306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
    [...]
    [ 306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
    [...]
    [ 306.770350] Call Trace:
    [ 306.770956] scatterwalk_map_and_copy+0x6c/0x80
    [ 306.772026] gcm_enc_copy_hash+0x4b/0x50
    [ 306.772925] gcm_hash_crypt_remain_continue+0xef/0x110
    [ 306.774138] gcm_hash_crypt_continue+0xa1/0xb0
    [ 306.775103] ? gcm_hash_crypt_continue+0xa1/0xb0
    [ 306.776103] gcm_hash_assoc_remain_continue+0x94/0xa0
    [ 306.777170] gcm_hash_assoc_continue+0x9d/0xb0
    [ 306.778239] gcm_hash_init_continue+0x8f/0xa0
    [ 306.779121] gcm_hash+0x73/0x80
    [ 306.779762] gcm_encrypt_continue+0x6d/0x80
    [ 306.780582] crypto_gcm_encrypt+0xcb/0xe0
    [ 306.781474] crypto_aead_encrypt+0x1f/0x30
    [ 306.782353] tls_push_record+0x3b9/0xb20 [tls]
    [ 306.783314] ? sk_psock_msg_verdict+0x199/0x300
    [ 306.784287] bpf_exec_tx_verdict+0x3f2/0x680 [tls]
    [ 306.785357] tls_sw_sendmsg+0x4a3/0x6a0 [tls]

    test_sockmap test signature to trigger bug,

    [TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):

    Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com

    John Fastabend
     
  • When a sockmap sock with TLS enabled is removed, we clean up bpf/psock
    state and call tcp_update_ulp() to push updates to the TLS ULP on top.
    However, we don't push the write_space callback up and instead simply
    overwrite the op with the previous op stored by the psock. This may or
    may not be correct, so to ensure we don't overwrite the TLS write_space
    hook, pass this field to the ULP and have it fix up the ctx.
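
    A sketch of the resulting hook (assuming the ULP update callback now
    carries write_space; exact prototypes hedged):

    void tcp_update_ulp(struct sock *sk, struct proto *proto,
                        void (*write_space)(struct sock *sk))
    {
            struct inet_connection_sock *icsk = inet_csk(sk);

            if (icsk->icsk_ulp_ops && icsk->icsk_ulp_ops->update)
                    icsk->icsk_ulp_ops->update(sk, proto, write_space);
    }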

    This completes a previous fix that pushed the ops through to the ULP
    but at the time missed doing this for write_space, presumably because
    the write_space TLS hook was added around the same time.

    Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com

    John Fastabend
     

11 Jan, 2020

2 commits

  • Mallesham reports that TLS with an async accelerator was broken by
    commit d10523d0b3d7 ("net/tls: free the record on encryption error"),
    because encryption can return -EINPROGRESS in such setups, which
    should not be treated as an error.

    The error is also present in the BPF path (likely copied from there).
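
    A sketch of the shape of the fix (context hedged): -EINPROGRESS from
    an async encrypt means the request was queued, not that it failed, so
    the record must not be freed in that case.

    err = tls_push_record(sk, flags, record_type);
    if (err < 0 && err != -EINPROGRESS) {
            *copied -= sk_msg_free(sk, msg);  /* genuine failure */
            tls_free_open_rec(sk);            /* hypothetical cleanup */
    }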

    Reported-by: Mallesham Jatharakonda
    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Fixes: d10523d0b3d7 ("net/tls: free the record on encryption error")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • When the device loses sync midway through a record, the kernel has to
    re-encrypt the part of the record which the device already decrypted
    to be able to decrypt and authenticate the record in its entirety.

    The re-encryption piggybacks on the decryption routine, but obviously,
    because the partially decrypted record can't be authenticated, the
    crypto API returns an error which is then ignored by
    tls_device_reencrypt().

    Commit 5c5ec6685806 ("net/tls: add TlsDecryptError stat") added a
    statistic to count decryption errors; this statistic can't be
    incremented when we see the expected re-encryption error. Move the inc
    to the caller.
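
    A sketch of the move (call sites hedged): the counter is bumped only
    in the caller that knows a decrypt failure is genuine, not in the
    shared routine the re-encrypt path also uses.

    err = decrypt_record(sk, skb);  /* hypothetical shared routine */
    if (err < 0)
            TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);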

    Reported-and-tested-by: David Beckett
    Fixes: 5c5ec6685806 ("net/tls: add TlsDecryptError stat")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

29 Nov, 2019

4 commits

  • The partially-sent-record cleanup path increments an SG entry directly
    instead of using sg_next(). This should not be a problem today, as
    encrypted messages should always be allocated as arrays. But given
    this is a cleanup path, it's easy to miss were this ever to change.
    Use sg_next(), and simplify the code.
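
    The safer walk, sketched (per-entry cleanup illustrative): sg_next()
    honors chained lists, whereas "++sg" is only valid for flat arrays.

    for (sg = sgl; sg; sg = sg_next(sg)) {
            put_page(sg_page(sg));
            sk_mem_uncharge(sk, sg->length);
    }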

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Looks like when BPF support was added by commit d3b18ad31f93
    ("tls: add bpf support to sk_msg handling") and
    commit d829e9c4112b ("tls: convert to generic sk_msg interface")
    it broke/removed the support for in-place crypto as added by
    commit 4e6d47206c32 ("tls: Add support for inplace records
    encryption").

    The inplace_crypto member of struct tls_rec is dead: it is initialized
    to zero, and sometimes set to zero again. It used to be set to 1 when
    a record was allocated, but the skmsg code doesn't seem to have been
    written with in-place crypto in mind.

    Since non-trivial effort is required to bring the feature back, and we
    don't really have the HW to measure the benefit, just remove the
    leftover support for now to avoid confusing readers.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • When tls_do_encryption() fails, the SG lists are left with the SG_END
    and SG_CHAIN marks in place. One could hope that once encryption
    fails we will never see the record again, but that is in fact not
    true. Commit d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    added special handling for ENOMEM and ENOSPC errors, which means we
    may see the same record re-submitted.

    As suggested by John, free the record; the BPF code is already doing
    just that.

    Reported-by: syzbot+df0d4ec12332661dd1f9@syzkaller.appspotmail.com
    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • bpf_exec_tx_verdict() may free the record if tls_push_record()
    fails, or if the entire record got consumed by BPF. Re-check
    ctx->open_rec before touching the data.
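
    A sketch of the re-check (variable names as in tls_sw.c, flow
    hedged): reload the record pointer after the verdict has run.

    ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
                              record_type, &copied, flags);
    rec = ctx->open_rec;    /* may have been freed and cleared */
    if (!rec)
            continue;       /* nothing left of this record */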

    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

20 Nov, 2019

1 commit

  • Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
    with TLS_TX takes the following path:

    tcp_bpf_sendmsg_redir
      tcp_bpf_push_locked
        tcp_bpf_push
          kernel_sendpage_locked
            sock->ops->sendpage_locked

    Also update the flags test in tls_sw_sendpage_locked to allow the
    MSG_NO_SHARED_FRAGS flag, which bpf_tcp_sendmsg sets.
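
    An illustrative version of the updated flags test (mask hedged
    against the kernel of that era):

    if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
                  MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
                  MSG_NO_SHARED_FRAGS))
            return -ENOTSUPP;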

    Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
    Link: https://github.com/wdebruij/kerneltools/commits/icept.2
    Fixes: 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP")
    Fixes: f3de19af0f5b ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
    Signed-off-by: Willem de Bruijn
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

16 Nov, 2019

1 commit

  • If PROC_FS is not set, gcc warns:

    net/tls/tls_proc.c:23:12: warning:
    'tls_statistics_seq_show' defined but not used [-Wunused-function]

    Use #ifdef to guard this.
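
    The guard, sketched (handler body abridged):

    #ifdef CONFIG_PROC_FS
    static int tls_statistics_seq_show(struct seq_file *seq, void *v)
    {
            /* ...emit the TLS MIB counters... */
            return 0;
    }
    #endif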

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Acked-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    YueHaibing
     

07 Nov, 2019

2 commits

  • TLS TX needs to release and re-acquire the socket lock if the send
    buffer fills up.

    The TLS SW TX path currently depends on only allowing one thread to
    enter the function, by abuse of sk_write_pending. If another writer is
    already waiting for memory, no new ones are allowed in.

    This has two problems:

     - writers don't wake other threads up when they leave the kernel,
       meaning that this scheme works for a single extra thread (second
       application thread or delayed work) because memory becoming
       available will send a wake-up request, but as Mallesham and Pooja
       report, with a larger number of threads it leads to threads being
       put to sleep indefinitely;

     - the delayed work does not get _scheduled_ but it may _run_ when
       other writers are present, leading to crashes as writers don't
       expect state to change under their feet (the same records get
       pushed and freed multiple times); it's hard to reliably bail from
       the work, however, because the mere presence of a writer does not
       guarantee that the writer will push pending records before exiting.

    Ensuring wakeups always happen will make the code basically open
    code a mutex. Just use a mutex.
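
    A sketch of the scheme (tx_lock as merged upstream; the helper split
    is hypothetical): every TX entry point takes the per-context mutex,
    so writer exclusion is explicit rather than inferred.

    mutex_lock(&tls_ctx->tx_lock);
    lock_sock(sk);
    ret = tls_sw_sendmsg_locked(sk, msg, size);  /* hypothetical */
    release_sock(sk);
    mutex_unlock(&tls_ctx->tx_lock);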

    The TLS HW TX path does not have any locking (not even the
    sk_write_pending hack), yet it uses a per-socket sg_tx_data
    array to push records.

    Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
    Reported-by: Mallesham Jatharakonda
    Reported-by: Pooja Trivedi
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • A non-zero sk_write_pending does not guarantee that the partial record
    will be pushed. If the thread waiting for memory times out, the
    pending record may get stuck.

    In the case of tls_device, there is no path where a partial record is
    set while a writer is present in the first place. A partial record is
    set only in tls_push_sg(), and tls_push_sg() will return an error
    immediately. All tls_device callers of tls_push_sg() will return (and
    not wait for memory) if it failed.

    Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

06 Oct, 2019

6 commits

  • Add a statistic for the number of RX resyncs sent down to the NIC.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add a statistic for TLS record decryption errors.

    Since devices are supposed to pass records as-is when they encounter
    errors, this statistic will count bad records in both pure software
    and inline crypto configurations.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add SNMP stats for the number of sockets with successfully installed
    sessions. Break them down into software and hardware ones. Note that
    if hardware offload fails, the stack uses the software implementation
    and counts the session appropriately.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add a skeleton structure for adding TLS statistics.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add a tracepoint to the TLS offload's fast path. This tracepoint can
    be used to track the decrypted and encrypted status of received
    records. Records decrypted by the device should have decrypted set to
    1; records which have neither decrypted nor encrypted set are
    partially decrypted, require re-encryption, and are therefore the most
    expensive to deal with.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Add tracing of device-related interaction to aid performance
    analysis, especially around resync:

    tls:tls_device_offload_set
    tls:tls_device_rx_resync_send
    tls:tls_device_rx_resync_nh_schedule
    tls:tls_device_rx_resync_nh_delay
    tls:tls_device_tx_resync_req
    tls:tls_device_tx_resync_send

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
