14 Oct, 2020

1 commit

  • Replace commas with semicolons. Commas introduce unnecessary
    variability in the code structure and are hard to see. What is done
    is essentially described by the following Coccinelle semantic patch
    (http://coccinelle.lip6.fr/):

    // <smpl>
    @@ expression e1,e2; @@
    e1
    -,
    +;
    e2
    ... when any
    // </smpl>
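
    For illustration, the patch rewrites statement sequences joined by a
    comma into separate statements; a hypothetical before/after (made-up
    call sites, not from the actual series):

        -	err = setup_dev(dev),
        -	enable_irq(dev->irq);
        +	err = setup_dev(dev);
        +	enable_irq(dev->irq);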

    Signed-off-by: Julia Lawall
    Link: https://lore.kernel.org/r/1602412498-32025-6-git-send-email-Julia.Lawall@inria.fr
    Signed-off-by: Jakub Kicinski


02 Sep, 2020

1 commit

  • Implement the getsockopt SOL_TLS TLS_RX, which is currently missing. The
    primary use case is to use it in conjunction with TCP_REPAIR to
    checkpoint/restore the TLS record layer state.

    TLS connection state usually lives in the user space library, so it
    can easily be extracted from there. But when TLS connections are
    delegated to kTLS, that is no longer the case, and we need a way to
    extract the TLS state from the kernel for both the TX and RX sides.

    The new TLS_RX getsockopt copies the crypto_info to user in the same
    way as TLS_TX does.
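
    As a sketch, user space retrieval of the RX state might look like this
    (assuming a connected kTLS socket fd using AES-GCM-128; SOL_TLS may
    need a manual define with older headers):

        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/tls.h>

        #ifndef SOL_TLS
        #define SOL_TLS 282
        #endif

        struct tls12_crypto_info_aes_gcm_128 crypto_info;
        socklen_t optlen = sizeof(crypto_info);

        /* fills in the RX cipher type/version, key, IV, salt and rec_seq */
        if (getsockopt(fd, SOL_TLS, TLS_RX, &crypto_info, &optlen))
                perror("getsockopt(TLS_RX)");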

    We have described use cases in our research work in Netdev 0x14
    Transport Workshop [1].

    Also, there is a TLS implementation called tlse [2] which supports
    TLS connection migration. It supports kTLS, and its code shows that
    it anticipates future support for this option.

    [1] https://speakerdeck.com/yutarohayakawa/prism-proxies-without-the-pain
    [2] https://github.com/eduardsui/tlse

    Signed-off-by: Yutaro Hayakawa
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller


29 Jul, 2020

1 commit

  • sockptr_advance never properly worked. Replace it with _offset variants
    of copy_from_sockptr and copy_to_sockptr.
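
    A minimal sketch of the replacement pattern, assuming a hypothetical
    option layout with a header field followed by a payload field:

        #include <linux/sockptr.h>

        u32 hdr, payload;

        if (copy_from_sockptr(&hdr, optval, sizeof(hdr)))
                return -EFAULT;
        /* read the payload at a fixed offset instead of advancing
         * optval with the broken sockptr_advance() */
        if (copy_from_sockptr_offset(&payload, optval, sizeof(hdr),
                                     sizeof(payload)))
                return -EFAULT;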

    Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
    Reported-by: Jason A. Donenfeld
    Reported-by: Ido Schimmel
    Signed-off-by: Christoph Hellwig
    Acked-by: Jason A. Donenfeld
    Tested-by: Ido Schimmel
    Signed-off-by: David S. Miller


25 Jul, 2020

1 commit

  • Rework the remaining setsockopt code to pass a sockptr_t instead of a
    plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
    outside of architecture specific code.
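
    A sketch of what a converted handler looks like (my_setsockopt is a
    placeholder name, not a real callback):

        #include <linux/sockptr.h>

        static int my_setsockopt(struct sock *sk, sockptr_t optval,
                                 unsigned int optlen)
        {
                int val;

                if (optlen < sizeof(val))
                        return -EINVAL;
                /* copy_from_sockptr() handles user and kernel buffers
                 * alike, so no set_fs(KERNEL_DS) is needed */
                if (copy_from_sockptr(&val, optval, sizeof(val)))
                        return -EFAULT;
                /* ... apply val ... */
                return 0;
        }

    Callers wrap their buffer with USER_SOCKPTR() or KERNEL_SOCKPTR() as
    appropriate.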

    Signed-off-by: Christoph Hellwig
    Acked-by: Stefan Schmidt [ieee802154]
    Acked-by: Matthieu Baerts
    Signed-off-by: David S. Miller


11 Jun, 2020

1 commit

  • Pull READ/WRITE_ONCE rework from Will Deacon:
    "This the READ_ONCE rework I've been working on for a while, which
    bumps the minimum GCC version and improves code-gen on arm64 when
    stack protector is enabled"

    [ Side note: I'm _really_ tempted to raise the minimum gcc version to
    4.9, so that we can just say that we require _Generic() support.

    That would allow us to more cleanly handle a lot of the cases where we
    depend on very complex macros with 'sizeof' or __builtin_choose_expr()
    with __builtin_types_compatible_p() etc.

    This branch has a workaround for sparse not handling _Generic(),
    either, but that was already fixed in the sparse development branch,
    so it's really just gcc-4.9 that we'd require. - Linus ]
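
    As a rough sketch of why _Generic() matters here: the qualifier
    stripping that READ_ONCE() relies on boils down to something like the
    following, heavily simplified from the real __unqual_scalar_typeof():

        /* _Generic() applies lvalue conversion to its controlling
         * expression, so a 'const volatile int' lvalue selects the
         * plain 'int' association, losing its qualifiers */
        #define __unqual_scalar_typeof(x) typeof(        \
                _Generic((x),                            \
                         char:    (char)0,               \
                         int:     (int)0,                \
                         long:    (long)0,               \
                         default: (x)))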

    * 'rwonce/rework' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux:
    compiler_types.h: Use unoptimized __unqual_scalar_typeof for sparse
    compiler_types.h: Optimize __unqual_scalar_typeof compilation time
    compiler.h: Enforce that READ_ONCE_NOCHECK() access size is sizeof(long)
    compiler-types.h: Include naked type in __pick_integer_type() match
    READ_ONCE: Fix comment describing 2x32-bit atomicity
    gcov: Remove old GCC 3.4 support
    arm64: barrier: Use '__unqual_scalar_typeof' for acquire/release macros
    locking/barriers: Use '__unqual_scalar_typeof' for load-acquire macros
    READ_ONCE: Drop pointer qualifiers when reading from scalar types
    READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses
    READ_ONCE: Simplify implementations of {READ,WRITE}_ONCE()
    arm64: csum: Disable KASAN for do_csum()
    fault_inject: Don't rely on "return value" from WRITE_ONCE()
    net: tls: Avoid assigning 'const' pointer to non-const pointer
    netfilter: Avoid assigning 'const' pointer to non-const pointer
    compiler/gcc: Raise minimum GCC version for kernel builds to 4.8


16 Apr, 2020

1 commit

  • tls_build_proto() uses WRITE_ONCE() to assign a 'const' pointer to a
    'non-const' pointer. Cleanups to the implementation of WRITE_ONCE() mean
    that this will give rise to a compiler warning, just like a plain old
    assignment would do:

    | net/tls/tls_main.c: In function ‘tls_build_proto’:
    | ./include/linux/compiler.h:229:30: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
    | net/tls/tls_main.c:640:4: note: in expansion of macro ‘smp_store_release’
    | 640 | smp_store_release(&saved_tcpv6_prot, prot);
    | | ^~~~~~~~~~~~~~~~~

    Drop the const qualifier from the local 'prot' variable, as it isn't
    needed.
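
    The change itself is tiny; a sketch of its shape (reconstructed, not
    the exact hunk):

        -	const struct proto *prot = READ_ONCE(sk->sk_prot);
        +	struct proto *prot = READ_ONCE(sk->sk_prot);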

    Cc: Boris Pismenny
    Cc: Aviad Yehezkel
    Cc: John Fastabend
    Cc: Daniel Borkmann
    Signed-off-by: Will Deacon


09 Apr, 2020

1 commit

  • Building with some experimental patches, I came across a warning
    in the tls code:

    include/linux/compiler.h:215:30: warning: assignment discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
    215 | *(volatile typeof(x) *)&(x) = (val); \
    | ^
    net/tls/tls_main.c:650:4: note: in expansion of macro 'smp_store_release'
    650 | smp_store_release(&saved_tcpv4_prot, prot);

    This appears to be a legitimate warning about assigning a const pointer
    into the non-const 'saved_tcpv4_prot' global. Annotate both the ipv4 and
    ipv6 pointers 'const' to make the code internally consistent.
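
    A sketch of the shape of the change (reconstructed, not the exact
    hunks; the two globals are not adjacent in tls_main.c):

        -static struct proto *saved_tcpv4_prot;
        +static const struct proto *saved_tcpv4_prot;
        -static struct proto *saved_tcpv6_prot;
        +static const struct proto *saved_tcpv6_prot;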

    Fixes: 5bb4c45d466c ("net/tls: Read sk_prot once when building tls proto ops")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller


22 Mar, 2020

3 commits

  • sockmap performs lockless writes to sk->sk_prot on the following paths:

    tcp_bpf_{recvmsg|sendmsg} / sock_map_unref
      sk_psock_put
        sk_psock_drop
          sk_psock_restore_proto
            WRITE_ONCE(sk->sk_prot, proto)

    To prevent load/store tearing [1], and to make tooling aware of intentional
    shared access [2], we need to annotate other sites that access sk_prot with
    READ_ONCE/WRITE_ONCE macros.

    Change done with Coccinelle with following semantic patch:

    @@
    expression E;
    identifier I;
    struct sock *sk;
    identifier sk_prot =~ "^sk_prot$";
    @@
    (
    E =
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    |
    -sk->sk_prot = E
    +WRITE_ONCE(sk->sk_prot, E)
    |
    -sk->sk_prot
    +READ_ONCE(sk->sk_prot)
    ->I
    )
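
    For instance, the third alternative in the rule turns a dereference
    through sk_prot at a call site like sock_common_setsockopt() into:

        -	return sk->sk_prot->setsockopt(sk, level, optname, optval, optlen);
        +	return READ_ONCE(sk->sk_prot)->setsockopt(sk, level, optname,
        +						   optval, optlen);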

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

  • Apart from being a "tremendous" win when it comes to generated machine
    code (see the bloat-o-meter output for x86-64 below), this mainly prepares
    the ground for annotating access to sk_prot with READ_ONCE, so that we
    don't pepper the code with access annotations and needlessly repeat loads.

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
    Function                             old     new   delta
    tls_init                             851     805     -46
    Total: Before=21063, After=21017, chg -0.22%

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

  • The helper that builds kTLS proto ops doesn't need to and should not modify
    the base proto ops. Annotate the parameter as read-only.
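
    A sketch of the annotation on the helper in net/tls/tls_main.c
    (reconstructed):

        -static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
        -			 struct proto *base)
        +static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
        +			 const struct proto *base)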

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller


22 Feb, 2020

1 commit

  • sk_msg and ULP frameworks override the protocol callbacks pointer in
    sk->sk_prot, while tcp accesses it locklessly when cloning the listening
    socket, that is, with neither sk_lock nor sk_callback_lock held.

    Once we enable use of listening sockets with sockmap (and hence sk_msg),
    there will be shared access to sk->sk_prot if a socket is being cloned
    while it is inserted into or deleted from the sockmap on another CPU:

    Read side:

    tcp_v4_rcv
      sk = __inet_lookup_skb(...)
      tcp_check_req(sk)
        inet_csk(sk)->icsk_af_ops->syn_recv_sock
          tcp_v4_syn_recv_sock
            tcp_create_openreq_child
              inet_csk_clone_lock
                sk_clone_lock
                  READ_ONCE(sk->sk_prot)

    Write side:

    sock_map_ops->map_update_elem
      sock_map_update_elem
        sock_map_update_common
          sock_map_link_no_progs
            tcp_bpf_init
              tcp_bpf_update_sk_prot
                sk_psock_update_proto
                  WRITE_ONCE(sk->sk_prot, ops)

    sock_map_ops->map_delete_elem
      sock_map_delete_elem
        __sock_map_delete
          sock_map_unref
            sk_psock_put
              sk_psock_drop
                sk_psock_restore_proto
                  tcp_update_ulp
                    WRITE_ONCE(sk->sk_prot, proto)

    Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
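
    On the clone side the annotation amounts to reading sk_prot once
    (sketch, not the exact hunk):

        	/* sk_clone_lock() */
        -	newsk = sk_prot_alloc(sk->sk_prot, priority, sk->sk_family);
        +	newsk = sk_prot_alloc(READ_ONCE(sk->sk_prot), priority,
        +			      sk->sk_family);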

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com


16 Jan, 2020

1 commit

  • When a sockmap sock with TLS enabled is removed, we clean up the
    bpf/psock state and call tcp_update_ulp() to push updates to the TLS
    ULP on top. However, we don't push the write_space callback up and
    instead simply overwrite the op with the psock's stored previous op.
    This may or may not be correct, so to ensure we don't overwrite the
    TLS write space hook, pass this field to the ULP and have it fix up
    the ctx.

    This completes a previous fix that pushed the ops through to the ULP
    but at the time missed doing this for write_space, presumably because
    the write_space TLS hook was added around the same time.
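
    A sketch of the interface change this implies (reconstructed shape),
    with the ULP's update() callback then restoring write_space alongside
    sk_prot:

        -void tcp_update_ulp(struct sock *sk, struct proto *proto);
        +void tcp_update_ulp(struct sock *sk, struct proto *proto,
        +		    void (*write_space)(struct sock *sk));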

    Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com


29 Nov, 2019

1 commit

  • The partially sent record cleanup path increments an SG entry
    directly instead of using sg_next(). This should not be a
    problem today, as encrypted messages should always be
    allocated as arrays. But given this is a cleanup path, it's
    easy to miss were this ever to change. Use sg_next(), and
    simplify the code.
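
    The pattern being fixed, as a sketch:

        -	sg++;			/* assumes a flat array */
        +	sg = sg_next(sg);	/* also handles chained tables */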

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller


20 Nov, 2019

1 commit

  • Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
    with TLS_TX takes the following path:

    tcp_bpf_sendmsg_redir
      tcp_bpf_push_locked
        tcp_bpf_push
          kernel_sendpage_locked
            sock->ops->sendpage_locked

    Also update the flags test in tls_sw_sendpage_locked to allow flag
    MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this.
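
    A sketch of the restored function (reconstructed; the exact flag mask
    and error code may differ from the applied patch):

        int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
        			   int offset, size_t size, int flags)
        {
        	if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
        		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
        		      MSG_NO_SHARED_FRAGS))
        		return -EOPNOTSUPP;

        	return tls_sw_do_sendpage(sk, page, offset, size, flags);
        }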

    Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
    Link: https://github.com/wdebruij/kerneltools/commits/icept.2
    Fixes: 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP")
    Fixes: f3de19af0f5b ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
    Signed-off-by: Willem de Bruijn
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller


07 Nov, 2019

1 commit

  • TLS TX needs to release and re-acquire the socket lock if the send
    buffer fills up.

    The TLS SW TX path currently depends on only allowing one thread to
    enter the function, by abuse of sk_write_pending: if another writer
    is already waiting for memory, no new ones are allowed in.

    This has two problems:
    - writers don't wake other threads up when they leave the kernel;
      meaning that this scheme works for a single extra thread (a second
      application thread or delayed work) because memory becoming
      available will send a wake up request, but as Mallesham and
      Pooja report, with a larger number of threads it leads to threads
      being put to sleep indefinitely;
    - the delayed work does not get _scheduled_ but it may _run_ when
      other writers are present, leading to crashes as writers don't
      expect state to change under their feet (the same records get
      pushed and freed multiple times); it's hard to reliably bail from
      the work, however, because the mere presence of a writer does not
      guarantee that the writer will push pending records before exiting.

    Ensuring wakeups always happen will make the code basically open
    code a mutex. Just use a mutex.
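
    With the mutex, the SW TX entry points follow the usual pattern
    (a sketch; in the actual change the lock lives in struct tls_context):

        	mutex_lock(&tls_ctx->tx_lock);
        	lock_sock(sk);
        	/* ... push records ... */
        	release_sock(sk);
        	mutex_unlock(&tls_ctx->tx_lock);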

    The TLS HW TX path does not have any locking (not even the
    sk_write_pending hack), yet it uses a per-socket sg_tx_data
    array to push records.

    Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
    Reported-by: Mallesham Jatharakonda
    Reported-by: Pooja Trivedi
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller


05 Sep, 2019

2 commits

  • TLS code has a number of #ifdefs which make the code a little
    harder to follow. Recent fixes removed the ifdef around the
    TLS_HW define, so we can switch to the often used pattern
    of defining tls_device functions as empty static inlines
    in the header when CONFIG_TLS_DEVICE=n.
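
    The pattern in question, sketched with one representative tls_device
    entry point (the full patch covers more functions):

        #ifdef CONFIG_TLS_DEVICE
        void tls_device_sk_destruct(struct sock *sk);
        #else
        static inline void tls_device_sk_destruct(struct sock *sk)
        {
        }
        #endif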

    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Hurley
    Reviewed-by: Dirk van der Merwe
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

  • Since we already have the pointer to the full original sk_proto
    stored, use that instead of storing all the individual callback
    pointers as well.
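
    Sketched effect at a call site (reconstructed):

        -	ctx->sk_proto_close(sk, timeout);
        +	ctx->sk_proto->close(sk, timeout);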

    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Hurley
    Reviewed-by: Dirk van der Merwe
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller


16 Aug, 2019

1 commit

  • The ctx->sk_write_space pointer is only set when TLS TX mode is enabled.
    When running without TX mode it is a null pointer, but we still set the
    sk->sk_write_space pointer on close().

    Fix the close path to only overwrite sk->sk_write_space when the current
    pointer is the tls_write_space function, indicating that the tls module
    should clean it up properly as well.
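
    A sketch of the guarded restore in tls_sk_proto_close()
    (reconstructed):

        -	sk->sk_write_space = ctx->sk_write_space;
        +	if (sk->sk_write_space == tls_write_space)
        +		sk->sk_write_space = ctx->sk_write_space;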

    Reported-by: Hillf Danton
    Cc: Ying Xue
    Cc: Andrey Konovalov
    Fixes: 57c722e932cfb ("net/tls: swap sk_write_space on close")
    Signed-off-by: John Fastabend
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller


10 Aug, 2019

1 commit

  • Now that we swap the original proto and clear the ULP pointer
    on close, we have to make sure no callback will try to access
    the freed state. sk_write_space is not part of sk_prot, so
    remember to swap it as well.

    Reported-by: syzbot+dcdc9deefaec44785f32@syzkaller.appspotmail.com
    Fixes: 95fa145479fb ("bpf: sockmap/tls, close can race with map free")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller


06 Aug, 2019

1 commit

  • Looks like we were slightly overzealous with the shutdown()
    cleanup. Even though the sock->sk_state can reach CLOSED again,
    socket->state will not go back to SS_UNCONNECTED once a
    connection is ESTABLISHED. Meaning we will see EISCONN if
    we try to reconnect, and EINVAL if we try to listen.

    Only listen sockets can be shutdown() and reused, but since
    ESTABLISHED sockets can never be connect()ed again or used for
    listen() we don't need to try to clean up the ULP state early.

    Fixes: 32857cf57f92 ("net/tls: fix transition through disconnect with close")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller


22 Jul, 2019

6 commits

  • When a map free is called and a socket is closed in parallel, we have
    two paths that can potentially reset the socket prot ops: the bpf
    close() path and the map free path. This creates a problem as to
    which prot ops should be used from the socket close side.

    If the map_free side completes first, then we want to call the
    original lowest level ops. However, if the tls path runs first,
    we want to call the sockmap ops. Additionally, there was no locking
    around prot updates in the TLS code paths, so the prot ops could be
    changed multiple times, once from the TLS path and again from the
    sockmap side, potentially leaving the ops pointed at either TLS or
    sockmap when the psock and/or tls context had already been destroyed.

    To fix this race, first only update ops inside the callback lock,
    so that TLS, sockmap and the lowest level all agree on the prot
    state. Second, add a ULP callback update() so that lower layers can
    inform the upper layer when they are being removed, allowing the
    upper layer to reset its prot ops.

    This gets us close to allowing sockmap and tls to be stacked
    in arbitrary order, but we will save that patch for *-next trees.

    v4:
    - make sure we don't free things for device;
    - remove the checks which swap the callbacks back
      only if TLS is at the top.

    Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
    Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
    Signed-off-by: John Fastabend
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann

  • It is possible (via shutdown()) for TCP socks to go through the
    TCP_CLOSE state via tcp_disconnect() without actually calling
    tcp_close(), which would then call the tls close callback. Because
    of this a user could disconnect a socket and then put it in a LISTEN
    state, which would break our assumption that sockets are always in
    the ESTABLISHED state.

    More directly, because close() can call unhash(), and unhash is
    implemented by sockmap, if a sockmap socket has TLS enabled we can
    incorrectly destroy the psock from unhash() and then call its close
    handler again. But because the psock (the sockmap socket
    representation) is already destroyed, we call the close handler in
    sk->prot. However, in some cases (the TLS BASE/BASE case) this will
    still point at the sockmap close handler, resulting in a circular
    call and the crash reported by syzbot.

    To fix both of the above issues, implement the unhash() routine for TLS.

    v4:
    - add note about tls offload still needing the fix;
    - move sk_proto to the cold cache line;
    - split TX context free into "release" and "free",
      otherwise the GC work itself is in already freed
      memory;
    - more TX before RX for consistency;
    - reuse tls_ctx_free();
    - schedule the GC work after we're done with context
      to avoid UAF;
    - don't set the unhash in all modes, all modes "inherit"
      TLS_BASE's callbacks anyway;
    - disable the unhash hook for TLS_HW.

    Fixes: 3c4d7559159bf ("tls: kernel TLS support")
    Reported-by: Eric Dumazet
    Signed-off-by: John Fastabend
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Daniel Borkmann

  • The tls close() callback currently drops the sock lock to call
    strp_done(). Split up the RX cleanup into stopping the strparser
    and releasing most resources, syncing strparser and finally
    freeing the context.

    To avoid the need for a strp_done() call on the cleanup path
    of device offload make sure we don't arm the strparser until
    we are sure init will be successful.

    Signed-off-by: John Fastabend
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann

  • The tls close() callback currently drops the sock lock, makes a
    cancel_delayed_work_sync() call, and then relocks the sock.

    By restructuring the code we can avoid dropping the lock and then
    reclaiming it. To simplify this we do the following:

    tls_sk_proto_close
      set_bit(CLOSING)
      set_bit(SCHEDULE)
      cancel_delayed_work_sync()

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann

  • The deprecated TOE offload doesn't actually do anything in
    tls_sk_proto_close() - all TLS code is skipped and the context
    is not freed. Remove the callback to make it easier to refactor
    tls_sk_proto_close().

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann

  • In tls_set_device_offload_rx() we prepare the software context
    for RX fallback and proceed to add the connection to the device.
    Unfortunately, software context prep includes arming the strparser,
    so in case of a later error we have to release the socket lock
    to call strp_done().

    In preparation for not releasing the socket lock halfway through
    callbacks, move arming the strparser into a separate function.
    Following patches will make use of that.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann


02 Jul, 2019

1 commit

  • Commit 86029d10af18 ("tls: zero the crypto information from tls_context
    before freeing") added memzero_explicit() calls to clear the key material
    before freeing struct tls_context, but it missed that tls_device.c has
    its own way of freeing this structure. Replace that free so the key
    material is cleared there as well.

    Fixes: 86029d10af18 ("tls: zero the crypto information from tls_context before freeing")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: David S. Miller
