18 Nov, 2020

5 commits

  • When an skb has a frag_list, it's possible for skb_to_sgvec() to fail. This
    happens when the scatterlist has fewer elements to store pages than would
    be needed for the initial skb plus any of its frags.

    This case appears rare, but is possible when running RX parser/verdict
    programs exposed to the internet. Currently, when this happens we throw
    an error, break the pipe, and kfree the msg. This effectively breaks the
    application or forces it to do a retry.

    Let's catch this case and handle it by doing an skb_linearize() on any
    skb we receive with frags. At this point skb_to_sgvec() should not fail
    because the failing conditions would require frags to be in place.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370

    John Fastabend
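
The linearize-before-mapping idea can be sketched in userspace C. This is a toy model, not the kernel patch: struct pkt, struct frag, MAX_SG and the pkt_* helpers are hypothetical stand-ins for the skb, its frags, the scatterlist size, skb_to_sgvec() and skb_linearize().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SG 4                /* hypothetical scatterlist capacity */

struct frag { const char *data; size_t len; };

/* Toy packet: a list of fragments, or one buffer after linearizing. */
struct pkt {
    struct frag frags[8];
    int nr_frags;
    char *linear;               /* owned buffer once linearized, else NULL */
};

/* Map fragments into sg slots; fail if they do not fit. */
static int pkt_to_sg(const struct pkt *p, struct frag *sg, int max)
{
    if (p->nr_frags > max)
        return -1;
    for (int i = 0; i < p->nr_frags; i++)
        sg[i] = p->frags[i];
    return p->nr_frags;
}

/* Copy all fragments into one contiguous buffer. */
static int pkt_linearize(struct pkt *p)
{
    size_t total = 0, off = 0;

    for (int i = 0; i < p->nr_frags; i++)
        total += p->frags[i].len;
    char *buf = malloc(total ? total : 1);
    if (!buf)
        return -1;
    for (int i = 0; i < p->nr_frags; i++) {
        memcpy(buf + off, p->frags[i].data, p->frags[i].len);
        off += p->frags[i].len;
    }
    free(p->linear);
    p->linear = buf;
    p->frags[0].data = buf;
    p->frags[0].len = total;
    p->nr_frags = 1;
    return 0;
}

/* Linearize any packet that carries frags, then map it. */
static int pkt_ingress(struct pkt *p, struct frag *sg)
{
    if (p->nr_frags > 1 && pkt_linearize(p) < 0)
        return -1;              /* caller may retry later */
    return pkt_to_sg(p, sg, MAX_SG);
}
```

With five fragments and four slots, pkt_to_sg() alone fails, while pkt_ingress() collapses the packet to a single entry first.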
     
  • If the skb_verdict_prog knowingly redirects an skb to the socket it was
    received on, fix your BPF program: this is not optimal and an abuse of
    the API, so please use SK_PASS. That said, there may be cases, such as
    socket load balancing, where the socket is picked by a hash or some other
    scheme that in rare cases selects the same socket the skb was received
    on. If this happens we don't want to confuse userspace by giving it an
    EAGAIN error if we can avoid it.

    The goal is to avoid double accounting in these cases. At the moment, even
    if the skb has already been charged against the socket's rcvbuf and forward
    alloc, we check it again and call skb_set_owner_r(), causing it to be
    orphaned and recharged. For one, this is useless work, but more importantly
    we can have a case where the skb could be put on the ingress queue, but
    because we are under memory pressure we return EAGAIN. The trouble
    here is that the skb has already been accounted for, so any rcvbuf checks
    include the memory associated with the packet already. This rolls
    up and can result in unnecessary EAGAIN errors in userspace read()
    calls.

    Fix by doing an unlikely check and skipping the checks if skb->sk == sk.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370

    John Fastabend
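
A small userspace model of the owner check, under the assumption that a socket is just a counter plus a limit. The struct and function names are hypothetical; only the skb->sk == sk comparison comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sock { int rmem_alloc; int rcvbuf; };
struct pkt  { struct sock *owner; int truesize; };

/* Charge the packet to sk, releasing any previous owner first. */
static void pkt_set_owner(struct pkt *p, struct sock *sk)
{
    if (p->owner)
        p->owner->rmem_alloc -= p->truesize;
    p->owner = sk;
    sk->rmem_alloc += p->truesize;
}

/* Enqueue on sk; skip the limit check if sk already owns the packet. */
static int pkt_ingress(struct pkt *p, struct sock *sk)
{
    if (p->owner == sk)
        return 0;               /* already accounted for */
    if (sk->rmem_alloc + p->truesize > sk->rcvbuf)
        return -EAGAIN;
    pkt_set_owner(p, sk);
    return 0;
}

/* Behaviour without the owner check, for comparison. */
static int pkt_ingress_always_check(struct pkt *p, struct sock *sk)
{
    if (sk->rmem_alloc + p->truesize > sk->rcvbuf)
        return -EAGAIN;
    pkt_set_owner(p, sk);
    return 0;
}
```

Once a 600-byte packet is charged to a socket with a 1000-byte limit, the unconditional check counts the packet twice and returns -EAGAIN, while the owner-aware path accepts it and leaves the counter unchanged.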
     
  • If a socket redirects to itself while it is under memory pressure, it is
    possible to get the socket stuck so that recv() returns EAGAIN and the
    socket cannot advance for some time. This happens because when
    redirecting an skb to the same socket we received the skb on, we first
    check if it is OK to enqueue the skb on the receiving socket by checking
    memory limits. But if the skb is itself the object holding the memory
    needed to enqueue the skb, we will keep retrying from the kernel side
    and always fail with EAGAIN. Then userspace will get a recv() EAGAIN
    error if there are no skbs in the psock ingress queue. This will continue
    until either some skbs get kfree'd, causing the memory pressure to
    reduce far enough that we can enqueue the pending packet, or the
    socket is destroyed. In some cases it's possible to get a socket
    stuck for a noticeable amount of time if the socket is only receiving
    skbs from sk_skb verdict programs. To reproduce, I make the socket
    memory limits ridiculously low so sockets are always under memory
    pressure. More often, though, under memory pressure it looks like
    a spurious EAGAIN error on the userspace side, causing userspace to retry,
    and typically enough has moved on the memory side that the retry works.

    To fix this, skip the memory checks and skb_orphan() if receiving on the
    same sock as already assigned.

    For SK_PASS cases this is easy: it's always the same socket, so we
    can just omit the orphan/set_owner pair.

    For backlog cases we need to check skb->sk and decide if the orphan
    and set_owner pair are needed.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370

    John Fastabend
     
  • We use skb->len with sk_rmem_schedule(), which is not correct. Instead
    use truesize to align with socket and TCP stack usage of sk_rmem_schedule().

    Suggested-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370

    John Fastabend
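
Why the size argument matters can be shown with a toy limit check. This is a sketch with hypothetical types, and it compares against a single limit, which is simpler than what the real sk_rmem_schedule() does; the point is only that payload length understates the memory a packet actually pins.

```c
#include <assert.h>
#include <stddef.h>

struct pkt  { size_t len; size_t truesize; };   /* payload vs. real footprint */
struct sock { size_t rmem_alloc; size_t rcvbuf; };

/* Return 1 if `size` more bytes fit under the limit, else 0. */
static int rmem_fits(const struct sock *sk, size_t size)
{
    return sk->rmem_alloc + size <= sk->rcvbuf;
}
```

For a packet with 100 bytes of payload and a 768-byte footprint on a socket holding 500 of 1000 bytes, the payload-based check passes while the footprint-based check correctly refuses.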
     
  • Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
    allows users to tune SO_RCVBUF and sockmap will honor the setting.

    We can refactor the if (charge) case out in later patches, but keep this
    fix to the point.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Suggested-by: Jakub Sitnicki
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370

    John Fastabend
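
From the application side, tuning the limit is a plain setsockopt() call. The helper below is a minimal sketch; set_rcvbuf() is a hypothetical name, while setsockopt(), getsockopt() and SO_RCVBUF are the standard socket API.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request a receive buffer size; return the value the kernel reports, or -1. */
static int set_rcvbuf(int fd, int bytes)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len) < 0)
        return -1;
    return val;
}
```

On Linux the reported value is typically larger than the request, because the kernel doubles it to leave room for bookkeeping overhead, so callers should compare with >= rather than ==.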
     

12 Oct, 2020

7 commits

  • Currently, we often run with a nop parser, namely one that does nothing
    but 'return skb->len'. This happens when either our verdict program
    can handle streaming data or it is only looking at socket data such
    as IP addresses and other metadata associated with the flow. The second
    case is common for an L3/L4 proxy, for instance.

    So let's allow loading programs without the parser. Then we can skip
    the stream parser logic and avoid having to add a BPF program that
    is effectively a nop.

    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower

    John Fastabend
     
  • Move skb->sk assignment out of sk_psock_bpf_run() and into individual
    callers. Then we can use a proper skb_set_owner_r() call to assign a
    sk to an skb. This improves things by also charging the truesize against
    the socket's sk_rmem_alloc counter. With this done we get some accounting
    in place to ensure the memory associated with skbs on the workqueue is
    still being accounted for somewhere. Finally, by using skb_set_owner_r()
    the destructor is set up so we can just let the normal kfree_skb() logic
    recover the memory. Combined with the previous patch dropping skb_orphan(),
    we can now recover from memory pressure and maintain accounting.

    Note, we will charge the skbs against their originating socket even
    if being redirected into another socket. Once the skb completes the
    redirect op the kfree_skb() will give the memory back. This is important
    because if we charged the socket we are redirecting to (like it was
    done before this series) the sock_writeable() test could fail because
    the skb we are trying to send is already charged against the socket.

    Also, the TLS case is special. Here we wait until we have decided not to
    simply PASS the packet up the stack. In the case where we PASS the
    packet up the stack we already have an skb which is accounted for on
    the TLS socket context.

    For the parser case we continue to just set/clear skb->sk. This is
    because the skb being used here may be combined with other skbs or
    turned into multiple skbs depending on the parser logic. For example,
    the parser could request a payload length greater than skb->len so
    that the strparser needs to collect multiple skbs. At any rate,
    the final result will be handled in the strparser recv callback.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower

    John Fastabend
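
The charge-on-assign, uncharge-on-free pairing can be modelled in a few lines of userspace C. All names here are hypothetical; the model only mirrors the idea that assigning an owner installs a destructor which later returns the memory.

```c
#include <assert.h>
#include <stddef.h>

struct sock { int rmem_alloc; };

struct pkt {
    struct sock *sk;
    int truesize;
    void (*destructor)(struct pkt *);
};

/* Give the charged memory back to the owning socket. */
static void pkt_rfree(struct pkt *p)
{
    p->sk->rmem_alloc -= p->truesize;
}

/* Charge truesize to sk and arrange for it to be returned on free. */
static void pkt_set_owner_r(struct pkt *p, struct sock *sk)
{
    p->sk = sk;
    p->destructor = pkt_rfree;
    sk->rmem_alloc += p->truesize;
}

/* Free path: run the destructor, if any, then drop the ownership link. */
static void pkt_free(struct pkt *p)
{
    if (p->destructor)
        p->destructor(p);
    p->destructor = NULL;
    p->sk = NULL;
}
```

A packet charged to the originating socket stays charged there while it waits to be redirected; the target socket's counter is untouched, and the free path restores the origin's counter without any explicit uncharge call.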
     
  • Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
    is from a skb_clone() in __strp_recv(), so it never has a destructor or an
    sk assigned. Plus, it's confusing to read because it might hint to the
    reader that the skb could have an sk assigned, which is not true. Even if
    we did have an sk assigned, it would be cleaner to simply wait for the
    upcoming kfree_skb().

    Additionally, move the comment about the strparser clone up so it's closer
    to the logic it is describing, and add to it so that it is more complete.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower

    John Fastabend
     
  • In the sk_skb redirect case we didn't handle the case where we overrun
    the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
    Because we didn't have anything implemented we simply dropped the skb.
    This meant data could be dropped if socket memory accounting was in
    place.

    This fixes the above dropped data case by moving the memory checks
    later in the code where we actually do the send or recv. This pushes
    those checks into the workqueue and allows us to return an EAGAIN error
    which in turn allows us to try again later from the workqueue.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower

    John Fastabend
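
Deferring the memory check to delivery time, and retrying on EAGAIN instead of dropping, might look like this in a userspace model. The backlog structure and helper names are hypothetical; the real code uses a workqueue and skb queues.

```c
#include <assert.h>
#include <errno.h>

#define QLEN 8

struct sock { int rmem_alloc; int rcvbuf; };

struct backlog {
    int sizes[QLEN];            /* truesize of each queued packet */
    int head, count;
};

static int backlog_push(struct backlog *q, int size)
{
    if (q->count == QLEN)
        return -ENOBUFS;
    q->sizes[(q->head + q->count) % QLEN] = size;
    q->count++;
    return 0;
}

/* The memory check happens here, at delivery time. */
static int deliver(struct sock *sk, int size)
{
    if (sk->rmem_alloc + size > sk->rcvbuf)
        return -EAGAIN;
    sk->rmem_alloc += size;
    return 0;
}

/* One worker pass: stop at the first -EAGAIN and keep that packet queued. */
static int backlog_run(struct backlog *q, struct sock *sk)
{
    int done = 0;

    while (q->count > 0) {
        if (deliver(sk, q->sizes[q->head]) == -EAGAIN)
            break;
        q->head = (q->head + 1) % QLEN;
        q->count--;
        done++;
    }
    return done;
}
```

With two 600-byte packets and a 1000-byte limit, the first pass delivers one and leaves the other queued; after the reader drains the socket, a second pass delivers the remaining packet, so nothing is lost.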
     
  • The skb_set_owner_w() is unnecessary here. The sendpage call will create a
    fresh skb and set the owner correctly from the workqueue. It's also not
    entirely harmless: it consumes cycles and also impacts resource accounting
    by increasing sk_wmem_alloc. This is charging the socket we are going to
    send to for the skb, but we will put it on the workqueue for some time
    before this happens, so we are artificially inflating sk_wmem_alloc for
    this period. Further, we don't know how many skbs will be used to send the
    packet or how it will be broken up when sent over the new socket, so
    charging it with one big sum is also not correct when the workqueue may
    break it up if facing memory pressure. Seeing as we don't know how/when
    this is going to be sent, drop the early accounting.

    A later patch will do proper accounting, charged on the receive socket, for
    the case where skbs get enqueued on the workqueue.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower

    John Fastabend
     
  • When we receive an skb and the ingress skb verdict program returns
    SK_PASS we currently set the ingress flag and put it on the workqueue
    so it can be turned into a sk_msg and put on the sk_msg ingress queue.
    Finally, we notify userspace through the data_ready hook.

    Here we observe that if the workqueue is empty then we can try to
    convert into a sk_msg type and call data_ready directly without
    bouncing through a workqueue. It's a common pattern to have a recv
    verdict program for visibility that always returns SK_PASS. In this
    case, unless there is an ENOMEM error or we overrun the socket, we
    can avoid the workqueue completely, only using it when we fall back
    to error cases caused by memory pressure.

    By doing this we eliminate another case where data may be dropped
    if errors occur on memory limits in the workqueue.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower

    John Fastabend
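
The fast path described above can be sketched as follows. This is a toy model with hypothetical names: the direct path is attempted only when nothing is already queued, so packets are never reordered around the backlog.

```c
#include <assert.h>

struct sock {
    int rmem_alloc;             /* bytes currently charged */
    int rcvbuf;                 /* receive buffer limit */
    int backlog_len;            /* packets waiting for the worker */
    int data_ready;             /* times userspace was notified */
};

/* Try to place the packet on the ingress queue right away. */
static int ingress_direct(struct sock *sk, int truesize)
{
    if (sk->rmem_alloc + truesize > sk->rcvbuf)
        return -1;
    sk->rmem_alloc += truesize;
    sk->data_ready++;
    return 0;
}

/* SK_PASS handling: returns 1 for direct delivery, 0 if deferred. */
static int verdict_pass(struct sock *sk, int truesize)
{
    if (sk->backlog_len == 0 && ingress_direct(sk, truesize) == 0)
        return 1;
    sk->backlog_len++;          /* fall back to the worker */
    return 0;
}
```

Note that once a packet has been deferred, later packets are deferred too, even if memory has since become available; the worker is then responsible for draining them in order.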
     
  • For the sk_skb case where the skb_verdict program returns SK_PASS to continue
    to pass the packet up the stack, the memory limits were already checked
    before enqueuing in skb_queue_tail from the TCP side. So, let's remove the
    extra checks here. The theory is that if the TCP stack believes we have
    memory to receive the packet then let's trust the stack and not double check
    the limits.

    In fact the accounting here can cause a drop if sk_rmem_alloc has increased
    after the stack accepted this packet, but before the duplicate check here.
    And worse, if this happens, because the TCP stack already believes the data
    has been received there is no retransmit.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower

    John Fastabend
     

05 Sep, 2020

1 commit

  • We got slightly different patches removing a double word
    in a comment in net/ipv4/raw.c - picked the version from net.

    Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
    values instead of VNIC login response buffer (following what
    commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login
    response buffer") did).

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

22 Aug, 2020

1 commit

  • Initializing psock->sk_proto and other saved callbacks is only
    done in sk_psock_update_proto, after sk_psock_init has returned.
    The logic for this is difficult to follow, and needlessly complex.

    Instead, initialize psock->sk_proto whenever we allocate a new
    psock. Additionally, assert the following invariants:

    * The SK has no ULP: ULP does its own finagling of sk->sk_prot
    * sk_user_data is unused: we need it to store sk_psock

    Protect our access to sk_user_data with sk_callback_lock, which
    is what other users like reuseport arrays, etc. do.

    The result is that an sk_psock is always fully initialized, and
    that psock->sk_proto is always the "original" struct proto.
    The latter allows us to use psock->sk_proto when initializing
    IPv6 TCP / UDP callbacks for sockmap.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com

    Lorenz Bauer
     

28 Jun, 2020

2 commits

  • If an ingress verdict program specifies message sizes greater than
    skb->len and there is an ENOMEM error due to memory pressure we
    may call the rcv_msg handler outside the strp_data_ready() caller
    context. This is because on an ENOMEM error the strparser will
    retry from a workqueue. The caller currently protects the use of
    psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
    block.

    But, in above workqueue error case the psock is accessed outside
    the read_lock/unlock block of the caller. So instead of using
    psock directly we must do a look up against the sk again to
    ensure the psock is available.

    There is an ugly piece here where we must handle
    the case where we paused the strp and removed the psock. On
    psock removal we first pause the strparser and then remove
    the psock. If the strparser is paused while an skb is
    scheduled on the workqueue the skb will be dropped on the
    floor and kfree_skb() is called. If the workqueue manages
    to get called before we pause the strparser but runs the rcvmsg
    callback after the psock is removed we will hit the unlikely
    case where we run the sockmap rcvmsg handler but do not have
    a psock. For now we will follow strparser logic and drop the
    skb on the floor with kfree_skb(). This is ugly because the
    data is dropped. To date this has not caused problems in practice
    because either the application controlling the sockmap is
    coordinating with the datapath so that skbs are "flushed"
    before removal or we simply wait for the sock to be closed before
    removing it.

    This patch fixes the described RCU bug, and dropping the skb doesn't
    make things worse. Future patches will improve this by allowing
    the normal case where skbs are not merged to skip the strparser
    altogether. In practice many (most?) use cases have no need to
    merge skbs, so it's both a code complexity hit as seen above and
    a performance issue. For example, in the Cilium case we always
    set the strparser up to return skbs 1:1 without any merging and
    have avoided the above issues.

    Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370

    John Fastabend
     
  • There are two paths to generate the below RCU splat. The first and
    most obvious is the result of the BPF verdict program issuing a
    redirect on a TLS socket (this is the splat shown below). Unlike
    the non-TLS case, the caller of the *strp_read() hooks does not
    wrap the call in a rcu_read_lock/unlock. Then if the BPF program
    issues a redirect action we hit the RCU splat.

    However, in the non-TLS socket case the splat appears to be
    relatively rare, because the skmsg caller into the strp_data_ready()
    is wrapped in a rcu_read_lock/unlock, as shown here:

    static void sk_psock_strp_data_ready(struct sock *sk)
    {
            struct sk_psock *psock;

            rcu_read_lock();
            psock = sk_psock(sk);
            if (likely(psock)) {
                    if (tls_sw_has_ctx_rx(sk)) {
                            psock->parser.saved_data_ready(sk);
                    } else {
                            write_lock_bh(&sk->sk_callback_lock);
                            strp_data_ready(&psock->parser.strp);
                            write_unlock_bh(&sk->sk_callback_lock);
                    }
            }
            rcu_read_unlock();
    }

    If the above was the only way to run the verdict program we
    would be safe. But, there is a case where the strparser may throw an
    ENOMEM error while parsing the skb. This is a result of a failed
    skb_clone, or alloc_skb_for_msg while building a new merged skb when
    the msg length needed spans multiple skbs. This will in turn put the
    skb on the strp_wrk workqueue in the strparser code. The skb will
    later be dequeued and verdict programs run, but now from a
    different context without the rcu_read_lock()/unlock() critical
    section in sk_psock_strp_data_ready() shown above. In practice
    I have not seen this yet, because as far as I know most users of the
    verdict programs are also only working on single skbs. In this case no
    merge happens which could trigger the above ENOMEM errors. In addition
    the system would need to be under memory pressure. For example, we
    can't hit the above case in selftests because we missed having tests
    to merge skbs. (Added in later patch)

    To fix the below splat, extend the rcu_read_lock/unlock block to
    include the call to sk_psock_tls_verdict_apply(). This will fix both
    the TLS redirect case and the non-TLS redirect+error case. Also remove
    psock from the sk_psock_tls_verdict_apply() function signature, as it's
    not used there.

    [ 1095.937597] WARNING: suspicious RCU usage
    [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
    [ 1095.944363] -----------------------------
    [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
    [ 1095.950866]
    [ 1095.950866] other info that might help us debug this:
    [ 1095.950866]
    [ 1095.957146]
    [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
    [ 1095.961482] 1 lock held by test_sockmap/15970:
    [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
    [ 1095.968568]
    [ 1095.968568] stack backtrace:
    [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
    [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [ 1095.980519] Call Trace:
    [ 1095.982191] dump_stack+0x8f/0xd0
    [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
    [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
    [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]

    v2: Improve commit message to identify non-TLS redirect plus error case
    condition as well as more common TLS case. In the process I decided
    doing the rcu_read_unlock followed by the lock/unlock inside branches
    was unnecessarily complex. We can just extend the current rcu block
    and get the same effect without the shuffling and branching.
    Thanks Martin!

    Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
    Reported-by: Jakub Sitnicki
    Reported-by: kernel test robot
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370

    John Fastabend
     

02 Jun, 2020

2 commits

  • KTLS uses a stream parser to collect TLS messages and send them to
    the upper layer tls receive handler. This ensures the tls receiver
    has a full TLS header to parse when it is run. However, when a
    socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
    is enabled we end up with two stream parsers running on the same
    socket.

    The result is both try to run on the same socket. First the KTLS
    stream parser runs and calls read_sock(), which will call tcp_read_sock(),
    which in turn calls tcp_recv_skb(). This dequeues the skb from the
    sk_receive_queue. When this is done the KTLS code then calls the
    data_ready() callback, which, because we stacked KTLS on top of the bpf
    stream verdict program, has been replaced with sk_psock_start_strp(). This
    will in turn kick the stream parser again and eventually do the
    same thing KTLS did above, calling into tcp_recv_skb() and dequeuing
    an skb from the sk_receive_queue.

    At this point the data stream is broken. Part of the stream was
    handled by the KTLS side and some other bytes may have been handled
    by the BPF side. Generally this results in either missing data
    or, more likely, a "Bad Message" complaint from the kTLS receive
    handler as the BPF program steals some bytes meant to be in a
    TLS header and/or the TLS header length is no longer correct.

    We've already broken the idealized model where we can stack ULPs
    in any order with generic callbacks on the TX side to handle this.
    So in this patch we do the same thing but for the RX side. We add
    a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
    program is running and add a tls_sw_has_ctx_rx() helper so the BPF
    side can learn there is a TLS ULP on the socket.

    Then on the BPF side we omit calling our stream parser to avoid
    breaking the data stream for the KTLS receiver. Then on the
    KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
    receiver is done with the packet but before it posts the
    msg to userspace. This gives us symmetry between the TX and
    RX halves and IMO makes it usable again. On the TX side we
    process packets in this order BPF -> TLS -> TCP and on
    the receive side in the reverse order TCP -> TLS -> BPF.

    Discovered while testing OpenSSL 3.0 Alpha2.0 release.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • We will need this block of code called from the TLS context shortly, so
    let's refactor the redirect logic so it's easy to use. This also
    cleans up the switch stmt so we have fewer fallthrough cases.

    No logic changes are intended.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jakub Sitnicki
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     

25 Feb, 2020

1 commit

  • All of these cases are strictly of the form:

    preempt_disable();
    BPF_PROG_RUN(...);
    preempt_enable();

    Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
    with:

    migrate_disable();
    BPF_PROG_RUN(...);
    migrate_enable();

    On non RT enabled kernels this maps to preempt_disable/enable() and on RT
    enabled kernels this solely prevents migration, which is sufficient as
    there is no requirement to prevent reentrancy to any BPF program from a
    preempting task. The only requirement is that the program stays on the same
    CPU.

    Therefore, this is a trivially correct transformation.

    The seccomp loop does not need protection over the loop. It only needs
    protection per BPF filter program.

    [ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de

    David Miller
     

22 Feb, 2020

1 commit

  • sk_user_data can hold a pointer to an object that is not intended to be
    shared between the parent socket and the child that gets a pointer copy on
    clone. This is the case when sk_user_data points at reference-counted
    object, like struct sk_psock.

    One way to resolve it is to tag the pointer with a no-copy flag by
    repurposing its lowest bit. Based on the bit-flag value we clear the child
    sk_user_data pointer after cloning the parent socket.

    The no-copy flag is stored in the pointer itself as opposed to externally,
    say in socket flags, to guarantee that the pointer and the flag are copied
    from parent to child socket in an atomic fashion. Parent socket state is
    subject to change while copying, we don't hold any locks at that time.

    This approach relies on an assumption that sk_user_data holds a pointer to
    an object aligned at least 2 bytes. A manual audit of existing users of
    rcu_dereference_sk_user_data helper confirms our assumption.

    Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
    char value or a pathological case of "struct { char c; }". To be safe, warn
    when the flag-bit is set when setting sk_user_data to catch any future
    misuses.

    It is worth considering why clearing sk_user_data unconditionally is not an
    option. There exist users, DRBD, NVMe, and Xen drivers being among them,
    that rely on the pointer being copied when cloning the listening socket.

    Potentially we could distinguish these users by checking if the listening
    socket has been created in kernel-space via sock_create_kern, and hence has
    sk_kern_sock flag set. However, this is not the case for NVMe and Xen
    drivers, which create sockets without marking them as belonging to the
    kernel.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com

    Jakub Sitnicki
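
The low-bit tagging scheme can be illustrated in portable C. This is a sketch only: USER_DATA_NOCOPY and the helper names are hypothetical, and a plain uintptr_t field stands in for the RCU-protected sk_user_data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define USER_DATA_NOCOPY ((uintptr_t)1)     /* hypothetical flag: lowest bit */

struct sock { uintptr_t user_data; };

/* Store a pointer, optionally tagged as "do not copy to children". */
static void set_user_data(struct sock *sk, void *ptr, int nocopy)
{
    /* Relies on the object being at least 2-byte aligned. */
    assert(((uintptr_t)ptr & USER_DATA_NOCOPY) == 0);
    sk->user_data = (uintptr_t)ptr | (nocopy ? USER_DATA_NOCOPY : 0);
}

/* Strip the flag bit to recover the original pointer. */
static void *get_user_data(const struct sock *sk)
{
    return (void *)(sk->user_data & ~USER_DATA_NOCOPY);
}

/* Clone: pointer and flag arrive in one read, so they cannot disagree. */
static void sock_clone(struct sock *child, const struct sock *parent)
{
    uintptr_t v = parent->user_data;

    child->user_data = (v & USER_DATA_NOCOPY) ? 0 : v;
}
```

A tagged pointer is cleared in the child, while an untagged one is inherited, which matches the requirement that some users still depend on the copy.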
     

23 Jan, 2020

1 commit

  • As John Fastabend reports [0], psock state tear-down can happen on receive
    path *after* unlocking the socket, if the only other psock user, that is
    sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
    does so:

    tcp_bpf_recvmsg()
    psock = sk_psock_get(sk)
    Signed-off-by: Jakub Sitnicki
    Acked-by: John Fastabend
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     

16 Jan, 2020

1 commit

  • The sock_map_free() and sock_hash_free() paths used to delete sockmap
    and sockhash maps walk the maps and destroy psock and bpf state associated
    with the socks in the map. When done the socks no longer have BPF programs
    attached and will function normally. This can happen while the socks in
    the map are still "live" meaning data may be sent/received during the walk.

    Currently, though, we don't take the sock_lock when the psock and bpf state
    is removed through this path. Specifically, this means we can be writing
    into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
    while they are also being called from the networking side. This is not
    safe: even if we believed it was, we never used proper
    READ_ONCE/WRITE_ONCE semantics here. Further, it's not clear to me it's
    even a good idea to try and do this on "live" sockets while the networking
    side might also be using the socket. Instead of trying to reason about
    using the socks from both sides, let's realize that every use case I'm
    aware of rarely deletes maps; in fact the kubernetes/Cilium case builds the
    map at init and never tears it down except on errors. So let's do the
    simple fix and grab the sock lock.

    This patch wraps sock deletes from maps in sock lock and adds some
    annotations so we catch any other cases easier.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com

    John Fastabend
     

29 Nov, 2019

1 commit

  • TLS 1.3 started using the entry at the end of the SG array
    for chaining-in the single byte content type entry. This mostly
    works:

    [ E E E E E E . . ]
      ^           ^
      start       end

                    E < content type
                   /
    [ E E E E E E C . ]
      ^           ^
      start       end

    (Where E denotes a populated SG entry; C denotes a chaining entry.)

    If the array is full, however, the end will point to the start:

    [ E E E E E E E E ]
      ^
      start
      end

    And we end up overwriting the start:

        E < content type
       /
    [ C E E E E E E E ]
      ^
      start
      end

    The sg array is supposed to be a circular buffer with start and
    end markers pointing anywhere. In case where start > end
    (i.e. the circular buffer has "wrapped") there is an extra entry
    reserved at the end to chain the two halves together.

    [ E E E E E E . . l ]

    (Where l is the reserved entry for "looping" back to front.)

    As suggested by John, let's reserve another entry for chaining
    SG entries after the main circular buffer. Note that this entry
    has to be pointed to by the end entry so its position is not fixed.

    Examples of full messages:

    [ E E E E E E E E . l ]
      ^               ^
      start           end


    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
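
The effect of one spare slot can be shown with a generic ring-index model. This is not the sk_msg layout: MAX_FRAGS, NR_IDS and the ring_* helpers are hypothetical, and the separate "looping" entry is left out. The point is only that with a spare index position, end never lands on start when the ring is full, so whatever is written at end cannot overwrite the first entry.

```c
#include <assert.h>

#define MAX_FRAGS 8                 /* usable entries */
#define NR_IDS (MAX_FRAGS + 1)      /* index positions, one spare */

struct ring { int start, end; };

static int ring_next(int i)
{
    return (i + 1) % NR_IDS;
}

/* Number of populated entries between start and end, with wrapping. */
static int ring_dist(const struct ring *r)
{
    return r->end >= r->start ? r->end - r->start
                              : r->end + NR_IDS - r->start;
}

static int ring_full(const struct ring *r)
{
    return ring_dist(r) == MAX_FRAGS;
}

/* Claim the slot at end; fails once MAX_FRAGS entries are populated. */
static int ring_add(struct ring *r)
{
    if (ring_full(r))
        return -1;
    r->end = ring_next(r->end);
    return 0;
}
```

Starting from index 5 and adding eight entries wraps end around to index 4: the ring is full, yet end still points at a free slot rather than at start.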
     

22 Nov, 2019

1 commit

  • Report from Dan Carpenter,

    net/core/skmsg.c:792 sk_psock_write_space()
    error: we previously assumed 'psock' could be null (see line 790)

    net/core/skmsg.c
       789  psock = sk_psock(sk);
       790  if (likely(psock && sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)))
                       Check for NULL
       791          schedule_work(&psock->work);
       792  write_space = psock->saved_write_space;
                          ^^^^^^^^^^^^^^^^^^^^^^^^
       793  rcu_read_unlock();
       794  write_space(sk);

    Ensure psock dereference on line 792 only occurs if psock is not null.
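
    One way to structure such a guard is sketched below as a userspace
    model. The types struct psock_model and struct sock_model, and the
    function names, are hypothetical stand-ins for illustration; locking and
    the real work scheduling are left out. The sketch only shows the shape
    of the check: every dereference of psock sits inside the NULL test, and
    the saved callback is invoked only if one was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock_model;
typedef void (*write_space_fn)(struct sock_model *sk);

/* Hypothetical stand-in for the per-socket psock state. */
struct psock_model {
    bool tx_enabled;
    int work_scheduled;               /* counts scheduled work items */
    write_space_fn saved_write_space; /* callback saved from the socket */
};

struct sock_model {
    struct psock_model *psock; /* may be NULL */
    int calls;                 /* counts write_space invocations */
};

static void count_write_space(struct sock_model *sk)
{
    sk->calls++;
}

static void write_space_model(struct sock_model *sk)
{
    write_space_fn write_space = NULL;
    struct psock_model *psock = sk->psock;

    if (psock) {
        /* Both uses of psock are covered by the NULL check. */
        if (psock->tx_enabled)
            psock->work_scheduled++;
        write_space = psock->saved_write_space;
    }
    if (write_space)
        write_space(sk);
}
```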

    Reported-by: Dan Carpenter
    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

06 Nov, 2019

1 commit

  • sk_msg_trim() tries to only update curr pointer if it falls into
    the trimmed region. The logic, however, does not take into
    account the pointer wrapping that sk_msg_iter_var_prev() does, nor
    (as John points out) the fact that msg->sg is a ring buffer.

    This means that when the message was trimmed completely, the new
    curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
    neither smaller than any other value, nor would it actually be
    correct.

    Special case the trimming to 0 length a little bit and rework
    the comparison between curr and end to take into account wrapping.

    This bug caused the TLS code to not copy all of the message, if
    zero copy filled in fewer sg entries than memcopy would need.
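
    A wrap-aware comparison of this kind can be sketched in userspace as
    follows. RING_IDS and all function names here are hypothetical and
    chosen for illustration only. The idea is to compare distances measured
    from start instead of raw indices, and to treat a trim to zero length as
    its own case.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_IDS 9 /* hypothetical ring index space, for illustration only */

/* Distance from start to idx, walking forward around the ring. */
static unsigned int ring_dist(unsigned int start, unsigned int idx)
{
    return idx >= start ? idx - start : idx + (RING_IDS - start);
}

/* True when curr lies at or beyond end, measured from start. */
static bool curr_was_trimmed(unsigned int start, unsigned int curr,
                             unsigned int end)
{
    return ring_dist(start, curr) >= ring_dist(start, end);
}

/*
 * Choose curr after a trim. 'size' is the number of bytes left in the
 * message and 'last' is the index of the last remaining entry.
 */
static unsigned int curr_after_trim(unsigned int start, unsigned int curr,
                                    unsigned int end, unsigned int size,
                                    unsigned int last)
{
    assert(start < RING_IDS && curr < RING_IDS && end < RING_IDS);
    if (size == 0)
        return start; /* trimmed to nothing: restart from the beginning */
    if (curr_was_trimmed(start, curr, end))
        return last;  /* curr fell into the trimmed region */
    return curr;      /* curr is still inside the message */
}
```

    For example, with start = 7, end = 1 and curr = 8, a raw index
    comparison would see curr > end, while the distance-based check sees
    that curr is still one entry into a three-entry message.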

    Big thanks to Alexander Potapenko for the non-KMSAN reproducer.

    v2:
    - take into account that msg->sg is a ring buffer (John).

    Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1)

    Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
    Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com
    Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com
    Co-developed-by: John Fastabend
    Signed-off-by: Jakub Kicinski
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

25 Aug, 2019

1 commit


22 Jul, 2019

1 commit

  • When a map free is called and in parallel a socket is closed we
    have two paths that can potentially reset the socket prot ops, the
    bpf close() path and the map free path. This creates a problem
    with which prot ops should be used from the socket closed side.

    If the map_free side completes first then we want to call the
    original lowest level ops. However, if the tls path runs first
    we want to call the sockmap ops. Additionally, there was no locking
    around prot updates in TLS code paths, so the prot ops could
    be changed multiple times, once from the TLS path and again from the
    sockmap side, potentially leaving ops pointed at either TLS or sockmap
    when the psock and/or tls context have already been destroyed.

    To fix this race, first only update ops inside the callback lock
    so that TLS, sockmap and lowest level all agree on prot state.
    Second, add a ULP callback update() so that lower layers can
    inform the upper layer when they are being removed, allowing the
    upper layer to reset prot ops.

    This gets us close to allowing sockmap and tls to be stacked
    in arbitrary order, but we will save that patch for the *next trees.

    v4:
    - make sure we don't free things for device;
    - remove the checks which swap the callbacks back
    only if TLS is at the top.

    Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
    Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
    Signed-off-by: John Fastabend
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Signed-off-by: Daniel Borkmann

    John Fastabend
     

14 May, 2019

2 commits

  • When converting a skb to msg->sg we forget to set the size after the
    latest ktls/tls code conversion. This path can be reached by doing
    a redir into the ingress path from the BPF skb sock recv hook. Trying to
    read the size then fails.

    Fix this by setting the size.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
     
  • If we try to call strp_done on a parser that has never been
    initialized (because the sockmap user is only using the TX side, for
    example), we get the following error.

    [ 883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0
    ...
    [ 883.422095] Workqueue: events sk_psock_destroy_deferred
    [ 883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0

    This had been wrapped in an 'if (psock->parser.enabled)' check, which
    was broken: strp_done() was never actually called, because the
    strp_stop() we do earlier in the tear down logic sets parser.enabled
    to false. This could result in a use after free if work was still in
    the queue, and was resolved by the patch
    1d79895aef18f ("sk_msg: Always cancel strp work before freeing the
    psock"). However, calling strp_stop(), done by the patch marked in
    the fixes tag, is only useful if we never initialized a strp parser
    program and never initialized the strp to start with, because if
    we had initialized a stream parser, strp_stop() would have been called
    by sk_psock_drop() earlier in the tear down process. By forcing the
    strp to stop we get past the WARNING in strp_done that checks
    the stopped flag, but calling cancel_work_sync on work that has never
    been initialized is also wrong and generates the warning above.

    To fix this, check whether the parser program exists. If the program
    exists then the strp work has been initialized and must be sync'd and
    cancelled before free'ing any structures. If no program exists, we
    never initialized the stream parser in the first place, so skip the
    sync/cancel logic implemented by strp_done.

    Finally, remove the strp_done; it's not needed and, in the case where we
    are using the stream parser, has already been called.
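
    The "only cancel work that was ever initialized" rule can be sketched in
    a userspace model. The types and names below (struct parser_model,
    struct psock_model, parser_done_model, psock_destroy_model) are
    hypothetical stand-ins, and the assert plays the role of the kernel
    warning: it fires if the cancel step runs on uninitialized work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model; these types and names are hypothetical stand-ins. */
struct parser_model {
    bool work_initialized; /* set only if a parser was ever set up */
    int  work_cancelled;   /* counts cancel-and-sync calls */
};

struct psock_model {
    const void *parser_prog; /* non-NULL when a parser program is attached */
    struct parser_model parser;
};

/* Stand-in for the "done" step: cancelling is only valid once initialized. */
static void parser_done_model(struct parser_model *p)
{
    assert(p->work_initialized);
    p->work_cancelled++;
}

/* Teardown: run the cancel/sync step only if a parser program exists. */
static void psock_destroy_model(struct psock_model *psock)
{
    if (psock->parser_prog)
        parser_done_model(&psock->parser);
}
```

    A TX-only user has no parser program in this model, so teardown skips
    the cancel step entirely instead of touching uninitialized work.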

    Fixes: e8e3437762ad9 ("bpf: Stop the psock parser before canceling its work")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
     

07 Mar, 2019

1 commit

  • We might have never enabled (started) the psock's parser, in which case it
    will not get stopped when destroying the psock. This leads to a warning
    when trying to cancel parser's work from psock's deferred destructor:

    [ 405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40
    [ 405.326712] Modules linked in: [last unloaded: test_bpf]
    [ 405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G W 5.0.0 #42
    [ 405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
    [ 405.329712] Workqueue: events sk_psock_destroy_deferred
    [ 405.330254] RIP: 0010:strp_done+0x3c/0x40
    [ 405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b
    [ 405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246
    [ 405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8
    [ 405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640
    [ 405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665
    [ 405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600
    [ 405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8
    [ 405.338777] FS: 0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
    [ 405.339903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0
    [ 405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 405.344415] Call Trace:
    [ 405.344821] sk_psock_destroy_deferred+0x23/0x1b0
    [ 405.345585] process_one_work+0x1ae/0x3e0
    [ 405.346110] worker_thread+0x3c/0x3b0
    [ 405.346576] ? pwq_unbound_release_workfn+0xd0/0xd0
    [ 405.347187] kthread+0x11d/0x140
    [ 405.347601] ? __kthread_parkme+0x80/0x80
    [ 405.348108] ret_from_fork+0x35/0x40
    [ 405.348566] ---[ end trace a4a3af4026a327d4 ]---

    Stop psock's parser just before canceling its work.

    Fixes: 1d79895aef18 ("sk_msg: Always cancel strp work before freeing the psock")
    Reported-by: kernel test robot
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann

    Jakub Sitnicki
     

09 Feb, 2019

1 commit


29 Jan, 2019

1 commit

  • Despite having stopped the parser, we still need to deinitialize it
    by calling strp_done so that it cancels its work. Otherwise the worker
    thread can run after we have freed the parser, and attempt to access
    its workqueue resulting in a use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in pwq_activate_delayed_work+0x1b/0x1d0
    Read of size 8 at addr ffff888069975240 by task kworker/u2:2/93

    CPU: 0 PID: 93 Comm: kworker/u2:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
    Workqueue: (null) (kstrp)
    Call Trace:
    print_address_description+0x6e/0x2b0
    ? pwq_activate_delayed_work+0x1b/0x1d0
    kasan_report+0xfd/0x177
    ? pwq_activate_delayed_work+0x1b/0x1d0
    ? pwq_activate_delayed_work+0x1b/0x1d0
    pwq_activate_delayed_work+0x1b/0x1d0
    ? process_one_work+0x4aa/0x660
    pwq_dec_nr_in_flight+0x9b/0x100
    worker_thread+0x82/0x680
    ? process_one_work+0x660/0x660
    kthread+0x1b9/0x1e0
    ? __kthread_create_on_node+0x250/0x250
    ret_from_fork+0x1f/0x30

    Allocated by task 111:
    sk_psock_init+0x3c/0x1b0
    sock_map_link.isra.2+0x103/0x4b0
    sock_map_update_common+0x94/0x270
    sock_map_update_elem+0x145/0x160
    __se_sys_bpf+0x152e/0x1e10
    do_syscall_64+0xb2/0x3e0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 112:
    kfree+0x7f/0x140
    process_one_work+0x40b/0x660
    worker_thread+0x82/0x680
    kthread+0x1b9/0x1e0
    ret_from_fork+0x1f/0x30

    The buggy address belongs to the object at ffff888069975180
    which belongs to the cache kmalloc-512 of size 512
    The buggy address is located 192 bytes inside of
    512-byte region [ffff888069975180, ffff888069975380)
    The buggy address belongs to the page:
    page:ffffea0001a65d00 count:1 mapcount:0 mapping:ffff88806d401280 index:0x0 compound_mapcount: 0
    flags: 0x4000000000010200(slab|head)
    raw: 4000000000010200 dead000000000100 dead000000000200 ffff88806d401280
    raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888069975100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff888069975180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff888069975200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff888069975280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888069975300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    Reported-by: Marek Majkowski
    Signed-off-by: Jakub Sitnicki
    Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Jakub Sitnicki
     

18 Jan, 2019

1 commit

  • Function sk_msg_clone has been modified to merge the data from the source
    sg entry into the destination sg entry if the cloned data resides in the
    same page and is contiguous to the end entry of the destination sk_msg.
    This improves kernel tls throughput by about 10%.

    When the user space tls application calls sendmsg() with MSG_MORE, it leads
    to calling sk_msg_clone() with new data being cloned placed contiguous to
    previously cloned data. Without this optimization, a new SG entry in
    the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
    gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
    even before a full 16K of allowable record data is accumulated. Hence we
    lose the opportunity to encrypt and send a full 16K record.

    With this patch, the kernel tls can accumulate a full 16K of record data
    irrespective of the size of data passed in sendmsg() with MSG_MORE.
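
    The merge condition described above can be sketched in userspace as
    follows. This is an illustration only: struct entry, struct msg,
    MAX_ENTRIES and msg_append are hypothetical names, and a page is
    represented by a plain pointer. A new chunk extends the last entry when
    it is on the same page and starts exactly where that entry ends;
    otherwise it takes a fresh entry, if one is still free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 4 /* hypothetical small limit, for illustration */

struct entry {
    const void *page;    /* stand-in for the backing page */
    unsigned int offset; /* start of the data within the page */
    unsigned int length; /* bytes covered by this entry */
};

struct msg {
    struct entry e[MAX_ENTRIES];
    size_t used; /* number of populated entries */
};

/* Append a chunk, merging into the last entry when it is contiguous. */
static bool msg_append(struct msg *dst, const void *page,
                       unsigned int offset, unsigned int length)
{
    if (dst->used > 0) {
        struct entry *last = &dst->e[dst->used - 1];

        if (last->page == page && last->offset + last->length == offset) {
            last->length += length; /* same page, contiguous: no new entry */
            return true;
        }
    }
    if (dst->used == MAX_ENTRIES)
        return false; /* refuse instead of overflowing the entry array */
    dst->e[dst->used++] = (struct entry){ page, offset, length };
    return true;
}
```

    Repeated small appends that land back to back on one page consume a
    single entry in this model, which is why the entries are not exhausted
    before the record fills up.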

    Signed-off-by: Vakul Garg
    Signed-off-by: David S. Miller

    Vakul Garg
     

28 Dec, 2018

1 commit

  • Pull networking updates from David Miller:

    1) New ipset extensions for matching on destination MAC addresses, from
    Stefano Brivio.

    2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
    nfp driver. From Stefano Brivio.

    3) Implement GRO for plain UDP sockets, from Paolo Abeni.

    4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
    bit so that we could support the entire vlan_tci value.

    5) Rework the IPSEC policy lookups to better optimize more usecases,
    from Florian Westphal.

    6) Infrastructure changes eliminating direct manipulation of SKB lists
    wherever possible, and to always use the appropriate SKB list
    helpers. This work is still ongoing...

    7) Lots of PHY driver and state machine improvements and
    simplifications, from Heiner Kallweit.

    8) Various TSO deferral refinements, from Eric Dumazet.

    9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.

    10) Batch dropping of XDP packets in tuntap, from Jason Wang.

    11) Lots of cleanups and improvements to the r8169 driver from Heiner
    Kallweit, including support for ->xmit_more. This driver has been
    getting some much needed love since he started working on it.

    12) Lots of new forwarding selftests from Petr Machata.

    13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.

    14) Packed ring support for virtio, from Tiwei Bie.

    15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.

    16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.

    17) Implement coalescing on TCP backlog queue, from Eric Dumazet.

    18) Implement carrier change in tun driver, from Nicolas Dichtel.

    19) Support msg_zerocopy in UDP, from Willem de Bruijn.

    20) Significantly improve garbage collection of neighbor objects when
    the table has many PERMANENT entries, from David Ahern.

    21) Remove egdev usage from nfp and mlx5, and remove the facility
    completely from the tree as it no longer has any users. From Oz
    Shlomo and others.

    22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
    therefore abort the operation before the commit phase (which is the
    NETDEV_CHANGEADDR event). From Petr Machata.

    23) Add indirect call wrappers to avoid retpoline overhead, and use them
    in the GRO code paths. From Paolo Abeni.

    24) Add support for netlink FDB get operations, from Roopa Prabhu.

    25) Support bloom filter in mlxsw driver, from Nir Dotan.

    26) Add SKB extension infrastructure. This consolidates the handling of
    the auxiliary SKB data used by IPSEC and bridge netfilter, and is
    designed to support the needs of MPTCP which could be integrated in
    the future.

    27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
    net: dccp: fix kernel crash on module load
    drivers/net: appletalk/cops: remove redundant if statement and mask
    bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
    net/net_namespace: Check the return value of register_pernet_subsys()
    net/netlink_compat: Fix a missing check of nla_parse_nested
    ieee802154: lowpan_header_create check must check daddr
    net/mlx4_core: drop useless LIST_HEAD
    mlxsw: spectrum: drop useless LIST_HEAD
    net/mlx5e: drop useless LIST_HEAD
    iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
    net/mlx5e: fix semicolon.cocci warnings
    staging: octeon: fix build failure with XFRM enabled
    net: Revert recent Spectre-v1 patches.
    can: af_can: Fix Spectre v1 vulnerability
    packet: validate address length if non-zero
    nfc: af_nfc: Fix Spectre v1 vulnerability
    phonet: af_phonet: Fix Spectre v1 vulnerability
    net: core: Fix Spectre v1 vulnerability
    net: minor cleanup in skb_ext_add()
    net: drop the unused helper skb_ext_get()
    ...

    Linus Torvalds
     

27 Dec, 2018

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions to
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     

22 Dec, 2018

2 commits

  • David S. Miller
     
  • Fixed function sk_msg_clone() to prevent overflow of 'dst' while adding
    pages in scatterlist entries. The overflow of 'dst' causes a crash in the
    kernel tls module while doing record encryption.

    The crash fixed by this patch:

    [ 78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    [ 78.804900] Mem abort info:
    [ 78.807683] ESR = 0x96000004
    [ 78.810744] Exception class = DABT (current EL), IL = 32 bits
    [ 78.816677] SET = 0, FnV = 0
    [ 78.819727] EA = 0, S1PTW = 0
    [ 78.822873] Data abort info:
    [ 78.825759] ISV = 0, ISS = 0x00000004
    [ 78.829600] CM = 0, WnR = 0
    [ 78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
    [ 78.839195] [0000000000000008] pgd=0000000000000000
    [ 78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
    [ 78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
    [ 78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da63145-dirty #107
    [ 78.874149] Hardware name: LS1043A RDB Board (DT)
    [ 78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
    [ 78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
    [ 78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
    [ 78.893366] sp : ffff00001d04b600
    [ 78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
    [ 78.901970] x27: 0000000000000000 x26: ffff80006c8de786
    [ 78.907272] x25: ffff00001d04b760 x24: 000000000000001a
    [ 78.912573] x23: 0000000000000006 x22: ffff80006814e440
    [ 78.917874] x21: 0000000000000100 x20: 0000000000000000
    [ 78.923175] x19: 000081ffffffffff x18: 0000000000000400
    [ 78.928476] x17: 0000000000000008 x16: 0000000000000000
    [ 78.933778] x15: 0000000000000100 x14: 0000000000000001
    [ 78.939079] x13: 0000000000001080 x12: 0000000000000020
    [ 78.944381] x11: 0000000000001080 x10: 00000000ffff0002
    [ 78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
    [ 78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
    [ 78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
    [ 78.965588] x3 : 0000000000000000 x2 : 0000000000001086
    [ 78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
    [ 78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
    [ 78.982968] Call trace:
    [ 78.985406] scatterwalk_copychunks+0x164/0x1c8
    [ 78.989927] skcipher_walk_next+0x28c/0x448
    [ 78.994099] skcipher_walk_done+0xfc/0x258
    [ 78.998187] gcm_encrypt+0x434/0x4c0
    [ 79.001758] tls_push_record+0x354/0xa58 [tls]
    [ 79.006194] bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
    [ 79.010978] tls_sw_sendmsg+0x650/0x780 [tls]
    [ 79.015326] inet_sendmsg+0x2c/0xf8
    [ 79.018806] sock_sendmsg+0x18/0x30
    [ 79.022284] __sys_sendto+0x104/0x138
    [ 79.025935] __arm64_sys_sendto+0x24/0x30
    [ 79.029936] el0_svc_common+0x60/0xe8
    [ 79.033588] el0_svc_handler+0x2c/0x80
    [ 79.037327] el0_svc+0x8/0xc
    [ 79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
    [ 79.046283] ---[ end trace 74db007d069c1cf7 ]---

    Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
    Signed-off-by: Vakul Garg
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Vakul Garg
     

21 Dec, 2018

2 commits

  • In addition to releasing any cork'ed data on a psock when the psock
    is removed we should also release any skb's in the ingress work queue.
    Otherwise the skb's eventually get free'd but late in the tear
    down process so we see the WARNING due to non-zero sk_forward_alloc.

    void sk_stream_kill_queues(struct sock *sk)
    {
            ...
            WARN_ON(sk->sk_forward_alloc);
            ...
    }

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
     
  • When a skb verdict program is in use and either another BPF program
    redirects to that socket or the new SK_PASS support is used, the
    data_ready callback does not wake up the application. Instead, because
    the stream parser/verdict is using the sk data_ready callback, we wake
    up the stream parser/verdict block.

    Fix this by adding a helper to check if the stream parser block is
    enabled on the sk and, if so, call the saved pointer, which is the
    upper layer's wake-up function.

    This fixes application stalls observed when an application is waiting
    for data in a blocking read().
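
    The dispatch that such a helper performs can be sketched in a userspace
    model. The types struct psock_model and struct sock_model and the
    function names are hypothetical stand-ins for illustration. When the
    parser has taken over the socket's data_ready callback, the saved
    pointer is the one that reaches the application; otherwise the socket's
    own callback already does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock_model;
typedef void (*data_ready_fn)(struct sock_model *sk);

/* Hypothetical stand-in for the psock's parser state. */
struct psock_model {
    bool parser_enabled;
    data_ready_fn saved_data_ready; /* the upper layer's original wake-up */
};

struct sock_model {
    data_ready_fn data_ready; /* current callback, possibly the parser's */
    int app_wakeups;
    int parser_wakeups;
};

static void app_wakeup(struct sock_model *sk)    { sk->app_wakeups++; }
static void parser_wakeup(struct sock_model *sk) { sk->parser_wakeups++; }

/* Wake the application, bypassing the parser's hook if it is installed. */
static void psock_data_ready_model(struct sock_model *sk,
                                   const struct psock_model *psock)
{
    assert(sk->data_ready);
    if (psock->parser_enabled)
        psock->saved_data_ready(sk);
    else
        sk->data_ready(sk);
}
```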

    Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend