08 Sep, 2022

1 commit

  • [ Upstream commit 583585e48d965338e73e1eb383768d16e0922d73 ]

    Fix one kernel NULL pointer dereference as below:

    [ 224.462334] Call Trace:
    [ 224.462394] __tcp_bpf_recvmsg+0xd3/0x380
    [ 224.462441] ? sock_has_perm+0x78/0xa0
    [ 224.462463] tcp_bpf_recvmsg+0x12e/0x220
    [ 224.462494] inet_recvmsg+0x5b/0xd0
    [ 224.462534] __sys_recvfrom+0xc8/0x130
    [ 224.462574] ? syscall_trace_enter+0x1df/0x2e0
    [ 224.462606] ? __do_page_fault+0x2de/0x500
    [ 224.462635] __x64_sys_recvfrom+0x24/0x30
    [ 224.462660] do_syscall_64+0x5d/0x1d0
    [ 224.462709] entry_SYSCALL_64_after_hwframe+0x65/0xca

    In commit 9974d37ea75f ("skmsg: Fix invalid last sg check in
    sk_msg_recvmsg()"), we change last sg check to sg_is_last(),
    but in sockmap redirection case (without stream_parser/stream_verdict/
    skb_verdict), we did not mark the end of the scatterlist. Check the
    sk_msg_alloc, sk_msg_page_add, and bpf_msg_push_data functions, they all
    do not mark the end of sg. They are expected to use sg.end for end
    judgment. So the judgment of '(i != msg_rx->sg.end)' is added back here.

    Fixes: 9974d37ea75f ("skmsg: Fix invalid last sg check in sk_msg_recvmsg()")
    Signed-off-by: Liu Jian
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20220809094915.150391-1-liujian56@huawei.com
    Signed-off-by: Sasha Levin

    Liu Jian
     

05 Sep, 2022

1 commit

  • commit 2a0133723f9ebeb751cfce19f74ec07e108bef1f upstream.

    Syzkaller reports refcount bug as follows:
    ------------[ cut here ]------------
    refcount_t: saturated; leaking memory.
    WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
    Modules linked in:
    CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0

    __refcount_add_not_zero include/linux/refcount.h:163 [inline]
    __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
    refcount_inc_not_zero include/linux/refcount.h:245 [inline]
    sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
    tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
    tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
    tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
    tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
    tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
    sk_backlog_rcv include/net/sock.h:1061 [inline]
    __release_sock+0x134/0x3b0 net/core/sock.c:2849
    release_sock+0x54/0x1b0 net/core/sock.c:3404
    inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
    __sys_shutdown_sock net/socket.c:2331 [inline]
    __sys_shutdown_sock net/socket.c:2325 [inline]
    __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
    __do_sys_shutdown net/socket.c:2351 [inline]
    __se_sys_shutdown net/socket.c:2349 [inline]
    __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    During SMC fallback process in connect syscall, kernel will
    replaces TCP with SMC. In order to forward wakeup
    smc socket waitqueue after fallback, kernel will sets
    clcsk->sk_user_data to origin smc socket in
    smc_fback_replace_callbacks().

    Later, in shutdown syscall, kernel will calls
    sk_psock_get(), which treats the clcsk->sk_user_data
    as psock type, triggering the refcnt warning.

    So, the root cause is that smc and psock, both will use
    sk_user_data field. So they will mismatch this field
    easily.

    This patch solves it by using another bit(defined as
    SK_USER_DATA_PSOCK) in PTRMASK, to mark whether
    sk_user_data points to a psock object or not.
    This patch depends on a PTRMASK introduced in commit f1ff5ce2cd5e
    ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

    For there will possibly be more flags in the sk_user_data field,
    this patch also refactor sk_user_data flags code to be more generic
    to improve its maintainability.

    Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
    Suggested-by: Jakub Kicinski
    Acked-by: Wen Gu
    Signed-off-by: Hawkins Jiawei
    Reviewed-by: Jakub Sitnicki
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Hawkins Jiawei
     

17 Aug, 2022

1 commit

  • [ Upstream commit 9974d37ea75f01b47d16072b5dad305bd8d23fcc ]

    In sk_psock_skb_ingress_enqueue function, if the linear area + nr_frags +
    frag_list of the SKB has NR_MSG_FRAG_IDS blocks in total, skb_to_sgvec
    will return NR_MSG_FRAG_IDS, then msg->sg.end will be set to
    NR_MSG_FRAG_IDS, and in addition, (NR_MSG_FRAG_IDS - 1) is set to the last
    SG of msg. Recv the msg in sk_msg_recvmsg, when i is (NR_MSG_FRAG_IDS - 1),
    the sk_msg_iter_var_next(i) will change i to 0 (not NR_MSG_FRAG_IDS), the
    judgment condition "msg_rx->sg.start==msg_rx->sg.end" and
    "i != msg_rx->sg.end" can not work.

    As a result, the processed msg cannot be deleted from ingress_msg list.
    But the length of all the sge of the msg has changed to 0. Then the next
    recvmsg syscall will process the msg repeatedly, because the length of sge
    is 0, the -EFAULT error is always returned.

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20220628123616.186950-1-liujian56@huawei.com
    Signed-off-by: Sasha Levin

    Liu Jian
     

29 Jun, 2022

1 commit

  • [ Upstream commit e34a07c0ae3906f97eb18df50902e2a01c1015b6 ]

    Commit 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to
    the new tcp_bpf_update_proto() function. I'm guessing that this
    was done to allow creating psocks for non-inet sockets.

    Unfortunately the destruction path for psock includes the ULP
    unwind, so we need to fail the sk_psock_init() itself.
    Otherwise if ULP is already present we'll notice that later,
    and call tcp_update_ulp() with the sk_proto of the ULP
    itself, which will most likely result in the ULP looping
    its callbacks.

    Fixes: 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: John Fastabend
    Reviewed-by: Jakub Sitnicki
    Tested-by: Jakub Sitnicki
    Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org
    Signed-off-by: Paolo Abeni
    Signed-off-by: Sasha Levin

    Jakub Kicinski
     

08 Apr, 2022

1 commit

  • [ Upstream commit 9c34e38c4a870eb30b13f42f5b44f42e9d19ccb8 ]

    If tcp_bpf_sendmsg() is running while sk msg is full. When sk_msg_alloc()
    returns -ENOMEM error, tcp_bpf_sendmsg() goes to wait_for_memory. If partial
    memory has been alloced by sk_msg_alloc(), that is, msg_tx->sg.size is
    greater than osize after sk_msg_alloc(), memleak occurs. To fix we use
    sk_msg_trim() to release the allocated memory, then goto wait for memory.

    Other call paths of sk_msg_alloc() have the similar issue, such as
    tls_sw_sendmsg(), so handle sk_msg_trim logic inside sk_msg_alloc(),
    as Cong Wang suggested.

    This issue can cause the following info:
    WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
    Call Trace:

    inet_csk_destroy_sock+0x55/0x110
    __tcp_close+0x279/0x470
    tcp_close+0x1f/0x60
    inet_release+0x3f/0x80
    __sock_release+0x3d/0xb0
    sock_close+0x11/0x20
    __fput+0x92/0x250
    task_work_run+0x6a/0xa0
    do_exit+0x33b/0xb60
    do_group_exit+0x2f/0xa0
    get_signal+0xb6/0x950
    arch_do_signal_or_restart+0xac/0x2a0
    exit_to_user_mode_prepare+0xa9/0x200
    syscall_exit_to_user_mode+0x12/0x30
    do_syscall_64+0x46/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    WARNING: CPU: 3 PID: 2094 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
    Call Trace:

    __sk_destruct+0x24/0x1f0
    sk_psock_destroy+0x19b/0x1c0
    process_one_work+0x1b3/0x3c0
    kthread+0xe6/0x110
    ret_from_fork+0x22/0x30

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Wang Yufen
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20220304081145.2037182-3-wangyufen@huawei.com
    Signed-off-by: Sasha Levin

    Wang Yufen
     

09 Mar, 2022

1 commit

  • commit 60ce37b03917e593d8e5d8bcc7ec820773daf81d upstream.

    Currently, sk_psock_verdict_recv() returns skb->len

    This is problematic because tcp_read_sock() might have
    passed orig_len < skb->len, due to the presence of TCP urgent data.

    This causes an infinite loop from tcp_read_sock()

    Followup patch will make tcp_read_sock() more robust vs bad actors.

    Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program")
    Reported-by: syzbot
    Signed-off-by: Eric Dumazet
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Tested-by: Jakub Sitnicki
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/r/20220302161723.3910001-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

14 Dec, 2021

1 commit

  • commit c0d95d3380ee099d735e08618c0d599e72f6c8b0 upstream.

    When a sock is added to a sock map we evaluate what proto op hooks need to
    be used. However, when the program is removed from the sock map we have not
    been evaluating if that changes the required program layout.

    Before the patch listed in the 'fixes' tag this was not causing failures
    because the base program set handles all cases. Specifically, the case with
    a stream parser and the case with out a stream parser are both handled. With
    the fix below we identified a race when running with a proto op that attempts
    to read skbs off both the stream parser and the skb->receive_queue. Namely,
    that a race existed where when the stream parser is empty checking the
    skb->receive_queue from recvmsg at the precies moment when the parser is
    paused and the receive_queue is not empty could result in skipping the stream
    parser. This may break a RX policy depending on the parser to run.

    The fix tag then loads a specific proto ops that resolved this race. But, we
    missed removing that proto ops recv hook when the sock is removed from the
    sockmap. The result is the stream parser is stopped so no more skbs will be
    aggregated there, but the hook and BPF program continues to be attached on
    the psock. User space will then get an EBUSY when trying to read the socket
    because the recvmsg() handler is now waiting on a stopped stream parser.

    To fix we rerun the proto ops init() function which will look at the new set
    of progs attached to the psock and rest the proto ops hook to the correct
    handlers. And in the above case where we remove the sock from the sock map
    the RX prog will no longer be listed so the proto ops is removed.

    Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     

19 Nov, 2021

1 commit

  • [ Upstream commit 7303524e04af49a47991e19f895c3b8cdc3796c7 ]

    If sockmap enable strparser, there are lose offset info in
    sk_psock_skb_ingress(). If the length determined by parse_msg function is not
    skb->len, the skb will be converted to sk_msg multiple times, and userspace
    app will get the data multiple times.

    Fix this by get the offset and length from strp_msg. And as Cong suggested,
    add one bit in skb->_sk_redir to distinguish enable or disable strparser.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Cong Wang
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com
    Signed-off-by: Sasha Levin

    Liu Jian
     

27 Oct, 2021

1 commit


28 Jul, 2021

3 commits

  • If backlog handler is running during a tear down operation we may enqueue
    data on the ingress msg queue while tear down is trying to free it.

    sk_psock_backlog()
    sk_psock_handle_skb()
    skb_psock_skb_ingress()
    sk_psock_skb_ingress_enqueue()
    sk_psock_queue_msg(psock,msg)
    spin_lock(ingress_lock)
    sk_psock_zap_ingress()
    _sk_psock_purge_ingerss_msg()
    _sk_psock_purge_ingress_msg()
    -- free ingress_msg list --
    spin_unlock(ingress_lock)
    spin_lock(ingress_lock)
    list_add_tail(msg,ingress_msg)
    Signed-off-by: Andrii Nakryiko
    Acked-by: Jakub Sitnicki
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com

    John Fastabend
     
  • Its possible if a socket is closed and the receive thread is under memory
    pressure it may have cached a skb. We need to ensure these skbs are
    free'd along with the normal ingress_skb queue.

    Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
    down and backlog processing both had sock_lock for the common case of
    socket close or unhash. So it was not possible to have both running in
    parrallel so all we would need is the kfree in those kernels.

    But, latest kernels include the commit 799aa7f98d5e and this requires a
    bit more work. Without the ingress_lock guarding reading/writing the
    state->skb case its possible the tear down could run before the state
    update causing it to leak memory or worse when the backlog reads the state
    it could potentially run interleaved with the tear down and we might end up
    free'ing the state->skb from tear down side but already have the reference
    from backlog side. To resolve such races we wrap accesses in ingress_lock
    on both sides serializing tear down and backlog case. In both cases this
    only happens after an EAGAIN error case so having an extra lock in place
    is likely fine. The normal path will skip the locks.

    Note, we check state->skb before grabbing lock. This works because
    we can only enqueue with the mutex we hold already. Avoiding a race
    on adding state->skb after the check. And if tear down path is running
    that is also fine if the tear down path then removes state->skb we
    will simply set skb=NULL and the subsequent goto is skipped. This
    slight complication avoids locking in normal case.

    With this fix we no longer see this warning splat from tcp side on
    socket close when we hit the above case with redirect to ingress self.

    [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
    [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
    [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G I 5.14.0-rc1alu+ #181
    [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
    [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
    [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
    [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
    [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
    [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
    [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
    [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
    [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
    [224913.935974] FS: 00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
    [224913.935981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
    [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [224913.936007] Call Trace:
    [224913.936016] inet_csk_destroy_sock+0xba/0x1f0
    [224913.936033] __tcp_close+0x620/0x790
    [224913.936047] tcp_close+0x20/0x80
    [224913.936056] inet_release+0x8f/0xf0
    [224913.936070] __sock_release+0x72/0x120
    [224913.936083] sock_close+0x14/0x20

    Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
    Signed-off-by: John Fastabend
    Signed-off-by: Andrii Nakryiko
    Acked-by: Jakub Sitnicki
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com

    John Fastabend
     
  • We don't want strparser to run and pass skbs into skmsg handlers when
    the psock is null. We just sk_drop them in this case. When removing
    a live socket from map it means extra drops that we do not need to
    incur. Move the zap below strparser close to avoid this condition.

    This way we stop the stream parser first stopping it from processing
    packets and then delete the psock.

    Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
    Signed-off-by: John Fastabend
    Signed-off-by: Andrii Nakryiko
    Acked-by: Jakub Sitnicki
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com

    John Fastabend
     

16 Jul, 2021

1 commit

  • If skb_linearize is needed and fails we could leak a msg on the error
    handling. To fix ensure we kfree the msg block before returning error.
    Found during code review.

    Fixes: 4363023d2668e ("bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Cong Wang
    Link: https://lore.kernel.org/bpf/20210712195546.423990-2-john.fastabend@gmail.com

    John Fastabend
     

21 Jun, 2021

6 commits

  • It is hard to observe packet drops without increasing relevant
    drop counters, here we should increase sk->sk_drops which is
    a protocol-independent counter. Fortunately psock is always
    associated with a struct sock, we can just use psock->sk.

    Suggested-by: John Fastabend
    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-9-xiyou.wangcong@gmail.com

    Cong Wang
     
  • sk_psock_skb_redirect() only takes skb as a parameter, we
    will need to know where this skb is from, so just pass
    the source psock to this function as a new parameter.
    This patch prepares for the next one.

    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-8-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently sk_psock_verdict_apply() is void, but it handles some
    error conditions too. Its caller is impossible to learn whether
    it succeeds or fails, especially sk_psock_verdict_recv().

    Make it return int to indicate error cases and propagate errors
    to callers properly.

    Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program")
    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-7-xiyou.wangcong@gmail.com

    Cong Wang
     
  • If the dest psock does not set SK_PSOCK_TX_ENABLED,
    the skb can't be queued anywhere so must be dropped.

    This one is found during code review.

    Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-6-xiyou.wangcong@gmail.com

    Cong Wang
     
  • When we drop skb inside sk_psock_skb_redirect(), we have to clear
    its skb->_sk_redir pointer too, otherwise kfree_skb() would
    misinterpret it as a valid skb->_skb_refdst and dst_release()
    would eventually complain.

    Fixes: e3526bb92a20 ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
    Reported-by: Jiang Wang
    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-5-xiyou.wangcong@gmail.com

    Cong Wang
     
  • I tried to reuse sk_msg_wait_data() for different protocols,
    but it turns out it can not be simply reused. For example,
    UDP actually uses two queues to receive skb:
    udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
    to check both of them to know whether we have received any
    packet.

    Also, UDP does not lock the sock during BH Rx path, it makes
    no sense for its ->recvmsg() to lock the sock. It is always
    possible for ->recvmsg() to be called before packets actually
    arrive in the receive queue, we just use best effort to make
    it accurate here.

    Fixes: 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")
    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com

    Cong Wang
     

10 Apr, 2021

1 commit

  • Conflicts:

    MAINTAINERS
    - keep Chandrasekar
    drivers/net/ethernet/mellanox/mlx5/core/en_main.c
    - simple fix + trust the code re-added to param.c in -next is fine
    include/linux/bpf.h
    - trivial
    include/linux/ethtool.h
    - trivial, fix kdoc while at it
    include/linux/skmsg.h
    - move to relevant place in tcp.c, comment re-wrapped
    net/core/skmsg.c
    - add the sk = sk // sk = NULL around calls
    net/tipc/crypto.c
    - trivial

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

07 Apr, 2021

1 commit

  • Incorrect accounting fwd_alloc can result in a warning when the socket
    is torn down,

    [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
    [...]
    [18455.319543] Call Trace:
    [18455.319556] inet_csk_destroy_sock+0xba/0x1f0
    [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380
    [18455.319593] ? lock_downgrade+0x3a0/0x3a0
    [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0
    [18455.319631] ? sk_reset_timer+0x15/0x70
    [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240
    [18455.319663] ? lock_release+0xb2/0x3f0
    [18455.319676] ? __release_sock+0x8a/0x1b0
    [18455.319690] ? lock_downgrade+0x3a0/0x3a0
    [18455.319704] ? lock_release+0x3f0/0x3f0
    [18455.319717] ? __tcp_close+0x2c6/0x790
    [18455.319736] ? tcp_v4_do_rcv+0x168/0x370
    [18455.319750] tcp_v4_do_rcv+0x168/0x370
    [18455.319767] __release_sock+0xbc/0x1b0
    [18455.319785] __tcp_close+0x2ee/0x790
    [18455.319805] tcp_close+0x20/0x80

    This currently happens because on redirect case we do skb_set_owner_r()
    with the original sock. This increments the fwd_alloc memory accounting
    on the original sock. Then on redirect we may push this into the queue
    of the psock we are redirecting to. When the skb is flushed from the
    queue we give the memory back to the original sock. The problem is if
    the original sock is destroyed/closed with skbs on another psocks queue
    then the original sock will not have a way to reclaim the memory before
    being destroyed. Then above warning will be thrown

    sockA sockB

    sk_psock_strp_read()
    sk_psock_verdict_apply()
    -- SK_REDIRECT --
    sk_psock_skb_redirect()
    skb_queue_tail(psock_other->ingress_skb..)

    sk_close()
    sock_map_unref()
    sk_psock_put()
    sk_psock_drop()
    sk_psock_zap_ingress()

    At this point we have torn down our own psock, but have the outstanding
    skb in psock_other. Note that SK_PASS doesn't have this problem because
    the sk_psock_drop() logic releases the skb, its still associated with
    our psock.

    To resolve lets only account for sockets on the ingress queue that are
    still associated with the current socket. On the redirect case we will
    check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
    until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
    or received with sk_psock_skb_ingress memory will be claimed on psock_other.

    Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
    Reported-by: Andrii Nakryiko
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370

    John Fastabend
     

02 Apr, 2021

8 commits

  • Although these two functions are only used by TCP, they are not
    specific to TCP at all, both operate on skmsg and ingress_msg,
    so fit in net/core/skmsg.c very well.

    And we will need them for non-TCP, so rename and move them to
    skmsg.c and export them to modules.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently sockmap calls into each protocol to update the struct
    proto and replace it. This certainly won't work when the protocol
    is implemented as a module, for example, AF_UNIX.

    Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
    protocol can implement its own way to replace the struct proto.
    This also helps get rid of symbol dependencies on CONFIG_INET.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
    confusing and more importantly we still want to distinguish them
    from user-space. So we can just reuse the stream verdict code but
    introduce a new type of eBPF program, skb_verdict. Users are not
    allowed to attach stream_verdict and skb_verdict programs to the
    same map.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com

    Cong Wang
     
  • This function is only called in process context.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-7-xiyou.wangcong@gmail.com

    Cong Wang
     
  • The RCU callback sk_psock_destroy() only queues work psock->gc,
    so we can just switch to rcu work to simplify the code.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com

    Cong Wang
     
  • We do not have to lock the sock to avoid losing sk_socket,
    instead we can purge all the ingress queues when we close
    the socket. Sending or receiving packets after orphaning
    socket makes no sense.

    We do purge these queues when psock refcnt reaches zero but
    here we want to purge them explicitly in sock_map_close().
    There are also some nasty race conditions on testing bit
    SK_PSOCK_TX_ENABLED and queuing/canceling the psock work,
    we can expand psock->ingress_lock a bit to protect them too.

    As noticed by John, we still have to lock the psock->work,
    because the same work item could be running concurrently on
    different CPU's.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently we rely on lock_sock to protect ingress_msg,
    it is too big for this, we can actually just use a spinlock
    to protect this list like protecting other skb queues.

    __tcp_bpf_recvmsg() is still special because of peeking,
    it still has to use lock_sock.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently we purge the ingress_skb queue only when psock
    refcnt goes down to 0, so locking the queue is not necessary,
    but in order to be called during ->close, we have to lock it
    here.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20210331023237.41094-2-xiyou.wangcong@gmail.com

    Cong Wang
     

27 Feb, 2021

7 commits

  • It is now nearly identical to bpf_prog_run_pin_on_cpu() and
    it has an unused parameter 'psock', so we can just get rid
    of it and call bpf_prog_run_pin_on_cpu() directly.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-9-xiyou.wangcong@gmail.com

    Cong Wang
     
  • It is only used within skmsg.c so can become static.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-8-xiyou.wangcong@gmail.com

    Cong Wang
     
  • These two eBPF programs are tied to BPF_SK_SKB_STREAM_PARSER
    and BPF_SK_SKB_STREAM_VERDICT, rename them to reflect the fact
    they are only used for TCP. And save the name 'skb_verdict' for
    general use later.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Lorenz Bauer
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-6-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly
    does not work for any other non-TCP protocols. We can move them to
    skb ext, but it introduces a memory allocation on fast path.

    Fortunately, we only need to a word-size to store all the information,
    because the flags actually only contains 1 bit so can be just packed
    into the lowest bit of the "pointer", which is stored as unsigned
    long.

    Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is
    no longer needed after ->sk_data_ready() so we can just drop it.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com

    Cong Wang
     
  • Currently, we compute ->data_end with a compile-time constant
    offset of skb. But as Jakub pointed out, we can actually compute
    it in eBPF JIT code at run-time, so that we can competely get
    rid of ->data_end. This is similar to skb_shinfo(skb) computation
    in bpf_convert_shinfo_access().

    Suggested-by: Jakub Sitnicki
    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com

    Cong Wang
     
  • struct sk_psock_parser is embedded in sk_psock, it is
    unnecessary as skb verdict also uses ->saved_data_ready.
    We can simply fold these fields into sk_psock, and get rid
    of ->enabled.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-3-xiyou.wangcong@gmail.com

    Cong Wang
     
  • As suggested by John, clean up sockmap related Kconfigs:

    Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
    parser, to reflect its name.

    Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
    and CONFIG_INET, the latter is still needed at this point because
    of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
    as it is used by non-sockmap cases.

    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Lorenz Bauer
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com

    Cong Wang
     

28 Jan, 2021

1 commit

  • sk_psock_destroy() is a RCU callback, I can't see any reason why
    it could be used outside.

    Signed-off-by: Cong Wang
    Signed-off-by: Daniel Borkmann
    Cc: John Fastabend
    Cc: Jakub Sitnicki
    Cc: Lorenz Bauer
    Link: https://lore.kernel.org/bpf/20210127221501.46866-1-xiyou.wangcong@gmail.com

    Cong Wang
     

18 Nov, 2020

3 commits

  • When skb has a frag_list its possible for skb_to_sgvec() to fail. This
    happens when the scatterlist has fewer elements to store pages than would
    be needed for the initial skb plus any of its frags.

    This case appears rare, but is possible when running an RX parser/verdict
    programs exposed to the internet. Currently, when this happens we throw
    an error, break the pipe, and kfree the msg. This effectively breaks the
    application or forces it to do a retry.

    Lets catch this case and handle it by doing an skb_linearize() on any
    skb we receive with frags. At this point skb_to_sgvec should not fail
    because the failing conditions would require frags to be in place.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370

    John Fastabend
     
  • If the skb_verdict_prog redirects an skb knowingly to itself, fix your
    BPF program this is not optimal and an abuse of the API please use
    SK_PASS. That said there may be cases, such as socket load balancing,
    where picking the socket is hashed based or otherwise picks the same
    socket it was received on in some rare cases. If this happens we don't
    want to confuse userspace giving them an EAGAIN error if we can avoid
    it.

    To avoid double accounting in these cases. At the moment even if the
    skb has already been charged against the sockets rcvbuf and forward
    alloc we check it again and do set_owner_r() causing it to be orphaned
    and recharged. For one this is useless work, but more importantly we
    can have a case where the skb could be put on the ingress queue, but
    because we are under memory pressure we return EAGAIN. The trouble
    here is the skb has already been accounted for so any rcvbuf checks
    include the memory associated with the packet already. This rolls
    up and can result in unnecessary EAGAIN errors in userspace read()
    calls.

    Fix by doing an unlikely check and skipping checks if skb->sk == sk.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370

    John Fastabend
     
  • If a socket redirects to itself and it is under memory pressure it is
    possible to get a socket stuck so that recv() returns EAGAIN and the
    socket can not advance for some time. This happens because when
    redirecting a skb to the same socket we received the skb on we first
    check if it is OK to enqueue the skb on the receiving socket by checking
    memory limits. But, if the skb is itself the object holding the memory
    needed to enqueue the skb we will keep retrying from kernel side
    and always fail with EAGAIN. Then userspace will get a recv() EAGAIN
    error if there are no skbs in the psock ingress queue. This will continue
    until either some skbs get kfree'd causing the memory pressure to
    reduce far enough that we can enqueue the pending packet or the
    socket is destroyed. In some cases its possible to get a socket
    stuck for a noticeable amount of time if the socket is only receiving
    skbs from sk_skb verdict programs. To reproduce I make the socket
    memory limits ridiculously low so sockets are always under memory
    pressure. More often though if under memory pressure it looks like
    a spurious EAGAIN error on user space side causing userspace to retry
    and typically enough has moved on the memory side that it works.

    To fix skip memory checks and skb_orphan if receiving on the same
    sock as already assigned.

    For SK_PASS cases this is easy, its always the same socket so we
    can just omit the orphan/set_owner pair.

    For backlog cases we need to check skb->sk and decide if the orphan
    and set_owner pair are needed.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370

    John Fastabend