Eric Lee / smarc-fsl-linux-kernel

18 Nov, 2020

5 commits

4363023d2 bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list ... Browse Code »

When skb has a frag_list its possible for skb_to_sgvec() to fail. This
happens when the scatterlist has fewer elements to store pages than would
be needed for the initial skb plus any of its frags.

This case appears rare, but is possible when running an RX parser/verdict
programs exposed to the internet. Currently, when this happens we throw
an error, break the pipe, and kfree the msg. This effectively breaks the
application or forces it to do a retry.

Lets catch this case and handle it by doing an skb_linearize() on any
skb we receive with frags. At this point skb_to_sgvec should not fail
because the failing conditions would require frags to be in place.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Reviewed-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370

John Fastabend
2020-11-18 07:14:04 +0800
2443ca666 bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self ... Browse Code »

If the skb_verdict_prog redirects an skb knowingly to itself, fix your
BPF program this is not optimal and an abuse of the API please use
SK_PASS. That said there may be cases, such as socket load balancing,
where picking the socket is hashed based or otherwise picks the same
socket it was received on in some rare cases. If this happens we don't
want to confuse userspace giving them an EAGAIN error if we can avoid
it.

To avoid double accounting in these cases. At the moment even if the
skb has already been charged against the sockets rcvbuf and forward
alloc we check it again and do set_owner_r() causing it to be orphaned
and recharged. For one this is useless work, but more importantly we
can have a case where the skb could be put on the ingress queue, but
because we are under memory pressure we return EAGAIN. The trouble
here is the skb has already been accounted for so any rcvbuf checks
include the memory associated with the packet already. This rolls
up and can result in unnecessary EAGAIN errors in userspace read()
calls.

Fix by doing an unlikely check and skipping checks if skb->sk == sk.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Reviewed-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370

John Fastabend
2020-11-18 07:13:27 +0800
6fa9201a8 bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self ... Browse Code »

If a socket redirects to itself and it is under memory pressure it is
possible to get a socket stuck so that recv() returns EAGAIN and the
socket can not advance for some time. This happens because when
redirecting a skb to the same socket we received the skb on we first
check if it is OK to enqueue the skb on the receiving socket by checking
memory limits. But, if the skb is itself the object holding the memory
needed to enqueue the skb we will keep retrying from kernel side
and always fail with EAGAIN. Then userspace will get a recv() EAGAIN
error if there are no skbs in the psock ingress queue. This will continue
until either some skbs get kfree'd causing the memory pressure to
reduce far enough that we can enqueue the pending packet or the
socket is destroyed. In some cases its possible to get a socket
stuck for a noticeable amount of time if the socket is only receiving
skbs from sk_skb verdict programs. To reproduce I make the socket
memory limits ridiculously low so sockets are always under memory
pressure. More often though if under memory pressure it looks like
a spurious EAGAIN error on user space side causing userspace to retry
and typically enough has moved on the memory side that it works.

To fix skip memory checks and skb_orphan if receiving on the same
sock as already assigned.

For SK_PASS cases this is easy, its always the same socket so we
can just omit the orphan/set_owner pair.

For backlog cases we need to check skb->sk and decide if the orphan
and set_owner pair are needed.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Reviewed-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370

John Fastabend
2020-11-18 07:12:41 +0800
70796fb75 bpf, sockmap: Use truesize with sk_rmem_schedule() ... Browse Code »

We use skb->size with sk_rmem_scheduled() which is not correct. Instead
use truesize to align with socket and tcp stack usage of sk_rmem_schedule.

Suggested-by: Daniel Borkman
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Reviewed-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370

John Fastabend
2020-11-18 07:12:37 +0800
36cd0e696 bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect ... Browse Code »

Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.

We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Reviewed-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370

John Fastabend
2020-11-18 07:12:34 +0800

12 Oct, 2020

7 commits

ef5659280 bpf, sockmap: Allow skipping sk_skb parser program ... Browse Code »

Currently, we often run with a nop parser namely one that just does
this, 'return skb->len'. This happens when either our verdict program
can handle streaming data or it is only looking at socket data such
as IP addresses and other metadata associated with the flow. The second
case is common for a L3/L4 proxy for instance.

So lets allow loading programs without the parser then we can skip
the stream parser logic and avoid having to add a BPF program that
is effectively a nop.

Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:09:44 +0800
0b17ad25d bpf, sockmap: Add memory accounting so skbs on ingress lists are visible ... Browse Code »

Move skb->sk assignment out of sk_psock_bpf_run() and into individual
callers. Then we can use proper skb_set_owner_r() call to assign a
sk to a skb. This improves things by also charging the truesize against
the sockets sk_rmem_alloc counter. With this done we get some accounting
in place to ensure the memory associated with skbs on the workqueue are
still being accounted for somewhere. Finally, by using skb_set_owner_r
the destructor is setup so we can just let the normal skb_kfree logic
recover the memory. Combined with previous patch dropping skb_orphan()
we now can recover from memory pressure and maintain accounting.

Note, we will charge the skbs against their originating socket even
if being redirected into another socket. Once the skb completes the
redirect op the kfree_skb will give the memory back. This is important
because if we charged the socket we are redirecting to (like it was
done before this series) the sock_writeable() test could fail because
of the skb trying to be sent is already charged against the socket.

Also TLS case is special. Here we wait until we have decided not to
simply PASS the packet up the stack. In the case where we PASS the
packet up the stack we already have an skb which is accounted for on
the TLS socket context.

For the parser case we continue to just set/clear skb->sk this is
because the skb being used here may be combined with other skbs or
turned into multiple skbs depending on the parser logic. For example
the parser could request a payload length greater than skb->len so
that the strparser needs to collect multiple skbs. At any rate
the final result will be handled in the strparser recv callback.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800
10d58d006 bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup ... Browse Code »

Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
is from a skb_clone() in __strp_recv. So it never has a destructor or a
sk assigned. Plus its confusing to read because it might hint to the reader
that the skb could have an sk assigned which is not true. Even if we did
have an sk assigned it would be cleaner to simply wait for the upcoming
kfree_skb().

Additionally, move the comment about strparser clone up so its closer to
the logic it is describing and add to it so that it is more complete.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800
9047f19e7 bpf, sockmap: Remove dropped data on errors in redirect case ... Browse Code »

In the sk_skb redirect case we didn't handle the case where we overrun
the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
Because we didn't have anything implemented we simply dropped the skb.
This meant data could be dropped if socket memory accounting was in
place.

This fixes the above dropped data case by moving the memory checks
later in the code where we actually do the send or recv. This pushes
those checks into the workqueue and allows us to return an EAGAIN error
which in turn allows us to try again later from the workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800
29545f497 bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpage ... Browse Code »

The skb_set_owner_w is unnecessary here. The sendpage call will create a
fresh skb and set the owner correctly from workqueue. Its also not entirely
harmless because it consumes cycles, but also impacts resource accounting
by increasing sk_wmem_alloc. This is charging the socket we are going to
send to for the skb, but we will put it on the workqueue for some time
before this happens so we are artifically inflating sk_wmem_alloc for
this period. Further, we don't know how many skbs will be used to send the
packet or how it will be broken up when sent over the new socket so
charging it with one big sum is also not correct when the workqueue may
break it up if facing memory pressure. Seeing we don't know how/when
this is going to be sent drop the early accounting.

A later patch will do proper accounting charged on receive socket for
the case where skbs get enqueued on the workqueue.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800
9ecbfb06a bpf, sockmap: On receive programs try to fast track SK_PASS ingress ... Browse Code »

When we receive an skb and the ingress skb verdict program returns
SK_PASS we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then finally telling userspace with data_ready hook.

Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly without
bouncing through a workqueue. Its a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case unless there is an ENOMEM error or we overrun the socket we
can avoid the workqueue completely only using it when we fall back
to error cases caused by memory pressure.

By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800
cfea28f89 bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits ... Browse Code »

For sk_skb case where skb_verdict program returns SK_PASS to continue to
pass packet up the stack, the memory limits were already checked before
enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks
here. The theory is if the TCP stack believes we have memory to receive
the packet then lets trust the stack and not double check the limits.

In fact the accounting here can cause a drop if sk_rmem_alloc has increased
after the stack accepted this packet, but before the duplicate check here.
And worse if this happens because TCP stack already believes the data has
been received there is no retransmit.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower

John Fastabend
2020-10-12 09:00:57 +0800

05 Sep, 2020

1 commit

44a8c4f33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski

Jakub Kicinski
2020-09-05 12:28:59 +0800

24 Aug, 2020

1 commit

df561f668 treewide: Use fallthrough pseudo-keyword ... Browse Code »

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva

Gustavo A. R. Silva
2020-08-24 06:36:59 +0800

22 Aug, 2020

1 commit

7b219da43 net: sk_msg: Simplify sk_psock initialization ... Browse Code »

Initializing psock->sk_proto and other saved callbacks is only
done in sk_psock_update_proto, after sk_psock_init has returned.
The logic for this is difficult to follow, and needlessly complex.

Instead, initialize psock->sk_proto whenever we allocate a new
psock. Additionally, assert the following invariants:

* The SK has no ULP: ULP does it's own finagling of sk->sk_prot
* sk_user_data is unused: we need it to store sk_psock

Protect our access to sk_user_data with sk_callback_lock, which
is what other users like reuseport arrays, etc. do.

The result is that an sk_psock is always fully initialized, and
that psock->sk_proto is always the "original" struct proto.
The latter allows us to use psock->sk_proto when initializing
IPv6 TCP / UDP callbacks for sockmap.

Signed-off-by: Lorenz Bauer
Signed-off-by: Alexei Starovoitov
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com

Lorenz Bauer
2020-08-22 06:16:11 +0800

28 Jun, 2020

2 commits

8025751d4 bpf, sockmap: RCU dereferenced psock may be used outside RCU block ... Browse Code »

If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.

But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.

There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.

This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.

Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Acked-by: Martin KaFai Lau
Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370

John Fastabend
2020-06-28 23:33:28 +0800
93dd5f185 bpf, sockmap: RCU splat with redirect and strparser error or TLS ... Browse Code »

There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.

However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,

static void sk_psock_strp_data_ready(struct sock *sk)
{
struct sk_psock *psock;

rcu_read_lock();
psock = sk_psock(sk);
if (likely(psock)) {
if (tls_sw_has_ctx_rx(sk)) {
psock->parser.saved_data_ready(sk);
} else {
write_lock_bh(&sk->sk_callback_lock);
strp_data_ready(&psock->parser.strp);
write_unlock_bh(&sk->sk_callback_lock);
}
}
rcu_read_unlock();
}

If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)

To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.

[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191] dump_stack+0x8f/0xd0
[ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]

v2: Improve commit message to identify non-TLS redirect plus error case
condition as well as more common TLS case. In the process I decided
doing the rcu_read_unlock followed by the lock/unlock inside branches
was unnecessarily complex. We can just extend the current rcu block
and get the same effeective without the shuffling and branching.
Thanks Martin!

Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Reported-by: Jakub Sitnicki
Reported-by: kernel test robot
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Acked-by: Martin KaFai Lau
Acked-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370

John Fastabend
2020-06-28 23:33:28 +0800

02 Jun, 2020

2 commits

e91de6afa bpf: Fix running sk_skb program types with ktls ... Browse Code »

KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.

The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.

At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.

We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.

Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.

Discovered while testing OpenSSL 3.0 Alpha2.0 release.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:48:32 +0800
ca2f5f21d bpf: Refactor sockmap redirect code so its easy to reuse ... Browse Code »

We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.

No logic changes are intended.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Reviewed-by: Jakub Sitnicki
Acked-by: Song Liu
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:48:32 +0800

25 Feb, 2020

1 commit

3d9f773cf bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. ... Browse Code »

All of these cases are strictly of the form:

preempt_disable();
BPF_PROG_RUN(...);
preempt_enable();

Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:

migrate_disable();
BPF_PROG_RUN(...);
migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

The seccomp loop does not need protection over the loop. It only needs
protection per BPF filter program

[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

Signed-off-by: David S. Miller
Signed-off-by: Thomas Gleixner
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de

David Miller
2020-02-25 08:20:09 +0800

22 Feb, 2020

1 commit

f1ff5ce2c net, sk_msg: Clear sk_user_data pointer on clone if tagged ... Browse Code »

sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki
Signed-off-by: Daniel Borkmann
Acked-by: John Fastabend
Acked-by: Martin KaFai Lau
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com

Jakub Sitnicki
2020-02-22 05:29:45 +0800

23 Jan, 2020

1 commit

58c8db929 net, sk_msg: Don't check if sock is locked when tearing down psock ... Browse Code »

As John Fastabend reports [0], psock state tear-down can happen on receive
path *after* unlocking the socket, if the only other psock user, that is
sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
does so:

tcp_bpf_recvmsg()
psock = sk_psock_get(sk)
Signed-off-by: Jakub Sitnicki
Acked-by: John Fastabend
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Jakub Sitnicki
2020-01-23 03:30:20 +0800

16 Jan, 2020

1 commit

7e81a3530 bpf: Sockmap, ensure sock lock held during tear down ... Browse Code »

The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.

Currently, though we don't take the sock_lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
while they are also being called from the networking side. This is not
safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
believed it was safe. Further its not clear to me its even a good idea
to try and do this on "live" sockets while networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides lets realize that every use case I'm aware of rarely
deletes maps, in fact kubernetes/Cilium case builds map at init and
never tears it down except on errors. So lets do the simple fix and
grab sock lock.

This patch wraps sock deletes from maps in sock lock and adds some
annotations so we catch any other cases easier.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Acked-by: Song Liu
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com

John Fastabend
2020-01-16 06:26:13 +0800

29 Nov, 2019

1 commit

031097d9e net: skmsg: fix TLS 1.3 crash with full sk_msg ... Browse Code »

TLS 1.3 started using the entry at the end of the SG array
for chaining-in the single byte content type entry. This mostly
works:

[ E E E E E E . . ]
^ ^
start end

E < content type
/
[ E E E E E E C . ]
^ ^
start end

(Where E denotes a populated SG entry; C denotes a chaining entry.)

If the array is full, however, the end will point to the start:

[ E E E E E E E E ]
^
start
end

And we end up overwriting the start:

E < content type
/
[ C E E E E E E E ]
^
start
end

The sg array is supposed to be a circular buffer with start and
end markers pointing anywhere. In case where start > end
(i.e. the circular buffer has "wrapped") there is an extra entry
reserved at the end to chain the two halves together.

[ E E E E E E . . l ]

(Where l is the reserved entry for "looping" back to front.

As suggested by John, let's reserve another entry for chaining
SG entries after the main circular buffer. Note that this entry
has to be pointed to by the end entry so its position is not fixed.

Examples of full messages:

[ E E E E E E E E . l ]
^ ^
start end

Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Jakub Kicinski
2019-11-29 14:40:29 +0800

22 Nov, 2019

1 commit

8163999db bpf: skmsg, fix potential psock NULL pointer dereference ... Browse Code »

Report from Dan Carpenter,

net/core/skmsg.c:792 sk_psock_write_space()
error: we previously assumed 'psock' could be null (see line 790)

net/core/skmsg.c
789 psock = sk_psock(sk);
790 if (likely(psock && sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)))
Check for NULL
791 schedule_work(&psock->work);
792 write_space = psock->saved_write_space;
^^^^^^^^^^^^^^^^^^^^^^^^
793 rcu_read_unlock();
794 write_space(sk);

Ensure psock dereference on line 792 only occurs if psock is not null.

Reported-by: Dan Carpenter
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2019-11-22 05:43:45 +0800

06 Nov, 2019

1 commit

683916f6a net/tls: fix sk_msg trim on fallback to copy mode ... Browse Code »

sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg->sg is a ring buffer.

This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.

Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.

This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.

Big thanks to Alexander Potapenko for the non-KMSAN reproducer.

v2:
- take into account that msg->sg is a ring buffer (John).

Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1)

Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com
Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com
Co-developed-by: John Fastabend
Signed-off-by: Jakub Kicinski
Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

Jakub Kicinski
2019-11-06 10:07:47 +0800

25 Aug, 2019

1 commit

dd016aca2 net/core/skmsg: Delete an unnecessary check before the function call “consume_skb” ... Browse Code »

The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring
Acked-by: Song Liu
Signed-off-by: David S. Miller

Markus Elfring
2019-08-25 07:24:53 +0800

22 Jul, 2019

1 commit

95fa14547 bpf: sockmap/tls, close can race with map free ... Browse Code »

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

v4:
- make sure we don't free things for device;
- remove the checks which swap the callbacks back
only if TLS is at the top.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend
Signed-off-by: Jakub Kicinski
Reviewed-by: Dirk van der Merwe
Signed-off-by: Daniel Borkmann

John Fastabend
2019-07-22 22:04:17 +0800

14 May, 2019

2 commits

cabede8b4 bpf: sockmap fix msg->sg.size account on ingress skb ... Browse Code »

When converting a skb to msg->sg we forget to set the size after the
latest ktls/tls code conversion. This patch can be reached by doing
a redir into ingress path from BPF skb sock recv hook. Then trying to
read the size fails.

Fix this by setting the size.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann

John Fastabend
2019-05-14 07:31:43 +0800
014894360 bpf: sockmap, only stop/flush strp if it was enabled at some point ... Browse Code »

If we try to call strp_done on a parser that has never been
initialized, because the sockmap user is only using TX side for
example we get the following error.

[ 883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0
...
[ 883.422095] Workqueue: events sk_psock_destroy_deferred
[ 883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0

This had been wrapped in a 'if (psock->parser.enabled)' logic which
was broken because the strp_done() was never actually being called
because we do a strp_stop() earlier in the tear down logic will
set parser.enabled to false. This could result in a use after free
if work was still in the queue and was resolved by the patch here,
1d79895aef18f ("sk_msg: Always cancel strp work before freeing the
psock"). However, calling strp_stop(), done by the patch marked in
the fixes tag, only is useful if we never initialized a strp parser
program and never initialized the strp to start with. Because if
we had initialized a stream parser strp_stop() would have been called
by sk_psock_drop() earlier in the tear down process. By forcing the
strp to stop we get past the WARNING in strp_done that checks
the stopped flag but calling cancel_work_sync on work that has never
been initialized is also wrong and generates the warning above.

To fix check if the parser program exists. If the program exists
then the strp work has been initialized and must be sync'd and
cancelled before free'ing any structures. If no program exists we
never initialized the stream parser in the first place so skip the
sync/cancel logic implemented by strp_done.

Finally, remove the strp_done its not needed and in the case where we
are using the stream parser has already been called.

Fixes: e8e3437762ad9 ("bpf: Stop the psock parser before canceling its work")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann

John Fastabend
2019-05-14 07:31:31 +0800

07 Mar, 2019

1 commit

e8e343776 bpf: Stop the psock parser before canceling its work ... Browse Code »

We might have never enabled (started) the psock's parser, in which case it
will not get stopped when destroying the psock. This leads to a warning
when trying to cancel parser's work from psock's deferred destructor:

[ 405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40
[ 405.326712] Modules linked in: [last unloaded: test_bpf]
[ 405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G W 5.0.0 #42
[ 405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 405.329712] Workqueue: events sk_psock_destroy_deferred
[ 405.330254] RIP: 0010:strp_done+0x3c/0x40
[ 405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b
[ 405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246
[ 405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8
[ 405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640
[ 405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665
[ 405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600
[ 405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8
[ 405.338777] FS: 0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
[ 405.339903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0
[ 405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 405.344415] Call Trace:
[ 405.344821] sk_psock_destroy_deferred+0x23/0x1b0
[ 405.345585] process_one_work+0x1ae/0x3e0
[ 405.346110] worker_thread+0x3c/0x3b0
[ 405.346576] ? pwq_unbound_release_workfn+0xd0/0xd0
[ 405.347187] kthread+0x11d/0x140
[ 405.347601] ? __kthread_parkme+0x80/0x80
[ 405.348108] ret_from_fork+0x35/0x40
[ 405.348566] ---[ end trace a4a3af4026a327d4 ]---

Stop psock's parser just before canceling its work.

Fixes: 1d79895aef18 ("sk_msg: Always cancel strp work before freeing the psock")
Reported-by: kernel test robot
Signed-off-by: Jakub Sitnicki
Signed-off-by: Daniel Borkmann

Jakub Sitnicki
2019-03-07 22:16:20 +0800

09 Feb, 2019

1 commit

a655fe9f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller

David S. Miller
2019-02-09 07:00:17 +0800

29 Jan, 2019

1 commit

1d79895ae sk_msg: Always cancel strp work before freeing the psock ... Browse Code »

Despite having stopped the parser, we still need to deinitialize it
by calling strp_done so that it cancels its work. Otherwise the worker
thread can run after we have freed the parser, and attempt to access
its workqueue resulting in a use-after-free:

==================================================================
BUG: KASAN: use-after-free in pwq_activate_delayed_work+0x1b/0x1d0
Read of size 8 at addr ffff888069975240 by task kworker/u2:2/93

CPU: 0 PID: 93 Comm: kworker/u2:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
Workqueue: (null) (kstrp)
Call Trace:
print_address_description+0x6e/0x2b0
? pwq_activate_delayed_work+0x1b/0x1d0
kasan_report+0xfd/0x177
? pwq_activate_delayed_work+0x1b/0x1d0
? pwq_activate_delayed_work+0x1b/0x1d0
pwq_activate_delayed_work+0x1b/0x1d0
? process_one_work+0x4aa/0x660
pwq_dec_nr_in_flight+0x9b/0x100
worker_thread+0x82/0x680
? process_one_work+0x660/0x660
kthread+0x1b9/0x1e0
? __kthread_create_on_node+0x250/0x250
ret_from_fork+0x1f/0x30

Allocated by task 111:
sk_psock_init+0x3c/0x1b0
sock_map_link.isra.2+0x103/0x4b0
sock_map_update_common+0x94/0x270
sock_map_update_elem+0x145/0x160
__se_sys_bpf+0x152e/0x1e10
do_syscall_64+0xb2/0x3e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 112:
kfree+0x7f/0x140
process_one_work+0x40b/0x660
worker_thread+0x82/0x680
kthread+0x1b9/0x1e0
ret_from_fork+0x1f/0x30

The buggy address belongs to the object at ffff888069975180
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 192 bytes inside of
512-byte region [ffff888069975180, ffff888069975380)
The buggy address belongs to the page:
page:ffffea0001a65d00 count:1 mapcount:0 mapping:ffff88806d401280 index:0x0 compound_mapcount: 0
flags: 0x4000000000010200(slab|head)
raw: 4000000000010200 dead000000000100 dead000000000200 ffff88806d401280
raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888069975100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888069975180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888069975200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888069975280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888069975300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Reported-by: Marek Majkowski
Signed-off-by: Jakub Sitnicki
Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
Acked-by: Song Liu
Signed-off-by: Daniel Borkmann

Jakub Sitnicki
2019-01-29 07:05:03 +0800

18 Jan, 2019

1 commit

fda497e5f Optimize sk_msg_clone() by data merge to end dst sg entry ... Browse Code »

Function sk_msg_clone has been modified to merge the data from source sg
entry to destination sg entry if the cloned data resides in same page
and is contiguous to the end entry of destination sk_msg. This improves
kernel tls throughput to the tune of 10%.

When the user space tls application calls sendmsg() with MSG_MORE, it leads
to calling sk_msg_clone() with new data being cloned placed continuous to
previously cloned data. Without this optimization, a new SG entry in
the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
even before a full 16K of allowable record data is accumulated. Hence we
lose oppurtunity to encrypt and send a full 16K record.

With this patch, the kernel tls can accumulate full 16K of record data
irrespective of the size of data passed in sendmsg() with MSG_MORE.

Signed-off-by: Vakul Garg
Signed-off-by: David S. Miller

Vakul Garg
2019-01-18 03:42:26 +0800

28 Dec, 2018

1 commit

e0c38a4d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) New ipset extensions for matching on destination MAC addresses, from
Stefano Brivio.

2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
nfp driver. From Stefano Brivio.

3) Implement GRO for plain UDP sockets, from Paolo Abeni.

4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
bit so that we could support the entire vlan_tci value.

5) Rework the IPSEC policy lookups to better optimize more usecases,
from Florian Westphal.

6) Infrastructure changes eliminating direct manipulation of SKB lists
wherever possible, and to always use the appropriate SKB list
helpers. This work is still ongoing...

7) Lots of PHY driver and state machine improvements and
simplifications, from Heiner Kallweit.

8) Various TSO deferral refinements, from Eric Dumazet.

9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.

10) Batch dropping of XDP packets in tuntap, from Jason Wang.

11) Lots of cleanups and improvements to the r8169 driver from Heiner
Kallweit, including support for ->xmit_more. This driver has been
getting some much needed love since he started working on it.

12) Lots of new forwarding selftests from Petr Machata.

13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.

14) Packed ring support for virtio, from Tiwei Bie.

15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.

16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.

17) Implement coalescing on TCP backlog queue, from Eric Dumazet.

18) Implement carrier change in tun driver, from Nicolas Dichtel.

19) Support msg_zerocopy in UDP, from Willem de Bruijn.

20) Significantly improve garbage collection of neighbor objects when
the table has many PERMANENT entries, from David Ahern.

21) Remove egdev usage from nfp and mlx5, and remove the facility
completely from the tree as it no longer has any users. From Oz
Shlomo and others.

22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
therefore abort the operation before the commit phase (which is the
NETDEV_CHANGEADDR event). From Petr Machata.

23) Add indirect call wrappers to avoid retpoline overhead, and use them
in the GRO code paths. From Paolo Abeni.

24) Add support for netlink FDB get operations, from Roopa Prabhu.

25) Support bloom filter in mlxsw driver, from Nir Dotan.

26) Add SKB extension infrastructure. This consolidates the handling of
the auxiliary SKB data used by IPSEC and bridge netfilter, and is
designed to support the needs to MPTCP which could be integrated in
the future.

27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
net: dccp: fix kernel crash on module load
drivers/net: appletalk/cops: remove redundant if statement and mask
bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
net/net_namespace: Check the return value of register_pernet_subsys()
net/netlink_compat: Fix a missing check of nla_parse_nested
ieee802154: lowpan_header_create check must check daddr
net/mlx4_core: drop useless LIST_HEAD
mlxsw: spectrum: drop useless LIST_HEAD
net/mlx5e: drop useless LIST_HEAD
iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
net/mlx5e: fix semicolon.cocci warnings
staging: octeon: fix build failure with XFRM enabled
net: Revert recent Spectre-v1 patches.
can: af_can: Fix Spectre v1 vulnerability
packet: validate address length if non-zero
nfc: af_nfc: Fix Spectre v1 vulnerability
phonet: af_phonet: Fix Spectre v1 vulnerability
net: core: Fix Spectre v1 vulnerability
net: minor cleanup in skb_ext_add()
net: drop the unused helper skb_ext_get()
...

Linus Torvalds
2018-12-28 05:04:52 +0800

27 Dec, 2018

1 commit

792bf4d87 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull RCU updates from Ingo Molnar:
"The biggest RCU changes in this cycle were:

- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

- Replace calls of RCU-bh and RCU-sched update-side functions to
their vanilla RCU counterparts. This series is a step towards
complete removal of the RCU-bh and RCU-sched update-side functions.

( Note that some of these conversions are going upstream via their
respective maintainers. )

- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.

- Miscellaneous fixes.

- Automate generation of the initrd filesystem used for rcutorture
testing.

- Convert spin_is_locked() assertions to instead use lockdep.

( Note that some of these conversions are going upstream via their
respective maintainers. )

- SRCU updates, especially including a fix from Dennis Krein for a
bag-on-head-class bug.

- RCU torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
rcutorture: Don't do busted forward-progress testing
rcutorture: Use 100ms buckets for forward-progress callback histograms
rcutorture: Recover from OOM during forward-progress tests
rcutorture: Print forward-progress test age upon failure
rcutorture: Print time since GP end upon forward-progress failure
rcutorture: Print histogram of CB invocation at OOM time
rcutorture: Print GP age upon forward-progress failure
rcu: Print per-CPU callback counts for forward-progress failures
rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
rcutorture: Dump grace-period diagnostics upon forward-progress OOM
rcutorture: Prepare for asynchronous access to rcu_fwd_startat
torture: Remove unnecessary "ret" variables
rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
rcutorture: Break up too-long rcu_torture_fwd_prog() function
rcutorture: Remove cbflood facility
torture: Bring any extra CPUs online during kernel startup
rcutorture: Add call_rcu() flooding forward-progress tests
rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
...

Linus Torvalds
2018-12-27 05:07:19 +0800

22 Dec, 2018

2 commits

ce28bb445 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2018-12-22 07:06:20 +0800
5c1e7e94a Prevent overflow of sk_msg in sk_msg_clone() ... Browse Code »

Fixed function sk_msg_clone() to prevent overflow of 'dst' while adding
pages in scatterlist entries. The overflow of 'dst' causes crash in kernel
tls module while doing record encryption.

Crash fixed by this patch.

[ 78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[ 78.804900] Mem abort info:
[ 78.807683] ESR = 0x96000004
[ 78.810744] Exception class = DABT (current EL), IL = 32 bits
[ 78.816677] SET = 0, FnV = 0
[ 78.819727] EA = 0, S1PTW = 0
[ 78.822873] Data abort info:
[ 78.825759] ISV = 0, ISS = 0x00000004
[ 78.829600] CM = 0, WnR = 0
[ 78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
[ 78.839195] [0000000000000008] pgd=0000000000000000
[ 78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
[ 78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da63145-dirty #107
[ 78.874149] Hardware name: LS1043A RDB Board (DT)
[ 78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
[ 78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
[ 78.893366] sp : ffff00001d04b600
[ 78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
[ 78.901970] x27: 0000000000000000 x26: ffff80006c8de786
[ 78.907272] x25: ffff00001d04b760 x24: 000000000000001a
[ 78.912573] x23: 0000000000000006 x22: ffff80006814e440
[ 78.917874] x21: 0000000000000100 x20: 0000000000000000
[ 78.923175] x19: 000081ffffffffff x18: 0000000000000400
[ 78.928476] x17: 0000000000000008 x16: 0000000000000000
[ 78.933778] x15: 0000000000000100 x14: 0000000000000001
[ 78.939079] x13: 0000000000001080 x12: 0000000000000020
[ 78.944381] x11: 0000000000001080 x10: 00000000ffff0002
[ 78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
[ 78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
[ 78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
[ 78.965588] x3 : 0000000000000000 x2 : 0000000000001086
[ 78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
[ 78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
[ 78.982968] Call trace:
[ 78.985406] scatterwalk_copychunks+0x164/0x1c8
[ 78.989927] skcipher_walk_next+0x28c/0x448
[ 78.994099] skcipher_walk_done+0xfc/0x258
[ 78.998187] gcm_encrypt+0x434/0x4c0
[ 79.001758] tls_push_record+0x354/0xa58 [tls]
[ 79.006194] bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
[ 79.010978] tls_sw_sendmsg+0x650/0x780 [tls]
[ 79.015326] inet_sendmsg+0x2c/0xf8
[ 79.018806] sock_sendmsg+0x18/0x30
[ 79.022284] __sys_sendto+0x104/0x138
[ 79.025935] __arm64_sys_sendto+0x24/0x30
[ 79.029936] el0_svc_common+0x60/0xe8
[ 79.033588] el0_svc_handler+0x2c/0x80
[ 79.037327] el0_svc+0x8/0xc
[ 79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
[ 79.046283] ---[ end trace 74db007d069c1cf7 ]---

Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Signed-off-by: Vakul Garg
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Vakul Garg
2018-12-22 01:12:49 +0800

21 Dec, 2018

2 commits

a136678c0 bpf: sk_msg, zap ingress queue on psock down ... Browse Code »

In addition to releasing any cork'ed data on a psock when the psock
is removed we should also release any skb's in the ingress work queue.
Otherwise the skb's eventually get free'd but late in the tear
down process so we see the WARNING due to non-zero sk_forward_alloc.

void sk_stream_kill_queues(struct sock *sk)
{
...
WARN_ON(sk->sk_forward_alloc);
...
}

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann

John Fastabend
2018-12-21 06:47:09 +0800
552de9106 bpf: sk_msg, fix socket data_ready events ... Browse Code »

When a skb verdict program is in-use and either another BPF program
redirects to that socket or the new SK_PASS support is used the
data_ready callback does not wake up application. Instead because
the stream parser/verdict is using the sk data_ready callback we wake
up the stream parser/verdict block.

Fix this by adding a helper to check if the stream parser block is
enabled on the sk and if so call the saved pointer which is the
upper layers wake up function.

This fixes application stalls observed when an application is waiting
for data in a blocking read().

Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann

John Fastabend
2018-12-21 06:47:09 +0800