08 Feb, 2017

3 commits

  • The conflict was an interaction between a bug fix in the
    netvsc driver in 'net' and an optimization of the RX path
    in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.

    The datagram protocols can use MSG_CONFIRM to confirm the
    neighbour. When used with MSG_PROBE we do not reach the
    code where the neighbour is confirmed, so we have to do the
    same slow lookup by using the dst_confirm_neigh() helper.
    When MSG_PROBE is not used, ip_append_data/ip6_append_data
    will set the skb flag dst_pending_confirm.

    Reported-by: YueHaibing
    Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
    Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
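
    A minimal user-space sketch of the path this fix covers: probing a
    neighbour with MSG_CONFIRM | MSG_PROBE so that no payload is sent.
    This is illustrative only; the destination is a placeholder, and
    MSG_PROBE may need to be defined by hand since it comes from the
    kernel headers.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        #ifndef MSG_PROBE
        #define MSG_PROBE 0x10   /* linux/socket.h: probe path, send nothing */
        #endif

        int main(void)
        {
            struct sockaddr_in dst;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(9);              /* placeholder port */
            inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);

            /* MSG_PROBE sends no payload; MSG_CONFIRM asks the stack to
             * confirm the neighbour entry -- the combination fixed here. */
            if (sendto(fd, NULL, 0, MSG_CONFIRM | MSG_PROBE,
                       (const struct sockaddr *)&dst, sizeof(dst)) < 0)
                perror("sendto");
            close(fd);
            return 0;
        }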
     
  • Dmitry reported that UDP sockets being destroyed would trigger the
    WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct()

    It turns out we do not properly destroy skb(s) that have a wrong UDP
    checksum.

    Thanks again to syzkaller team.

    Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Hannes Frederic Sowa
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jan, 2017

1 commit

  • Packets arriving in a VRF are currently delivered to UDP sockets that
    aren't bound to any interface. TCP defaults to not delivering packets
    arriving in a VRF to unbound sockets. IP route lookup and socket
    transmit both assume that unbound means using the default table, and
    UDP applications that haven't been changed to be aware of VRFs may not
    function correctly in this case, since they may not be able to handle
    overlapping IP address ranges, or be able to send packets back to the
    original sender if required.

    So add a sysctl, udp_l3mdev_accept, to control this behaviour,
    analogous to the existing tcp_l3mdev_accept, namely to allow a
    process to have a VRF-global listen socket. Have this default to off
    as this is the behaviour that users will expect, given that there is
    no explicit mechanism to move unmodified VRF-unaware applications into
    a default VRF.

    Signed-off-by: Robert Shearman
    Acked-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
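
    A hedged sketch of the application side of this: a UDP socket made
    VRF-aware by binding it to a VRF device with SO_BINDTODEVICE (the
    device name "vrf-blue" is hypothetical; binding needs CAP_NET_RAW).
    With net.ipv4.udp_l3mdev_accept=1, an unbound socket would receive
    VRF traffic as well.

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        int main(void)
        {
            const char vrf[] = "vrf-blue";     /* hypothetical VRF device */
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            /* Scope the socket to the VRF's L3 domain. */
            if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                           vrf, sizeof(vrf)) < 0)
                perror("SO_BINDTODEVICE");

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(5000);       /* placeholder port */
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0)
                perror("bind");
            /* ... recvfrom() loop would go here ... */
            close(fd);
            return 0;
        }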
     

19 Jan, 2017

1 commit

  • We pass these per-protocol equal functions around in various places, but
    we can just have one function that checks sk->sk_family and then calls
    the right comparison function. I've also changed the ipv4 version to
    not cast to inet_sock since it is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
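
    A toy stand-in (not the kernel's types or signatures) for the single
    dispatch function the patch describes; the real helper is
    inet_rcv_saddr_equal(), but the fields below are simplified:

        #include <stdbool.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        /* Toy stand-in for struct sock, just enough for the dispatch. */
        struct toy_sock {
            unsigned short  sk_family;
            struct in_addr  saddr4;
            struct in6_addr saddr6;
        };

        static bool ipv4_saddr_equal(const struct toy_sock *a,
                                     const struct toy_sock *b)
        {
            /* no cast to an inet_sock equivalent needed, per the patch */
            return a->saddr4.s_addr == b->saddr4.s_addr;
        }

        static bool ipv6_saddr_equal(const struct toy_sock *a,
                                     const struct toy_sock *b)
        {
            return memcmp(&a->saddr6, &b->saddr6, sizeof(a->saddr6)) == 0;
        }

        /* One entry point that picks the comparison from sk_family. */
        bool rcv_saddr_equal(const struct toy_sock *a, const struct toy_sock *b)
        {
            if (a->sk_family == AF_INET6)
                return ipv6_saddr_equal(a, b);
            return ipv4_saddr_equal(a, b);
        }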
     

07 Jan, 2017

1 commit

  • UDP lib inuse checks will walk the entire hash bucket to check if the
    portaddr is in use. In the case of reuseport we can stop searching when
    we find a matching reuseport.

    On a 16-core VM a test program that spawns 16 threads that each bind to
    1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.

    Also add a cond_resched() when the port is not specified.

    Signed-off-by: Eric Garver
    Signed-off-by: David S. Miller

    Eric Garver
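
    The timing numbers refer to a bind-heavy workload; a minimal sketch
    of one such bind (hypothetical helper, error handling trimmed):

        #include <stdio.h>
        #include <string.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        static int bind_reuseport(unsigned short port)
        {
            struct sockaddr_in addr;
            int one = 1;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            if (fd < 0)
                return -1;
            /* All sockets sharing the port opt into SO_REUSEPORT, so the
             * inuse check can stop at the first reuseport match. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("bind");
                return -1;
            }
            return fd;
        }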
     

10 Dec, 2016

4 commits

  • In flood situations, keeping sk_rmem_alloc at a high value
    prevents producers from touching the socket.

    It makes sense to lower sk_rmem_alloc only at the end
    of udp_rmem_release(), after the thread draining the receive
    queue in udp_recvmsg() has finished its writes to sk_forward_alloc.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If udp_recvmsg() constantly releases sk_rmem_alloc
    for every packet read, it gives producers the opportunity
    to immediately grab spinlocks and desperately
    try adding another packet, causing false sharing.

    We can add a simple heuristic to give the signal
    in batches of ~25 % of the queue capacity.

    This patch considerably increases performance under
    flood, by about 50 %, since the thread draining the queue
    is no longer slowed by false sharing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
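
    A toy user-space model of the batching heuristic (not kernel code;
    in the kernel the deficit lives in udp_sock and the shared counter
    is sk_rmem_alloc):

        #include <stdatomic.h>

        struct rx_accounting {
            atomic_int rmem_alloc;   /* shared with producers (hot) */
            int        rcvbuf;       /* queue capacity in bytes */
            int        deficit;      /* consumer-private accumulator */
        };

        static void rmem_release_batched(struct rx_accounting *a, int size)
        {
            a->deficit += size;
            /* Defer the atomic write producers watch until ~25% of the
             * capacity has been drained, avoiding false sharing. */
            if (a->deficit < (a->rcvbuf >> 2))
                return;
            atomic_fetch_sub(&a->rmem_alloc, a->deficit);
            a->deficit = 0;
        }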
     
  • In the UDP RX handler, we currently clear skb->dev before the skb
    is added to the receive queue, because the device pointer is no longer
    available once we exit from the RCU section.

    Since this first cache line is always hot, let's reuse this space
    to store skb->truesize, and thus avoid a cache line miss at
    udp_recvmsg()/udp_skb_destructor time while the receive queue
    spinlock is held.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The idea of busylocks is to let producers grab an extra spinlock
    to relieve pressure on the receive_queue spinlock shared with the
    consumer.

    This behavior is engaged only once the socket receive queue is above
    half occupancy.

    Under flood, this means that only one producer can be in line
    trying to acquire the receive_queue spinlock.

    These busylocks can be allocated in a per-cpu manner, instead of
    per socket (which would consume a cache line per socket).

    This patch considerably improves UDP behavior under stress,
    depending on number of NIC RX queues and/or RPS spread.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
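
    A toy pthread model of the busylock pattern (the kernel uses a small
    hashed array of per-cpu spinlocks rather than one global mutex):

        #include <pthread.h>

        static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t busylock   = PTHREAD_MUTEX_INITIALIZER;

        void producer_enqueue(void (*do_enqueue)(void), int above_half)
        {
            /* Extra serialization only once the queue is above half
             * occupancy, mirroring the patch's trigger condition. */
            if (above_half)
                pthread_mutex_lock(&busylock);

            pthread_mutex_lock(&queue_lock);   /* the hot, shared lock */
            do_enqueue();                      /* append to receive queue */
            pthread_mutex_unlock(&queue_lock);

            if (above_half)
                pthread_mutex_unlock(&busylock);
        }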
     

09 Dec, 2016

1 commit

  • Under UDP flood, many softirq producers try to add packets to
    UDP receive queue, and one user thread is burning one cpu trying
    to dequeue packets as fast as possible.

    Two parts of the per-packet cost are:
    - copying payload from kernel space to user space,
    - freeing memory pieces associated with skb.

    If the socket is under pressure, softirq handler(s) can try to pull
    the packet's payload into skb->head if it fits.

    That way, the softirq handler(s) can free/reuse the page fragment
    immediately, instead of letting udp_recvmsg() do this hundreds of usec
    later, possibly from another node.

    Additional gains:
    - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
    - We avoid cache line misses at copyout() time and consume_skb() time,
    and avoid one put_page() with potential alien freeing on NUMA hosts.

    This comes at the cost of a copy, bounded to available tail room, which
    is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
    than necessary)

    This patch gave me about 5 % increase in throughput in my tests.

    The skb_condense() helper could probably be used in other contexts.

    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
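
    A toy user-space model of the idea behind skb_condense() (the real
    code is kernel-internal and operates on sk_buff page fragments):

        #include <stdlib.h>
        #include <string.h>

        struct toy_skb {
            unsigned char *head;      /* linear buffer */
            size_t len, cap;          /* bytes used / capacity of head */
            unsigned char *frag;      /* paged payload, may be NULL */
            size_t frag_len;
            size_t truesize;          /* memory charged to the socket */
        };

        static void toy_condense(struct toy_skb *skb)
        {
            if (!skb->frag || skb->frag_len > skb->cap - skb->len)
                return;                    /* does not fit, leave as is */
            /* Pull the fragment into the tail room... */
            memcpy(skb->head + skb->len, skb->frag, skb->frag_len);
            skb->len += skb->frag_len;
            /* ...so its memory can be freed right away (in softirq
             * context in the kernel), shrinking truesize. */
            skb->truesize -= skb->frag_len;
            free(skb->frag);
            skb->frag = NULL;
            skb->frag_len = 0;
        }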
     

04 Dec, 2016

1 commit

  • Before commit 850cbaddb52d ("udp: use it's own memory accounting
    schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
    the rcvbuf by the whole current packet's truesize. After said commit
    we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
    is empty. As reported by Jesper, this caused a performance regression
    for some (small) values of rcvbuf.

    This commit is intended to fix the regression by restoring the old
    handling of the rcvbuf limit.

    Reported-by: Jesper Dangaard Brouer
    Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

25 Nov, 2016

1 commit

  • In commits 93821778def10 ("udp: Fix rcv socket locking") and
    f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
    __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
    was forgotten.

    This leads to crashes if the UDPlite header is pulled twice, which
    happens starting from commit e6afc8ace6dd ("udp: remove headers from
    UDP packets before queueing").

    Bug found by syzkaller team, thanks a lot guys !

    Note that backlog use in UDP/UDPlite is scheduled to be removed starting
    from linux-4.10, so this patch is only needed up to linux-4.9

    Fixes: 93821778def1 ("udp: Fix rcv socket locking")
    Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Cc: Benjamin LaHaise
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2016

1 commit

  • All conflicts were simple overlapping changes except perhaps
    for the Thunder driver.

    That driver has a change_mtu method explicitly for sending
    a message to the hardware. If that fails it returns an
    error.

    Normally a driver doesn't need an ndo_change_mtu method because those
    are usually just range changes, which are now handled generically.
    But since this extra operation is needed in the Thunder driver, it has
    to stay.

    However, if the message send fails we have to restore the original
    MTU, because the entire call chain expects that if an error is thrown
    by ndo_change_mtu then the MTU did not change. Therefore code is added
    to nicvf_change_mtu to remember the original MTU, and to restore it
    upon nicvf_update_hw_max_frs() failure.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Nov, 2016

1 commit

  • UDP_SKB_CB(skb)->partial_cov is located at offset 66 in the skb,
    requiring a cold cache line to be read into the cpu cache.

    We can avoid this cache line miss for UDP sockets,
    as partial_cov has a meaning only for UDPLite.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Nov, 2016

1 commit

  • UDP busy polling is restricted to connected UDP sockets.

    This is because sk_busy_loop() only takes care of one NAPI context.

    There are cases where it could be extended.

    1) Some hosts receive traffic on a single NIC, with one RX queue.

    2) Some applications use SO_REUSEPORT and associated BPF filter
    to split the incoming traffic on one UDP socket per RX
    queue/thread/cpu

    3) Some UDP sockets are used to send/receive traffic for one flow, but
    they do not bother with connect()

    This patch records the napi_id of the first received skb, giving
    busy polling more reach.

    Tested:

    lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
    lpaa24:~# echo 70 >/proc/sys/net/core/busy_read

    lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done

    Before patch :
    27867 28870 37324 41060 41215
    36764 36838 44455 41282 43843
    After patch :
    73920 73213 70147 74845 71697
    68315 68028 75219 70082 73707

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
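
    The busy_read sysctl used in the test has a per-socket equivalent; a
    minimal sketch (SO_BUSY_POLL takes the busy-poll budget in usec):

        #include <stdio.h>
        #include <sys/socket.h>

        #ifndef SO_BUSY_POLL
        #define SO_BUSY_POLL 46      /* from asm-generic/socket.h */
        #endif

        int enable_busy_poll(int fd)
        {
            int usec = 70;           /* same value as the test above */

            if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                           &usec, sizeof(usec)) < 0) {
                perror("SO_BUSY_POLL");
                return -1;
            }
            return 0;
        }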
     

16 Nov, 2016

2 commits

  • Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(),
    otherwise udplite broadcast/multicast use the wrong table and it breaks.

    Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.")
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • The commit 850cbaddb52d ("udp: use it's own memory accounting schema")
    assumes that the socket proto has memory accounting enabled,
    but this is not the case for UDPLITE.
    Fix it by enabling memory accounting for UDPLITE and performing
    fwd allocated memory reclaiming on socket shutdown.
    UDP and UDPLITE now share the same memory accounting limits.
    Also drop the backlog receive operation, since it is no longer needed.

    Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
    Reported-by: Andrei Vagin
    Suggested-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

14 Nov, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains a second batch of Netfilter updates for
    your net-next tree. This includes a rework of the core hook
    infrastructure that improves Netfilter performance by ~15% according to
    synthetic benchmarks. Then, a large batch with ipset updates, including
    a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
    couple of assorted updates.

    Regarding the core hook infrastructure rework to improve performance,
    using this simple drop-all packets ruleset from ingress:

    nft add table netdev x
    nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
    nft add rule netdev x y drop

    And generating traffic through Jesper Brouer's
    samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i
    option. perf report shows nf_tables calls in its top 10:

    17.30% kpktgend_0 [nf_tables] [k] nft_do_chain
    15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
    10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev

    I'm measuring an improvement of ~15% in performance with this
    patchset here, i.e. an extra +2.5 Mpps. I have used my old laptop, an
    Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 4 cores.

    This rework contains more specifically, in strict order, these patches:

    1) Remove compile-time debugging from core.

    2) Remove obsolete comments that predate the rcu era. These days it is
    well known that a Netfilter hook always runs under rcu_read_lock().

    3) Remove threshold handling, as this is only used by br_netfilter.
    We already have specific code to handle this from br_netfilter,
    so remove this code from the core path.

    4) Deprecate NF_STOP, as this is only used by br_netfilter.

    5) Place the nf_hook_state pointer into the xt_action_param structure,
    so this structure fits into one single cacheline according to pahole.
    This also implicitly affects nftables, since it also relies on the
    xt_action_param structure.

    6) Move state->hook_entries into nf_queue entry. The hook_entries
    pointer is only required by nf_queue(), so we can store this in the
    queue entry instead.

    7) Use a switch() statement to handle verdict cases.

    8) Remove hook_entries field from nf_hook_state structure, this is only
    required by nf_queue, so store it in nf_queue_entry structure.

    9) Merge nf_iterate() into nf_hook_slow() that results in a much more
    simple and readable function.

    10) Handle NF_REPEAT away from the core. So far the only client is
    nf_conntrack_in(), and we can restart the packet processing using a
    simple goto to jump back there when TCP requires it.
    This update required a second pass to fix fallout, with a fix from
    Arnd Bergmann.

    11) Set random seed from nft_hash when no seed is specified from
    userspace.

    12) Simplify nf_tables expression registration, in a much smarter way
    to save lots of boilerplate code, by Liping Zhang.

    13) Simplify layer 4 protocol conntrack tracker registration, from
    Davide Caratti.

    14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
    to recent generalization of the socket infrastructure, from Arnd
    Bergmann.

    15) Then, the ipset batch from Jozsef, he describes it as it follows:

    * Cleanup: Remove extra whitespaces in ip_set.h
    * Cleanup: Mark some of the helpers arguments as const in ip_set.h
    * Cleanup: Group counter helper functions together in ip_set.h
    * struct ip_set_skbinfo is introduced instead of open coded fields
    in skbinfo get/init helper functions.
    * Use kmalloc() in comment extension helper instead of kzalloc()
    because it is unnecessary to zero out the area just before
    explicit initialization.
    * Cleanup: Split extensions into separate files.
    * Cleanup: Separate memsize calculation code into dedicated function.
    * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
    together.
    * Add element count to hash headers by Eric B Munson.
    * Add element count to all set types header for uniform output
    across all set types.
    * Count non-static extension memory into memsize calculation for
    userspace.
    * Cleanup: Remove redundant mtype_expire() arguments, because
    they can be derived from other parameters.
    * Cleanup: Simplify mtype_expire() for hash types by removing
    one level of indentation.
    * Make NLEN compile time constant for hash types.
    * Make sure element data size is a multiple of u32 for the hash set
    types.
    * Optimize hash creation routine, exit as early as possible.
    * Make struct htype per ipset family so nets array becomes fixed size
    and thus simplifies the struct htype allocation.
    * Collapse same condition body into a single one.
    * Fix reported memory size for hash:* types, base hash bucket structure
    was not taken into account.
    * hash:ipmac type support added to ipset by Tomasz Chilinski.
    * Use setup_timer() and mod_timer() instead of init_timer()
    by Muhammad Falak R Wani, individually for the set type families.

    16) Remove useless connlabel field in struct netns_ct, patch from
    Florian Westphal.

    17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
    {ip,ip6,arp}tables code that uses this.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Nov, 2016

1 commit

  • Since commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    the udp6_lib_lookup and udp4_lib_lookup functions are only
    provided when it is actually possible to call them.

    However, moving the callers now caused a link error:

    net/built-in.o: In function `nf_sk_lookup_slow_v6':
    (.text+0x131a39): undefined reference to `udp6_lib_lookup'
    net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
    nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'

    This extends the #ifdef so we also provide the functions when
    CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively,
    is set.

    Fixes: 8db4c5be88f6 ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     

08 Nov, 2016

2 commits

  • A new argument is added to __skb_recv_datagram to provide
    an explicit skb destructor, invoked under the receive queue
    lock.
    The UDP protocol uses this argument to perform memory
    reclaiming on dequeue, so that the UDP protocol no longer
    sets skb->destructor.
    Instead, explicit memory reclaiming is performed at close() time and
    when skbs are removed from the receive queue.
    In-kernel UDP protocol users now need to call a
    skb_recv_udp() variant instead of skb_recv_datagram() to
    properly perform memory accounting on dequeue.

    Overall, this allows the receive queue lock to be acquired
    only once per dequeue.

    Tested using pktgen with random src port, 64 bytes packet,
    wire-speed on a 10G link as sender and udp_sink as the receiver,
    using an l4 tuple rxhash to stress the contention, and one or more
    udp_sink instances with reuseport.

    nr sinks   vanilla   patched
           1       440       560
           3      2150      2300
           6      3650      3800
           9      4450      4600
          12      6250      6450

    v1 -> v2:
    - do rmem and allocated memory scheduling under the receive lock
    - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
    - avoid the typedef for the dequeue callback

    Suggested-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
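
    A toy model of the dequeue-side mechanism (not the kernel API): the
    destructor runs while the queue lock is still held, so unlink and
    memory release need a single lock acquisition.

        #include <pthread.h>
        #include <stddef.h>

        struct node  { struct node *next; int truesize; };
        struct queue {
            pthread_mutex_t lock;
            struct node *head;
            int rmem;                          /* charged receive memory */
        };

        typedef void (*dtor_t)(struct queue *q, struct node *n);

        /* intended use: dequeue(q, rmem_dtor) */
        static void rmem_dtor(struct queue *q, struct node *n)
        {
            q->rmem -= n->truesize;            /* runs under q->lock */
        }

        static struct node *dequeue(struct queue *q, dtor_t dtor)
        {
            struct node *n;

            pthread_mutex_lock(&q->lock);
            n = q->head;
            if (n) {
                q->head = n->next;
                if (dtor)
                    dtor(q, n);                /* accounting under the lock */
            }
            pthread_mutex_unlock(&q->lock);
            return n;
        }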
     
  • So that we can use it even after orphaning the skbuff.

    Suggested-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket (e.g., ping
    replies), use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespace by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

27 Oct, 2016

1 commit

  • The first bug was added in commit ad6f939ab193 ("ip: Add offset parameter
    to ip_cmsg_recv"): Tom missed that ipv4 udp messages can be received on
    an AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
    ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

    Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
    queueing") forgot to adjust the offsets now that UDP headers are pulled
    before skbs are put in the receive queue.

    Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Cc: Sam Kumar
    Cc: Willem de Bruijn
    Tested-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2016

2 commits

  • Completely avoid default sock memory accounting and replace it
    with udp-specific accounting.

    Since the new memory accounting model completely encapsulates
    the required locking, remove the socket lock on both enqueue and
    dequeue, and avoid using the backlog on enqueue.

    Be sure to clean up rx queue memory on socket destruction, using
    UDP's own sk_destruct.

    Tested using pktgen with random src port, 64 bytes packet,
    wire-speed on a 10G link as sender and udp_sink as the receiver,
    using an l4 tuple rxhash to stress the contention, and one or more
    udp_sink instances with reuseport.

    nr readers   Kpps (vanilla)   Kpps (patched)
             1              170              440
             3             1250             2150
             6             3000             3650
             9             4200             4450
            12             5700             6250

    v4 -> v5:
    - avoid unneeded test in first_packet_length

    v3 -> v4:
    - remove useless sk_rcvqueues_full() call

    v2 -> v3:
    - do not set the now unused backlog_rcv callback

    v1 -> v2:
    - add memory pressure support
    - fixed dropwatch accounting for ipv6

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Avoid using the generic helpers.
    Use the receive queue spin lock to protect the memory
    accounting operation, both on enqueue and on dequeue.

    On dequeue perform partial memory reclaiming, trying to
    leave a quantum of forward allocated memory.

    On enqueue use a custom helper, to allow some optimizations:
    - use a plain spin_lock() variant instead of the slightly
    costlier spin_lock_irqsave(),
    - avoid the dst_force check, since the calling code has already
    dropped the skb dst,
    - avoid orphaning the skb, since skb_steal_sock() already did
    the work for us.

    The above needs custom memory reclaiming on shutdown, provided
    by the udp_destruct_sock().

    v5 -> v6:
    - don't orphan the skb on enqueue

    v4 -> v5:
    - replace the mem_lock with the receive queue spin lock
    - ensure that the bh is always allowed to enqueue at least
    a skb, even if sk_rcvbuf is exceeded

    v3 -> v4:
    - reworked memory accounting, simplifying the schema
    - provide a helper for both memory scheduling and enqueuing

    v1 -> v2:
    - use a udp-specific destructor to perform memory reclaiming
    - remove a couple of helpers, unneeded after the above cleanup
    - do not reclaim memory on dequeue if not under memory
    pressure
    - reworked the fwd accounting schema to avoid potential
    integer overflow

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
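
    A toy model of the enqueue side (not kernel code): memory check,
    charge, and append happen under one queue lock, with no socket lock
    and no backlog.

        #include <pthread.h>
        #include <stddef.h>

        struct pkt { struct pkt *next; int truesize; };
        struct rxq {
            pthread_mutex_t lock;              /* the kernel uses a plain
                                                * spin_lock() here, cheaper
                                                * than spin_lock_irqsave() */
            struct pkt *head;
            int rmem;                          /* charged receive memory */
        };

        static int toy_enqueue(struct rxq *q, struct pkt *p, int rcvbuf)
        {
            int ret = 0;

            pthread_mutex_lock(&q->lock);
            if (q->rmem + p->truesize > rcvbuf) {
                ret = -1;                      /* over rcvbuf: drop */
            } else {
                q->rmem += p->truesize;        /* accounting under the lock */
                p->next = q->head;             /* toy list, LIFO for brevity */
                q->head = p;
            }
            pthread_mutex_unlock(&q->lock);
            return ret;
        }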
     

21 Oct, 2016

1 commit

  • Baozeng Ding reported KASAN traces showing uses after free in
    udp_lib_get_port() and other related UDP functions.

    A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

    I could write a reproducer with two threads doing :

    static int sock_fd;

    static void *thr1(void *arg)
    {
        for (;;) {
            connect(sock_fd, (const struct sockaddr *)arg,
                    sizeof(struct sockaddr_in));
        }
    }

    static void *thr2(void *arg)
    {
        struct sockaddr_in unspec;

        for (;;) {
            memset(&unspec, 0, sizeof(unspec));
            connect(sock_fd, (const struct sockaddr *)&unspec,
                    sizeof(unspec));
        }
    }

    Problem is that udp_disconnect() could run without holding socket lock,
    and this was causing list corruptions.

    Signed-off-by: Eric Dumazet
    Reported-by: Baozeng Ding
    Signed-off-by: David S. Miller

    Eric Dumazet
     


24 Aug, 2016

4 commits

  • Since we no longer use SLAB_DESTROY_BY_RCU for UDP,
    we do not need sk_prot_clear_portaddr_nulls() helper.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This implements SOCK_DESTROY for UDP sockets similar to what was done
    for TCP with commit c1e64e298b8ca ("net: diag: Support destroying TCP
    sockets.") A process with a UDP socket targeted for destroy is awakened
    and recvmsg fails with ECONNABORTED.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • After commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    we do not need this special allocation mode anymore, even if it is
    harmless.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Laura tracked a poll() [and friends] regression caused by commit
    e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

    udp_poll() needs to know if there is a valid packet in the receive
    queue, even if its payload length is 0.

    Change first_packet_length() to return a signed int, and use -1
    as the indication of an empty queue.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Laura Abbott
    Signed-off-by: Eric Dumazet
    Tested-by: Laura Abbott
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Aug, 2016

1 commit

  • Include ipv4_rcv_saddr_equal() definition to avoid this sparse error :

    net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not
    declared. Should it be static?

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jul, 2016

1 commit

  • After a612769774a3 ("udp: prevent bugcheck if filter truncates packet
    too much"), there followed various other fixes for similar cases such
    as f4979fcea7fd ("rose: limit sk_filter trim to payload").

    The latter introduced a new helper, sk_filter_trim_cap(), where we can
    pass the trim limit directly to the socket filter handling. Make use of
    it here as well, with sizeof(struct udphdr) as the lower cap limit, and
    drop the extra skb->len test in UDP's input path.

    Signed-off-by: Daniel Borkmann
    Cc: Willem de Bruijn
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
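
    For reference, a hedged sketch of the resulting call shape in the UDP
    input path (kernel-internal fragment, simplified):

        /* in udp_queue_rcv_skb()-like context: let the socket filter run,
         * but never let it trim the skb below the UDP header. */
        if (sk_filter_trim_cap(sk, skb, sizeof(struct udphdr)))
            goto drop;   /* filter dropped the packet */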
     

12 Jul, 2016

1 commit

  • If a socket filter truncates a UDP packet below the length of the UDP
    header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger
    a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash
    if the kernel is configured that way) can easily be triggered by an
    unprivileged user, which was reported as CVE-2016-6162. For a
    reproducer, see http://seclists.org/oss-sec/2016/q3/8

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Marco Grassi
    Signed-off-by: Michal Kubecek
    Acked-by: Eric Dumazet
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Michal Kubeček
     

15 Jun, 2016

1 commit

  • There is a corner case in which udp packets belonging to a same
    flow are hashed to different socket when hslot->count changes from 10
    to 11:

    1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
    and always passes 'daddr' to udp_ehashfn().

    2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
    but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
    INADDR_ANY instead of some specific addr.

    That means when hslot->count changes from 10 to 11, the hash calculated by
    udp_ehashfn() is also changed, and the udp packets belonging to a same
    flow will be hashed to different socket.

    This is easily reproduced:
    1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
    2) From the same host send udp packets to 127.0.0.1:40000, record the
    socket index which receives the packets.
    3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
    is 40000 + UDP_HASH_SIZE(4096); this makes the new socket go into the
    same hslot as the aforementioned 10 sockets, and makes the hslot->count
    change from 10 to 11.
    4) From the same host send udp packets to 127.0.0.1:40000, and the socket
    index which receives the packets will be different from the one received
    in step 2.
    This should not happen as the socket bound to 0.0.0.0:44096 should not
    change the behavior of the sockets bound to 0.0.0.0:40000.

    It's the same case for IPv6, and this patch also fixes that.

    Signed-off-by: Su, Xuemin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Su, Xuemin
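
    A minimal sketch of the reproducer described above (hypothetical
    program; ports follow the steps in the message):

        #include <stdio.h>
        #include <string.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        static int bind_udp(unsigned short port)
        {
            struct sockaddr_in a;
            int one = 1, fd = socket(AF_INET, SOCK_DGRAM, 0);

            /* SO_REUSEPORT lets many sockets share 0.0.0.0:40000 */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            memset(&a, 0, sizeof(a));
            a.sin_family = AF_INET;
            a.sin_port = htons(port);
            a.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&a, sizeof(a)) < 0)
                perror("bind");
            return fd;
        }

        int main(void)
        {
            int i;

            for (i = 0; i < 10; i++)       /* step 1: hslot->count = 10 */
                bind_udp(40000);
            bind_udp(40000 + 4096);        /* step 3: same hslot, count 11 */
            /* step 4: observe which socket now receives packets sent to
             * 127.0.0.1:40000 */
            return 0;
        }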