14 Apr, 2016

1 commit

  • Because we fail to wipe the remainder of i->addr[] in packet_mc_add(),
    pdiag_put_mclist() leaks uninitialized heap bytes via the
    PACKET_DIAG_MCLIST netlink attribute.

    Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

    Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module")
    Signed-off-by: Mathias Krause
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Mathias Krause
     

10 Mar, 2016

1 commit

  • Replace link layer header validation check ll_header_truncated with
    more generic dev_validate_header.

    Validation based on hard_header_len incorrectly drops valid packets
    in variable length protocols, such as AX25. dev_validate_header
    calls header_ops.validate for such protocols to ensure correctness
    below hard_header_len.

    See also http://comments.gmane.org/gmane.linux.network/401064

    Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

09 Feb, 2016

4 commits

  • Support socket option PACKET_VNET_HDR together with PACKET_TX_RING.

    When enabled, a struct virtio_net_hdr is expected to precede the data
    in the ring. The vnet option must be set before the ring is created.

    The implementation reuses the existing skb_copy_bits code that is used
    when dev->hard_header_len is non-zero. Move this ll_header check to
    before the skb alloc and combine it with a test for vnet_hdr->hdr_len.
    Allocate and copy the max of the two.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
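
    A minimal userspace sketch of the ordering described above: PACKET_VNET_HDR
    is enabled before PACKET_TX_RING is set up, so each frame written to the
    ring carries a struct virtio_net_hdr ahead of the packet data. This is not
    the referenced test program; the ring geometry is an arbitrary illustrative
    value and error handling is minimal. It only works on kernels that include
    this change.

    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <arpa/inet.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int on = 1;

        /* the vnet option must be set before the ring is created */
        if (setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &on, sizeof(on)))
            exit(1);

        /* arbitrary geometry: 64 blocks of 64 KiB, 2 KiB frames */
        struct tpacket_req req = {
            .tp_block_size = 1 << 16,
            .tp_frame_size = 1 << 11,
            .tp_block_nr   = 64,
            .tp_frame_nr   = ((1 << 16) / (1 << 11)) * 64,
        };
        if (setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req)))
            exit(1);

        /* packet data placed into a frame must now be preceded by a
         * struct virtio_net_hdr */
        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        (void)ring;
        return 0;
    }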
     
  • GSO packet headers must be stored in the linear skb segment.
    Move tpacket header parsing before sock_alloc_send_skb. The GSO
    follow-on patch will later increase the skb linear argument to
    sock_alloc_send_skb if needed for large packets.

    The header parsing code does not require an allocated skb, so is
    safe to move. Later pass to tpacket_fill_skb the computed data
    start and length.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Support socket option PACKET_VNET_HDR together with PACKET_RX_RING.
    When enabled, a struct virtio_net_hdr will precede the data in the
    packet ring slots.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c

    pkt: 1454269209.798420 len=5066
    vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off
    csum: start=34 off=16
    eth: proto=0x800
    ip: src= dst= proto=6 len=5052

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
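
    The sample output above comes from reading the struct virtio_net_hdr that
    now precedes the data in each RX ring slot. A small sketch of interpreting
    those fields; locating the header inside the slot is left to the caller
    (the referenced test program shows the exact layout), and the helper name
    is illustrative.

    #include <stdio.h>
    #include <linux/virtio_net.h>

    /* Print the vnet fields shown in the sample output above. */
    static void print_vnet(const struct virtio_net_hdr *vh)
    {
        printf("vnet: gso_type=%u gso_size=%u hlen=%u\n",
               vh->gso_type, vh->gso_size, vh->hdr_len);
        if (vh->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM)
            printf("csum: start=%u off=%u\n", vh->csum_start, vh->csum_offset);
    }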
     
  • packet_snd and packet_rcv support virtio net headers for GSO.
    Move this logic into helper functions to be able to reuse it in
    tpacket_snd and tpacket_rcv.

    This is a straightforward code move with one exception. Instead of
    creating and passing a separate gso_type variable, reuse
    vnet_hdr.gso_type after conversion from virtio to kernel gso type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Nov, 2015

1 commit

  • Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller
    than l2 header") added validation for the packet size in packet_snd.
    This change enforces that every packet needs a header (with at least
    hard_header_len bytes) plus a payload with at least one byte. Before
    this change the payload was optional.

    This fixes PPPoE connections which do not have a "Service" or
    "Host-Uniq" configured (which is violating the spec, but is still
    widely used in real-world setups). Those are currently failing with the
    following message: "pppd: packet size is too short (24
    Signed-off-by: David S. Miller

    Martin Blumenstingl
     

16 Nov, 2015

5 commits

  • Since its introduction in commit 69e3c75f4d54 ("net: TX_RING and
    packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
    side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
    reserve check should have reserve as 0, but currently, this is
    unconditionally set (in its original form as dev->hard_header_len).

    I think this is not correct since tpacket_fill_skb() would then
    take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
    the extra VLAN_HLEN could be possible in both cases. Presumably, the
    reserve code was copied from packet_snd(), but later on missed the
    check. Make it similar as we have it in packet_snd().

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
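
    A simplified sketch of the corrected check, not the literal patch; the
    variable names follow the surrounding tpacket_snd() code described above.

    /* The link layer header only counts against the limit when the
     * socket builds it itself (SOCK_RAW); for SOCK_DGRAM the device
     * adds the header, so reserve stays 0, mirroring packet_snd(). */
    unsigned int reserve = 0;

    if (po->sk.sk_type == SOCK_RAW)
        reserve = dev->hard_header_len;

    if (size_max > dev->mtu + reserve + VLAN_HLEN)
        size_max = dev->mtu + reserve + VLAN_HLEN;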
     
  • In case no struct sockaddr_ll has been passed to packet
    socket's sendmsg() when doing a TX_RING flush run, then
    skb->protocol is set to po->num instead, which is the protocol
    passed via socket(2)/bind(2).

    Applications only xmitting can go the path of allocating the
    socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
    TX_RING with sll_protocol of 0. That way, register_prot_hook()
    is neither called on creation nor on bind time, which saves
    cycles when there's no interest in capturing anyway.

    That leaves us however with po->num 0 instead and therefore
    the TX_RING flush run sets skb->protocol to 0 as well. Eric
    reported that this leads to problems when using tools like
    trafgen over bonding device. I.e. the bonding's hash function
    could invoke the kernel's flow dissector, which depends on
    skb->protocol being properly set. In the current situation, all
    the traffic is then directed to a single slave.

    Fix it up by inferring skb->protocol from the Ethernet header
    when not set and we have ARPHRD_ETHER device type. This is only
    done in case of SOCK_RAW and where we have a dev->hard_header_len
    length. In case of ARPHRD_ETHER devices, this is guaranteed to
    cover ETH_HLEN, and therefore being accessed on the skb after
    the skb_store_bits().

    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
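
    A sketch along the lines of the fix described above, assuming the Ethernet
    header has already been copied into the skb via skb_store_bits().

    /* If the bind left po->num (and hence skb->protocol) at 0, take the
     * protocol from the Ethernet header just stored.  Only meaningful
     * for ARPHRD_ETHER devices, where hard_header_len covers ETH_HLEN. */
    static void tpacket_set_protocol(const struct net_device *dev,
                                     struct sk_buff *skb)
    {
        if (dev->type == ARPHRD_ETHER) {
            skb_reset_mac_header(skb);
            skb->protocol = eth_hdr(skb)->h_proto;
        }
    }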
     
  • Packet sockets can be used by various net devices and are not
    really restricted to ARPHRD_ETHER device types. However, when
    currently checking for the extra 4 bytes that can be transmitted
    in VLAN case, our assumption is that we generally probe on
    ARPHRD_ETHER devices. Therefore, before looking into Ethernet
    header, check the device type first.

    This also fixes the issue where non-ARPHRD_ETHER devices could
    have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
    the check would test unfilled linear part of the skb (instead
    of non-linear).

    Fixes: 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
    Fixes: 52f1454f629f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We concluded that the skb_probe_transport_header() should better be
    called unconditionally. Avoiding the call into the flow dissector has
    also not really much to do with the direct xmit mode.

    While it seems that only virtio_net code makes use of GSO from non
    RX/TX ring packet socket paths, we should probe for a transport header
    nevertheless before they hit devices.

    Reference: http://thread.gmane.org/gmane.linux.network/386173/
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In tpacket_fill_skb() commit c1aad275b029 ("packet: set transport
    header before doing xmit") and later on 40893fd0fd4e ("net: switch
    to use skb_probe_transport_header()") was probing for a transport
    header on the skb from a ring buffer slot, but at a time, where
    the skb has _not even_ been filled with data yet. So that call into
    the flow dissector is pretty useless. Lets do it after we've set
    up the skb frags.

    Fixes: c1aad275b029 ("packet: set transport header before doing xmit")
    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Nov, 2015

1 commit

  • There is a race condition between packet_notifier and packet_bind{_spkt}.

    It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
    time packet_bind{_spkt} takes a reference on the new netdevice and the
    time packet_do_bind sets po->ifindex.
    In this case the notification can be missed.
    If this happens during a dev_change_net_namespace this can result in the
    netdevice to be moved to the new namespace while the packet_sock in the
    old namespace still holds a reference on it. When the netdevice is later
    deleted in the new namespace the deletion hangs since the packet_sock
    is not found in the new namespace's &net->packet.sklist.
    It can be reproduced with the script below.

    This patch makes packet_do_bind check again for the presence of the
    netdevice in the packet_sock's namespace after the synchronize_net
    in unregister_prot_hook.
    More in general it also uses the rcu lock for the duration of the bind
    to stop dev_change_net_namespace/rollback_registered_many from
    going past the synchronize_net following unlist_netdevice, so that
    no NETDEV_UNREGISTER notifications can happen on the new netdevice
    while the bind is executing. In order to do this some code from
    packet_bind{_spkt} is consolidated into packet_do_bind.

    import socket, os, time, sys
    proto=7
    realDev='em1'
    vlanId=400
    if len(sys.argv) > 1:
        vlanId=int(sys.argv[1])
    dev='vlan%d' % vlanId

    os.system('taskset -p 0x10 %d' % os.getpid())

    s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
    os.system('ip link add link %s name %s type vlan id %d' %
              (realDev, dev, vlanId))
    os.system('ip netns add dummy')

    pid=os.fork()

    if pid == 0:
        # dev should be moved while packet_do_bind is in synchronize net
        os.system('taskset -p 0x20000 %d' % os.getpid())
        os.system('ip link set %s netns dummy' % dev)
        os.system('ip netns exec dummy ip link del %s' % dev)
        s.close()
        sys.exit(0)

    time.sleep(.004)
    try:
        s.bind(('%s' % dev, proto+1))
    except:
        print 'Could not bind socket'
        s.close()
        os.system('ip netns del dummy')
        sys.exit(0)

    os.waitpid(pid, 0)
    s.close()
    os.system('ip netns del dummy')
    sys.exit(0)

    Signed-off-by: Francesco Ruggeri
    Signed-off-by: David S. Miller

    Francesco Ruggeri
     

13 Oct, 2015

3 commits

  • The function ip_defrag is called on both the input and the output
    paths of the networking stack. In particular conntrack when it is
    tracking outbound packets from the local machine calls ip_defrag.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
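
    The resulting interface, sketched: callers now supply their own struct net
    instead of ip_defrag deriving the namespace from the skb. The wrapper below
    is purely illustrative; IP_DEFRAG_CONNTRACK_OUT is one representative user.

    /* interface after this change */
    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);

    /* e.g. conntrack defragmenting locally generated packets (sketch) */
    static int gather_frags(struct net *net, struct sk_buff *skb)
    {
        return ip_defrag(net, skb, IP_DEFRAG_CONNTRACK_OUT);
    }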
     
  • Recent TCP listener patches exposed a prior af_packet bug :
    match_fanout_group() blindly assumes it is always safe
    to cast sk to a packet socket to compare fanout with af_packet_priv

    But SYNACK packets can be sent while attached to request_sock, which
    are smaller than a "struct sock".

    We can read non existent memory and crash.

    Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group")
    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Edward Hyunkoo Jee
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Edward Jee
     

11 Oct, 2015

1 commit

  • eBPF socket filter programs may see junk in 'u32 cb[5]' area,
    since it could have been used by protocol layers earlier.

    For socket filter programs used in af_packet we need to clean
    20 bytes of skb->cb area if it could be used by the program.
    For programs attached to TCP/UDP sockets we need to save/restore
    these 20 bytes, since it's used by protocol layers.

    Remove SK_RUN_FILTER macro, since it's no longer used.

    Long term we may move this bpf cb area to per-cpu scratch, but that
    requires addition of new 'per-cpu load/store' instructions,
    so not suitable as a short term fix.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

05 Oct, 2015

1 commit

  • The current ongoing effort to dump existing cBPF seccomp filters back
    to user space requires to hold the pre-transformed instructions like
    we do in case of socket filters from sk_attach_filter() side, so they
    can be reloaded in original form at a later point in time by utilities
    such as criu.

    To prepare for this, simply extend the bpf_prog_create_from_user()
    API to hold a flag that tells whether we should store the original
    or not. Also, fanout filters could make use of that in future for
    things like diag. While fanout filters already use bpf_prog_destroy(),
    move seccomp over to them as well to handle original programs when
    present.

    Signed-off-by: Daniel Borkmann
    Cc: Tycho Andersen
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Alexei Starovoitov
    Tested-by: Tycho Andersen
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

24 Sep, 2015

1 commit

  • Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
    accessors") accidentally changed the virtio_net header used by
    AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

    Since virtio_legacy_is_little_endian() is a very long identifier,
    define a vio_le macro and use that throughout the code instead of the
    hard-coded 'false' for little-endian.

    This restores the ABI to match 4.1 and earlier kernels, and makes my
    test program work again.

    Signed-off-by: David Woodhouse
    Signed-off-by: David S. Miller

    David Woodhouse
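
    The macro this change introduces, with a representative conversion of one
    vnet header field (the small helper is illustrative, not from the patch).

    #include <linux/virtio_net.h>
    #include <linux/virtio_byteorder.h>

    #define vio_le() virtio_legacy_is_little_endian()

    /* read a field from the legacy (host-endian) virtio_net header
     * exchanged via PACKET_VNET_HDR */
    static __u16 vnet_hdr_len(const struct virtio_net_hdr *vh)
    {
        return __virtio16_to_cpu(vio_le(), vh->hdr_len);
    }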
     

18 Aug, 2015

2 commits

  • Add fanout mode PACKET_FANOUT_EBPF that accepts an extended BPF
    program to select a socket.

    Update the internal eBPF program by passing to socket option
    SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
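
    A userspace sketch of the new knobs: join a fanout group with
    PACKET_FANOUT_EBPF, then install the program through PACKET_FANOUT_DATA.
    prog_fd is assumed to come from bpf(BPF_PROG_LOAD, ...) for a socket
    filter program; the helper name is illustrative.

    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* The program's return value selects the socket index in the group. */
    static int join_ebpf_fanout(int sock, unsigned short group_id, int prog_fd)
    {
        int fanout = group_id | (PACKET_FANOUT_EBPF << 16);

        if (setsockopt(sock, SOL_PACKET, PACKET_FANOUT,
                       &fanout, sizeof(fanout)))
            return -1;
        return setsockopt(sock, SOL_PACKET, PACKET_FANOUT_DATA,
                          &prog_fd, sizeof(prog_fd));
    }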
     
  • Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
    to select a socket.

    This avoids having to keep adding special case fanout modes. One
    example use case is application layer load balancing. The QUIC
    protocol, for instance, encodes a connection ID in UDP payload.

    Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
    associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
    only user so far.

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
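
    A similar sketch for the classic BPF variant: here PACKET_FANOUT_DATA
    takes a struct sock_fprog. The one-instruction filter below always returns
    0 (select the first socket); a real filter would hash on payload bytes such
    as a QUIC connection ID. The helper name is illustrative.

    #include <sys/socket.h>
    #include <linux/filter.h>
    #include <linux/if_packet.h>

    static int join_cbpf_fanout(int sock, unsigned short group_id)
    {
        /* trivial cBPF program: always return index 0 */
        struct sock_filter code[] = {
            { BPF_RET | BPF_K, 0, 0, 0 },
        };
        struct sock_fprog prog = { .len = 1, .filter = code };
        int fanout = group_id | (PACKET_FANOUT_CBPF << 16);

        if (setsockopt(sock, SOL_PACKET, PACKET_FANOUT,
                       &fanout, sizeof(fanout)))
            return -1;
        return setsockopt(sock, SOL_PACKET, PACKET_FANOUT_DATA,
                          &prog, sizeof(prog));
    }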
     

29 Jul, 2015

1 commit

  • tpacket_fill_skb() can return a negative value (-errno) which
    is stored in tp_len variable. In that case the following
    condition will be (but shouldn't be) true:

    tp_len > dev->mtu + dev->hard_header_len

    as dev->mtu and dev->hard_header_len are both unsigned.

    That may lead to just returning an incorrect EMSGSIZE errno
    to the user.

    Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Alexander Drozdov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexander Drozdov
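
    A standalone illustration of the pitfall (userspace C, not the kernel
    code): the negative error code is converted to unsigned for the comparison
    and looks like an enormous length, so the size check masks the real error.

    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        int tp_len = -EINVAL;                /* error from tpacket_fill_skb() */
        unsigned int mtu = 1500, hlen = 14;  /* both unsigned in the kernel   */

        /* tp_len is converted to unsigned here, so -22 wraps to a huge
         * value and the bogus EMSGSIZE path is taken */
        if (tp_len > mtu + hlen)
            printf("bogus EMSGSIZE path taken\n");

        /* fix pattern: test the sign before the size comparison */
        if (tp_len < 0)
            printf("propagate the real error: %d\n", tp_len);
        return 0;
    }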
     

28 Jul, 2015

1 commit

  • When binding a PF_PACKET socket, the use count of the bound interface is
    always increased with dev_hold in dev_get_by_{index,name}. However,
    when rebound with the same protocol and device as in the previous bind
    the use count of the interface was not decreased. Ultimately, this
    caused the deletion of the interface to fail with the following message:

    unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

    This patch moves the dev_put out of the conditional part that was only
    executed when either the protocol or device changed on a bind.

    Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases')
    Signed-off-by: Lars Westerhoff
    Signed-off-by: Dan Carpenter
    Reviewed-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Lars Westerhoff
     

23 Jun, 2015

1 commit

  • Remove handling of tx_ring in prb_setup_retire_blk_timer
    for TPACKET_V3: init_prb_bdqc is called only for zero tx_ring, and
    thus prb_setup_retire_blk_timer is only ever called with zero tx_ring.

    The function init_prb_bdqc also makes no use of tx_ring, so remove
    the tx_ring argument from init_prb_bdqc as well.

    Signed-off-by: Maninder Singh
    Suggested-by: Frans Klaver
    Signed-off-by: David S. Miller

    Maninder Singh
     

22 Jun, 2015

3 commits

  • PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
    f->num_members. It returns the old value unconditionally, but
    f->num_members may have changed since the last store. Ensure
    that the return value is always < num.

    When modifying the logic, simplify it further by replacing the loop
    with an unconditional atomic increment.

    Fixes: dc99f600698d ("packet: Add fanout support.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
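
    A sketch of the simplified round-robin selection after this change
    (kernel-style, not the literal diff): one unconditional atomic increment,
    with the modulo applied to the value that is actually returned.

    static unsigned int fanout_demux_lb(struct packet_fanout *f,
                                        struct sk_buff *skb,
                                        unsigned int num)
    {
        /* always < num, even if f->num_members changed since last time */
        return atomic_inc_return(&f->rr_cur) % num;
    }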
     
  • Destruction of the po->rollover must be delayed until there are no
    more packets in flight that can access it. The field is destroyed in
    packet_release, before synchronize_net. Delay using rcu.

    Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • We need to tell the compiler it must not read f->num_members multiple
    times. Otherwise testing if num is not zero is flaky, and we could
    attempt an invalid divide by 0 in fanout_demux_cpu()

    Note bug was present in packet_rcv_fanout_hash() and
    packet_rcv_fanout_lb() but final 3.1 had a simple location
    after commit 95ec3eb417115fb ("packet: Add 'cpu' fanout policy.")

    Fixes: dc99f600698dc ("packet: Add fanout support.")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
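
    A fragment-level sketch of the pattern (the wrapper name is illustrative,
    not from the patch): read f->num_members exactly once so the zero test and
    the later modulo cannot observe two different values.

    static int pick_member(struct packet_fanout *f, struct sk_buff *skb)
    {
        unsigned int num = READ_ONCE(f->num_members);  /* read exactly once */

        if (!num)
            return -1;      /* no members: never divide by zero below */

        /* every demux helper, e.g. fanout_demux_cpu(), now divides by
         * this same stable value */
        return fanout_demux_cpu(f, skb, num);
    }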
     

18 May, 2015

1 commit

  • Rollover can be enabled as flag or mode. Allocate state in both cases.
    This solves a NULL pointer exception in fanout_demux_rollover on
    referencing po->rollover if using mode rollover.

    Also make sure that in rollover mode each silo is tried (contrary
    to rollover flag, where the main socket is excluded after an initial
    try_self).

    Tested:
    Passes tools/testing/net/psock_fanout.c, which tests both modes and
    flag. My previous tests were limited to bench_rollover, which only
    stresses the flag. The test now completes safely. It still gives an
    error for mode rollover, because it does not expect the new headroom
    (ROOM_NORMAL) requirement. I will send a separate patch to the test.

    Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

    Signed-off-by: Willem de Bruijn

    ----

    I should have run this test and caught this before submission, of
    course. Apologies for the oversight.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

15 May, 2015

1 commit

  • Avoid two xchg calls whose return values were unused, causing a
    warning on some architectures.

    The relevant variable is a hint and read without mutual exclusion.
    This fix makes all writers hold the receive_queue lock.

    Suggested-by: David S. Miller
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

14 May, 2015

4 commits

  • Rollover indicates exceptional conditions. Export a counter to inform
    socket owners of this state.

    If no socket with sufficient room is found, rollover fails. Also count
    these events.

    Finally, also count when flows are rolled over early thanks to huge
    flow detection, to validate its correctness.

    Tested:
    Read counters in bench_rollover on all other tests in the patchset

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Migrate flows from a socket to another socket in the fanout group not
    only when the socket is full. Start migrating huge flows early, to
    divert possible 4-tuple attacks without affecting normal traffic.

    Introduce fanout_flow_is_huge(). This detects huge flows, which are
    defined as taking up more than half the load. It does so cheaply, by
    storing the rxhashes of the N most recent packets. If over half of
    these are the same rxhash as the current packet, then drop it. This
    only protects against 4-tuple attacks. N is chosen to fit all data in
    a single cache line.

    Tested:
    Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
    cpu rx rx.k drop.k rollover r.huge r.failed
    0 14 14 0 0 0 0
    1 20 20 0 0 0 0
    2 16 16 0 0 0 0
    3 6168824 6168824 0 4867721 4867721 0
    4 4867741 4867741 0 0 0 0
    5 12 12 0 0 0 0
    6 15 15 0 0 0 0
    7 17 17 0 0 0 0

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
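
    A hedged sketch of the detection logic described above, close to but not
    necessarily identical with the merged code; ROLLOVER_HLEN is the history
    length, chosen so the array fits in a single cache line.

    /* Count how many of the ROLLOVER_HLEN recorded rxhashes match this
     * skb's hash; if more than half do, the flow is "huge".  One history
     * slot is overwritten per call, a cheap approximation of tracking
     * the N most recent packets. */
    static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
    {
        u32 rxhash = skb_get_hash(skb);
        int i, count = 0;

        for (i = 0; i < ROLLOVER_HLEN; i++)
            if (po->rollover->history[i] == rxhash)
                count++;

        po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;
        return count > ROLLOVER_HLEN / 2;
    }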
     
  • Rollover has to call packet_rcv_has_room on sockets in the fanout
    group to find a socket to migrate to. This operation is expensive
    especially if the packet sockets use rings, when a lock has to be
    acquired.

    Avoid pounding on the lock by all sockets by temporarily marking a
    socket as "under memory pressure" when such pressure is detected.
    While set, only the socket owner may call packet_rcv_has_room on the
    socket. Once it detects normal conditions, it clears the flag. The
    socket is not used as a victim by any other socket in the meantime.

    Under reasonably balanced load, each socket writer frequently calls
    packet_rcv_has_room and clears its own pressure field. As a backup
    for when the socket is rarely written to, also clear the flag on
    reading (packet_recvmsg, packet_poll) if this can be done cheaply
    (i.e., without calling packet_rcv_has_room). This is only for
    edge cases.

    Tested:
    Ran bench_rollover: a process with 8 sockets in a single fanout
    group, each pinned to a single cpu that receives one nic recv
    interrupt. RPS and RFS are disabled. The benchmark uses packet
    rx_ring, which has to take a lock when determining whether a
    socket has room.

    Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
    uniformly across the packet sockets (and inserted an iptables
    rule to drop in PREROUTING to avoid protocol stack processing).

    Without this patch, all sockets try to migrate traffic to
    neighbors, causing lock contention when searching for a non-
    empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock
    - _raw_spin_lock
    - 99.00% spin_lock
    + 81.77% packet_rcv_has_room.isra.41
    + 18.23% tpacket_rcv
    + 0.84% packet_rcv_has_room.isra.41
    + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock
    + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41

    On net-next with this patch, this lock contention is no longer a
    top entry. Most time is spent in the actual read function. Next up
    are other locks:

    + 15.52% bench_rollover bench_rollover [.] reader
    + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms
    + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51
    + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms
    + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv
    + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq

    Looking closer at the remaining _raw_spin_lock, the cost of probing
    in rollover is now comparable to the cost of taking the lock later
    in tpacket_rcv.

    - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock
    - _raw_spin_lock
    + 33.41% packet_rcv_has_room
    + 28.15% tpacket_rcv
    + 19.54% enqueue_to_backlog
    + 6.45% __free_pages_ok
    + 2.78% packet_rcv_fanout
    + 2.13% fanout_demux_rollover
    + 2.01% netif_receive_skb_internal

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Only migrate flows to sockets that have sufficient headroom, where
    sufficient is defined as having at least 25% empty space.

    The kernel has three different buffer types: a regular socket, a ring
    with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
    latter two do not expose a read pointer to the kernel, so headroom is
    not computed easily. All three need a different implementation to
    estimate free space.

    Tested:
    Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    bench_rollover has as many sockets as there are NIC receive queues
    in the system. Each socket is owned by a process that is pinned to
    one of the receive cpus. RFS is disabled. RPS is enabled with an
    identity mapping (cpu x -> cpu x), to count drops with softnettop.

    lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
    Press [Enter] to exit

    cpu rx rx.k drop.k rollover r.huge r.failed
    0 16 16 0 0 0 0
    1 21 21 0 0 0 0
    2 5227502 5227502 0 0 0 0
    3 18 18 0 0 0 0
    4 6083289 6083289 0 5227496 0 0
    5 22 22 0 0 0 0
    6 21 21 0 0 0 0
    7 9 9 0 0 0 0

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn