14 Apr, 2016

1 commit

  • Because we fail to wipe the remainder of i->addr[] in packet_mc_add(),
    pdiag_put_mclist() leaks uninitialized heap bytes via the
    PACKET_DIAG_MCLIST netlink attribute.

    Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

    Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module")
    Signed-off-by: Mathias Krause
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Mathias Krause
     

10 Mar, 2016

1 commit

  • Replace link layer header validation check ll_header_truncated with
    more generic dev_validate_header.

    Validation based on hard_header_len incorrectly drops valid packets
    in variable length protocols, such as AX25. dev_validate_header
    calls header_ops.validate for such protocols to ensure correctness
    below hard_header_len.

    See also http://comments.gmane.org/gmane.linux.network/401064

    Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

09 Feb, 2016

4 commits

  • Support socket option PACKET_VNET_HDR together with PACKET_TX_RING.

    When enabled, a struct virtio_net_hdr is expected to precede the data
    in the ring. The vnet option must be set before the ring is created.

    The implementation reuses the existing skb_copy_bits code that is used
    when dev->hard_header_len is non-zero. Move this ll_header check to
    before the skb alloc and combine it with a test for vnet_hdr->hdr_len.
    Allocate and copy the max of the two.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
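
    A minimal userspace sketch of the ordering described above: PACKET_VNET_HDR
    is enabled before PACKET_TX_RING is set up, so each frame written to the
    ring carries a struct virtio_net_hdr ahead of the packet data. This is not
    the referenced test program; the ring geometry is an arbitrary illustrative
    value and error handling is minimal. It only works on kernels that include
    this change.

    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <arpa/inet.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int on = 1;

        /* the vnet option must be set before the ring is created */
        if (setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &on, sizeof(on)))
            exit(1);

        /* arbitrary geometry: 64 blocks of 64 KiB, 2 KiB frames */
        struct tpacket_req req = {
            .tp_block_size = 1 << 16,
            .tp_frame_size = 1 << 11,
            .tp_block_nr   = 64,
            .tp_frame_nr   = ((1 << 16) / (1 << 11)) * 64,
        };
        if (setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req)))
            exit(1);

        /* packet data placed into a frame must now be preceded by a
         * struct virtio_net_hdr */
        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        (void)ring;
        return 0;
    }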
     
  • GSO packet headers must be stored in the linear skb segment.
    Move tpacket header parsing before sock_alloc_send_skb. The GSO
    follow-on patch will later increase the skb linear argument to
    sock_alloc_send_skb if needed for large packets.

    The header parsing code does not require an allocated skb, so is
    safe to move. Later pass to tpacket_fill_skb the computed data
    start and length.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Support socket option PACKET_VNET_HDR together with PACKET_RX_RING.
    When enabled, a struct virtio_net_hdr will precede the data in the
    packet ring slots.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c

    pkt: 1454269209.798420 len=5066
    vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off
    csum: start=34 off=16
    eth: proto=0x800
    ip: src= dst= proto=6 len=5052

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
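
    The sample output above comes from reading the struct virtio_net_hdr that
    now precedes the data in each RX ring slot. A small sketch of interpreting
    those fields; locating the header inside the slot is left to the caller
    (the referenced test program shows the exact layout), and the helper name
    is illustrative.

    #include <stdio.h>
    #include <linux/virtio_net.h>

    /* Print the vnet fields shown in the sample output above. */
    static void print_vnet(const struct virtio_net_hdr *vh)
    {
        printf("vnet: gso_type=%u gso_size=%u hlen=%u\n",
               vh->gso_type, vh->gso_size, vh->hdr_len);
        if (vh->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM)
            printf("csum: start=%u off=%u\n", vh->csum_start, vh->csum_offset);
    }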
     
  • packet_snd and packet_rcv support virtio net headers for GSO.
    Move this logic into helper functions to be able to reuse it in
    tpacket_snd and tpacket_rcv.

    This is a straightforward code move with one exception. Instead of
    creating and passing a separate gso_type variable, reuse
    vnet_hdr.gso_type after conversion from virtio to kernel gso type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Nov, 2015

1 commit

  • Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller
    than l2 header") added validation for the packet size in packet_snd.
    This change enforces that every packet needs a header (with at least
    hard_header_len bytes) plus a payload with at least one byte. Before
    this change the payload was optional.

    This fixes PPPoE connections which do not have a "Service" or
    "Host-Uniq" configured (which is violating the spec, but is still
    widely used in real-world setups). Those are currently failing with the
    following message: "pppd: packet size is too short (24
    Signed-off-by: David S. Miller

    Martin Blumenstingl
     

16 Nov, 2015

5 commits

  • Since its introduction in commit 69e3c75f4d54 ("net: TX_RING and
    packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
    side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
    reserve check should have reserve as 0, but currently, this is
    unconditionally set (in its original form as dev->hard_header_len).

    I think this is not correct since tpacket_fill_skb() would then
    take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
    the extra VLAN_HLEN could be possible in both cases. Presumably, the
    reserve code was copied from packet_snd(), but later on missed the
    check. Make it similar as we have it in packet_snd().

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
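
    A simplified sketch of the corrected check, not the literal patch; the
    variable names follow the surrounding tpacket_snd() code described above.

    /* The link layer header only counts against the limit when the
     * socket builds it itself (SOCK_RAW); for SOCK_DGRAM the device
     * adds the header, so reserve stays 0, mirroring packet_snd(). */
    unsigned int reserve = 0;

    if (po->sk.sk_type == SOCK_RAW)
        reserve = dev->hard_header_len;

    if (size_max > dev->mtu + reserve + VLAN_HLEN)
        size_max = dev->mtu + reserve + VLAN_HLEN;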
     
  • In case no struct sockaddr_ll has been passed to packet
    socket's sendmsg() when doing a TX_RING flush run, then
    skb->protocol is set to po->num instead, which is the protocol
    passed via socket(2)/bind(2).

    Applications only xmitting can go the path of allocating the
    socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
    TX_RING with sll_protocol of 0. That way, register_prot_hook()
    is neither called on creation nor on bind time, which saves
    cycles when there's no interest in capturing anyway.

    That leaves us however with po->num 0 instead and therefore
    the TX_RING flush run sets skb->protocol to 0 as well. Eric
    reported that this leads to problems when using tools like
    trafgen over bonding device. I.e. the bonding's hash function
    could invoke the kernel's flow dissector, which depends on
    skb->protocol being properly set. In the current situation, all
    the traffic is then directed to a single slave.

    Fix it up by inferring skb->protocol from the Ethernet header
    when not set and we have ARPHRD_ETHER device type. This is only
    done in case of SOCK_RAW and where we have a dev->hard_header_len
    length. In case of ARPHRD_ETHER devices, this is guaranteed to
    cover ETH_HLEN, and therefore being accessed on the skb after
    the skb_store_bits().

    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
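
    A sketch along the lines of the fix described above, assuming the Ethernet
    header has already been copied into the skb via skb_store_bits().

    /* If the bind left po->num (and hence skb->protocol) at 0, take the
     * protocol from the Ethernet header just stored.  Only meaningful
     * for ARPHRD_ETHER devices, where hard_header_len covers ETH_HLEN. */
    static void tpacket_set_protocol(const struct net_device *dev,
                                     struct sk_buff *skb)
    {
        if (dev->type == ARPHRD_ETHER) {
            skb_reset_mac_header(skb);
            skb->protocol = eth_hdr(skb)->h_proto;
        }
    }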
     
  • Packet sockets can be used by various net devices and are not
    really restricted to ARPHRD_ETHER device types. However, when
    currently checking for the extra 4 bytes that can be transmitted
    in VLAN case, our assumption is that we generally probe on
    ARPHRD_ETHER devices. Therefore, before looking into Ethernet
    header, check the device type first.

    This also fixes the issue where non-ARPHRD_ETHER devices could
    have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
    the check would test unfilled linear part of the skb (instead
    of non-linear).

    Fixes: 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
    Fixes: 52f1454f629f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We concluded that the skb_probe_transport_header() should better be
    called unconditionally. Avoiding the call into the flow dissector has
    also not really much to do with the direct xmit mode.

    While it seems that only virtio_net code makes use of GSO from non
    RX/TX ring packet socket paths, we should probe for a transport header
    nevertheless before they hit devices.

    Reference: http://thread.gmane.org/gmane.linux.network/386173/
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In tpacket_fill_skb() commit c1aad275b029 ("packet: set transport
    header before doing xmit") and later on 40893fd0fd4e ("net: switch
    to use skb_probe_transport_header()") was probing for a transport
    header on the skb from a ring buffer slot, but at a time, where
    the skb has _not even_ been filled with data yet. So that call into
    the flow dissector is pretty useless. Lets do it after we've set
    up the skb frags.

    Fixes: c1aad275b029 ("packet: set transport header before doing xmit")
    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Nov, 2015

1 commit

  • There is a race condition between packet_notifier and packet_bind{_spkt}.

    It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
    time packet_bind{_spkt} takes a reference on the new netdevice and the
    time packet_do_bind sets po->ifindex.
    In this case the notification can be missed.
    If this happens during a dev_change_net_namespace this can result in the
    netdevice to be moved to the new namespace while the packet_sock in the
    old namespace still holds a reference on it. When the netdevice is later
    deleted in the new namespace the deletion hangs since the packet_sock
    is not found in the new namespace's &net->packet.sklist.
    It can be reproduced with the script below.

    This patch makes packet_do_bind check again for the presence of the
    netdevice in the packet_sock's namespace after the synchronize_net
    in unregister_prot_hook.
    More in general it also uses the rcu lock for the duration of the bind
    to stop dev_change_net_namespace/rollback_registered_many from
    going past the synchronize_net following unlist_netdevice, so that
    no NETDEV_UNREGISTER notifications can happen on the new netdevice
    while the bind is executing. In order to do this some code from
    packet_bind{_spkt} is consolidated into packet_do_bind.

    import socket, os, time, sys
    proto=7
    realDev='em1'
    vlanId=400
    if len(sys.argv) > 1:
        vlanId=int(sys.argv[1])
    dev='vlan%d' % vlanId

    os.system('taskset -p 0x10 %d' % os.getpid())

    s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
    os.system('ip link add link %s name %s type vlan id %d' %
              (realDev, dev, vlanId))
    os.system('ip netns add dummy')

    pid=os.fork()

    if pid == 0:
        # dev should be moved while packet_do_bind is in synchronize net
        os.system('taskset -p 0x20000 %d' % os.getpid())
        os.system('ip link set %s netns dummy' % dev)
        os.system('ip netns exec dummy ip link del %s' % dev)
        s.close()
        sys.exit(0)

    time.sleep(.004)
    try:
        s.bind(('%s' % dev, proto+1))
    except:
        print 'Could not bind socket'
        s.close()
        os.system('ip netns del dummy')
        sys.exit(0)

    os.waitpid(pid, 0)
    s.close()
    os.system('ip netns del dummy')
    sys.exit(0)

    Signed-off-by: Francesco Ruggeri
    Signed-off-by: David S. Miller

    Francesco Ruggeri
     

13 Oct, 2015

3 commits

  • The function ip_defrag is called on both the input and the output
    paths of the networking stack. In particular conntrack when it is
    tracking outbound packets from the local machine calls ip_defrag.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
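
    The resulting interface, sketched: callers now supply their own struct net
    instead of ip_defrag deriving the namespace from the skb. The wrapper below
    is purely illustrative; IP_DEFRAG_CONNTRACK_OUT is one representative user.

    /* interface after this change */
    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);

    /* e.g. conntrack defragmenting locally generated packets (sketch) */
    static int gather_frags(struct net *net, struct sk_buff *skb)
    {
        return ip_defrag(net, skb, IP_DEFRAG_CONNTRACK_OUT);
    }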
     
  • Recent TCP listener patches exposed a prior af_packet bug :
    match_fanout_group() blindly assumes it is always safe
    to cast sk to a packet socket to compare fanout with af_packet_priv

    But SYNACK packets can be sent while attached to request_sock, which
    are smaller than a "struct sock".

    We can read non existent memory and crash.

    Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group")
    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Edward Hyunkoo Jee
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Edward Jee
     

11 Oct, 2015

1 commit

  • eBPF socket filter programs may see junk in 'u32 cb[5]' area,
    since it could have been used by protocol layers earlier.

    For socket filter programs used in af_packet we need to clean
    20 bytes of skb->cb area if it could be used by the program.
    For programs attached to TCP/UDP sockets we need to save/restore
    these 20 bytes, since it's used by protocol layers.

    Remove SK_RUN_FILTER macro, since it's no longer used.

    Long term we may move this bpf cb area to per-cpu scratch, but that
    requires addition of new 'per-cpu load/store' instructions,
    so not suitable as a short term fix.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

05 Oct, 2015

1 commit

  • The current ongoing effort to dump existing cBPF seccomp filters back
    to user space requires to hold the pre-transformed instructions like
    we do in case of socket filters from sk_attach_filter() side, so they
    can be reloaded in original form at a later point in time by utilities
    such as criu.

    To prepare for this, simply extend the bpf_prog_create_from_user()
    API to hold a flag that tells whether we should store the original
    or not. Also, fanout filters could make use of that in future for
    things like diag. While fanout filters already use bpf_prog_destroy(),
    move seccomp over to them as well to handle original programs when
    present.

    Signed-off-by: Daniel Borkmann
    Cc: Tycho Andersen
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Alexei Starovoitov
    Tested-by: Tycho Andersen
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

24 Sep, 2015

1 commit

  • Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
    accessors") accidentally changed the virtio_net header used by
    AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

    Since virtio_legacy_is_little_endian() is a very long identifier,
    define a vio_le macro and use that throughout the code instead of the
    hard-coded 'false' for little-endian.

    This restores the ABI to match 4.1 and earlier kernels, and makes my
    test program work again.

    Signed-off-by: David Woodhouse
    Signed-off-by: David S. Miller

    David Woodhouse
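
    The macro this change introduces, with a representative conversion of one
    vnet header field (the small helper is illustrative, not from the patch).

    #include <linux/virtio_net.h>
    #include <linux/virtio_byteorder.h>

    #define vio_le() virtio_legacy_is_little_endian()

    /* read a field from the legacy (host-endian) virtio_net header
     * exchanged via PACKET_VNET_HDR */
    static __u16 vnet_hdr_len(const struct virtio_net_hdr *vh)
    {
        return __virtio16_to_cpu(vio_le(), vh->hdr_len);
    }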
     

18 Aug, 2015

2 commits

  • Add fanout mode PACKET_FANOUT_EBPF that accepts an extended BPF
    program to select a socket.

    Update the internal eBPF program by passing to socket option
    SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
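
    A userspace sketch of the new knobs: join a fanout group with
    PACKET_FANOUT_EBPF, then install the program through PACKET_FANOUT_DATA.
    prog_fd is assumed to come from bpf(BPF_PROG_LOAD, ...) for a socket
    filter program; the helper name is illustrative.

    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* The program's return value selects the socket index in the group. */
    static int join_ebpf_fanout(int sock, unsigned short group_id, int prog_fd)
    {
        int fanout = group_id | (PACKET_FANOUT_EBPF << 16);

        if (setsockopt(sock, SOL_PACKET, PACKET_FANOUT,
                       &fanout, sizeof(fanout)))
            return -1;
        return setsockopt(sock, SOL_PACKET, PACKET_FANOUT_DATA,
                          &prog_fd, sizeof(prog_fd));
    }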
     
  • Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
    to select a socket.

    This avoids having to keep adding special case fanout modes. One
    example use case is application layer load balancing. The QUIC
    protocol, for instance, encodes a connection ID in UDP payload.

    Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
    associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
    only user so far.

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
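
    A similar sketch for the classic BPF variant: here PACKET_FANOUT_DATA
    takes a struct sock_fprog. The one-instruction filter below always returns
    0 (select the first socket); a real filter would hash on payload bytes such
    as a QUIC connection ID. The helper name is illustrative.

    #include <sys/socket.h>
    #include <linux/filter.h>
    #include <linux/if_packet.h>

    static int join_cbpf_fanout(int sock, unsigned short group_id)
    {
        /* trivial cBPF program: always return index 0 */
        struct sock_filter code[] = {
            { BPF_RET | BPF_K, 0, 0, 0 },
        };
        struct sock_fprog prog = { .len = 1, .filter = code };
        int fanout = group_id | (PACKET_FANOUT_CBPF << 16);

        if (setsockopt(sock, SOL_PACKET, PACKET_FANOUT,
                       &fanout, sizeof(fanout)))
            return -1;
        return setsockopt(sock, SOL_PACKET, PACKET_FANOUT_DATA,
                          &prog, sizeof(prog));
    }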
     

29 Jul, 2015

1 commit

  • tpacket_fill_skb() can return a negative value (-errno) which
    is stored in tp_len variable. In that case the following
    condition will be (but shouldn't be) true:

    tp_len > dev->mtu + dev->hard_header_len

    as dev->mtu and dev->hard_header_len are both unsigned.

    That may lead to just returning an incorrect EMSGSIZE errno
    to the user.

    Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Alexander Drozdov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexander Drozdov
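
    A standalone illustration of the pitfall (userspace C, not the kernel
    code): the negative error code is converted to unsigned for the comparison
    and looks like an enormous length, so the size check masks the real error.

    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        int tp_len = -EINVAL;                /* error from tpacket_fill_skb() */
        unsigned int mtu = 1500, hlen = 14;  /* both unsigned in the kernel   */

        /* tp_len is converted to unsigned here, so -22 wraps to a huge
         * value and the bogus EMSGSIZE path is taken */
        if (tp_len > mtu + hlen)
            printf("bogus EMSGSIZE path taken\n");

        /* fix pattern: test the sign before the size comparison */
        if (tp_len < 0)
            printf("propagate the real error: %d\n", tp_len);
        return 0;
    }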
     

28 Jul, 2015

1 commit

  • When binding a PF_PACKET socket, the use count of the bound interface is
    always increased with dev_hold in dev_get_by_{index,name}. However,
    when rebound with the same protocol and device as in the previous bind
    the use count of the interface was not decreased. Ultimately, this
    caused the deletion of the interface to fail with the following message:

    unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

    This patch moves the dev_put out of the conditional part that was only
    executed when either the protocol or device changed on a bind.

    Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases')
    Signed-off-by: Lars Westerhoff
    Signed-off-by: Dan Carpenter
    Reviewed-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Lars Westerhoff
     

23 Jun, 2015

1 commit

  • Remove handling of tx_ring in prb_setup_retire_blk_timer
    for TPACKET_V3: init_prb_bdqc is called only for zero tx_ring, and
    thus prb_setup_retire_blk_timer is only ever called with zero tx_ring.

    The function init_prb_bdqc also makes no use of tx_ring, so remove
    the tx_ring argument from init_prb_bdqc as well.

    Signed-off-by: Maninder Singh
    Suggested-by: Frans Klaver
    Signed-off-by: David S. Miller

    Maninder Singh
     

22 Jun, 2015

3 commits

  • PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
    f->num_members. It returns the old value unconditionally, but
    f->num_members may have changed since the last store. Ensure
    that the return value is always < num.

    When modifying the logic, simplify it further by replacing the loop
    with an unconditional atomic increment.

    Fixes: dc99f600698d ("packet: Add fanout support.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
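
    A sketch of the simplified round-robin selection after this change
    (kernel-style, not the literal diff): one unconditional atomic increment,
    with the modulo applied to the value that is actually returned.

    static unsigned int fanout_demux_lb(struct packet_fanout *f,
                                        struct sk_buff *skb,
                                        unsigned int num)
    {
        /* always < num, even if f->num_members changed since last time */
        return atomic_inc_return(&f->rr_cur) % num;
    }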
     
  • Destruction of the po->rollover must be delayed until there are no
    more packets in flight that can access it. The field is destroyed in
    packet_release, before synchronize_net. Delay using rcu.

    Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • We need to tell the compiler it must not read f->num_members multiple
    times. Otherwise testing if num is not zero is flaky, and we could
    attempt an invalid divide by 0 in fanout_demux_cpu()

    Note bug was present in packet_rcv_fanout_hash() and
    packet_rcv_fanout_lb() but final 3.1 had a simple location
    after commit 95ec3eb417115fb ("packet: Add 'cpu' fanout policy.")

    Fixes: dc99f600698dc ("packet: Add fanout support.")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
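
    A fragment-level sketch of the pattern (the wrapper name is illustrative,
    not from the patch): read f->num_members exactly once so the zero test and
    the later modulo cannot observe two different values.

    static int pick_member(struct packet_fanout *f, struct sk_buff *skb)
    {
        unsigned int num = READ_ONCE(f->num_members);  /* read exactly once */

        if (!num)
            return -1;      /* no members: never divide by zero below */

        /* every demux helper, e.g. fanout_demux_cpu(), now divides by
         * this same stable value */
        return fanout_demux_cpu(f, skb, num);
    }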
     

18 May, 2015

1 commit

  • Rollover can be enabled as flag or mode. Allocate state in both cases.
    This solves a NULL pointer exception in fanout_demux_rollover on
    referencing po->rollover if using mode rollover.

    Also make sure that in rollover mode each silo is tried (contrary
    to rollover flag, where the main socket is excluded after an initial
    try_self).

    Tested:
    Passes tools/testing/net/psock_fanout.c, which tests both modes and
    flag. My previous tests were limited to bench_rollover, which only
    stresses the flag. The test now completes safely. It still gives an
    error for mode rollover, because it does not expect the new headroom
    (ROOM_NORMAL) requirement. I will send a separate patch to the test.

    Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

    Signed-off-by: Willem de Bruijn

    ----

    I should have run this test and caught this before submission, of
    course. Apologies for the oversight.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

15 May, 2015

1 commit

  • Avoid two xchg calls whose return values were unused, causing a
    warning on some architectures.

    The relevant variable is a hint and read without mutual exclusion.
    This fix makes all writers hold the receive_queue lock.

    Suggested-by: David S. Miller
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

14 May, 2015

4 commits

  • Rollover indicates exceptional conditions. Export a counter to inform
    socket owners of this state.

    If no socket with sufficient room is found, rollover fails. Also count
    these events.

    Finally, also count when flows are rolled over early thanks to huge
    flow detection, to validate its correctness.

    Tested:
    Read counters in bench_rollover on all other tests in the patchset

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Migrate flows from a socket to another socket in the fanout group not
    only when the socket is full. Start migrating huge flows early, to
    divert possible 4-tuple attacks without affecting normal traffic.

    Introduce fanout_flow_is_huge(). This detects huge flows, which are
    defined as taking up more than half the load. It does so cheaply, by
    storing the rxhashes of the N most recent packets. If over half of
    these are the same rxhash as the current packet, then drop it. This
    only protects against 4-tuple attacks. N is chosen to fit all data in
    a single cache line.

    Tested:
    Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
    cpu rx rx.k drop.k rollover r.huge r.failed
    0 14 14 0 0 0 0
    1 20 20 0 0 0 0
    2 16 16 0 0 0 0
    3 6168824 6168824 0 4867721 4867721 0
    4 4867741 4867741 0 0 0 0
    5 12 12 0 0 0 0
    6 15 15 0 0 0 0
    7 17 17 0 0 0 0

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
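
    A hedged sketch of the detection logic described above, close to but not
    necessarily identical with the merged code; ROLLOVER_HLEN is the history
    length, chosen so the array fits in a single cache line.

    /* Count how many of the ROLLOVER_HLEN recorded rxhashes match this
     * skb's hash; if more than half do, the flow is "huge".  One history
     * slot is overwritten per call, a cheap approximation of tracking
     * the N most recent packets. */
    static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
    {
        u32 rxhash = skb_get_hash(skb);
        int i, count = 0;

        for (i = 0; i < ROLLOVER_HLEN; i++)
            if (po->rollover->history[i] == rxhash)
                count++;

        po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;
        return count > ROLLOVER_HLEN / 2;
    }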
     
  • Rollover has to call packet_rcv_has_room on sockets in the fanout
    group to find a socket to migrate to. This operation is expensive
    especially if the packet sockets use rings, when a lock has to be
    acquired.

    Avoid pounding on the lock by all sockets by temporarily marking a
    socket as "under memory pressure" when such pressure is detected.
    While set, only the socket owner may call packet_rcv_has_room on the
    socket. Once it detects normal conditions, it clears the flag. The
    socket is not used as a victim by any other socket in the meantime.

    Under reasonably balanced load, each socket writer frequently calls
    packet_rcv_has_room and clears its own pressure field. As a backup
    for when the socket is rarely written to, also clear the flag on
    reading (packet_recvmsg, packet_poll) if this can be done cheaply
    (i.e., without calling packet_rcv_has_room). This is only for
    edge cases.

    Tested:
    Ran bench_rollover: a process with 8 sockets in a single fanout
    group, each pinned to a single cpu that receives one nic recv
    interrupt. RPS and RFS are disabled. The benchmark uses packet
    rx_ring, which has to take a lock when determining whether a
    socket has room.

    Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
    uniformly across the packet sockets (and inserted an iptables
    rule to drop in PREROUTING to avoid protocol stack processing).

    Without this patch, all sockets try to migrate traffic to
    neighbors, causing lock contention when searching for a non-
    empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock
    - _raw_spin_lock
    - 99.00% spin_lock
    + 81.77% packet_rcv_has_room.isra.41
    + 18.23% tpacket_rcv
    + 0.84% packet_rcv_has_room.isra.41
    + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock
    + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41

    On net-next with this patch, this lock contention is no longer a
    top entry. Most time is spent in the actual read function. Next up
    are other locks:

    + 15.52% bench_rollover bench_rollover [.] reader
    + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms
    + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51
    + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms
    + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv
    + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq

    Looking closer at the remaining _raw_spin_lock, the cost of probing
    in rollover is now comparable to the cost of taking the lock later
    in tpacket_rcv.

    - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock
    - _raw_spin_lock
    + 33.41% packet_rcv_has_room
    + 28.15% tpacket_rcv
    + 19.54% enqueue_to_backlog
    + 6.45% __free_pages_ok
    + 2.78% packet_rcv_fanout
    + 2.13% fanout_demux_rollover
    + 2.01% netif_receive_skb_internal

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Only migrate flows to sockets that have sufficient headroom, where
    sufficient is defined as having at least 25% empty space.

    The kernel has three different buffer types: a regular socket, a ring
    with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
    latter two do not expose a read pointer to the kernel, so headroom is
    not computed easily. All three need a different implementation to
    estimate free space.

    Tested:
    Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    bench_rollover has as many sockets as there are NIC receive queues
    in the system. Each socket is owned by a process that is pinned to
    one of the receive cpus. RFS is disabled. RPS is enabled with an
    identity mapping (cpu x -> cpu x), to count drops with softnettop.

    lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
    Press [Enter] to exit

    cpu rx rx.k drop.k rollover r.huge r.failed
    0 16 16 0 0 0 0
    1 21 21 0 0 0 0
    2 5227502 5227502 0 0 0 0
    3 18 18 0 0 0 0
    4 6083289 6083289 0 5227496 0 0
    5 22 22 0 0 0 0
    6 21 21 0 0 0 0
    7 9 9 0 0 0 0

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn