18 Aug, 2015

1 commit

  • Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
    to select a socket.

    This avoids having to keep adding special-case fanout modes. One
    example use case is application-layer load balancing. The QUIC
    protocol, for instance, encodes a connection ID in the UDP payload.

    Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
    associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
    only user so far.

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
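    A minimal userspace sketch of the interface described above; the
    filter program, payload offset, and group id are illustrative
    assumptions, not part of the commit. The cBPF program's return value
    selects a socket within the fanout group:

        #include <stdint.h>
        #include <sys/socket.h>
        #include <linux/filter.h>
        #include <linux/if_packet.h>

        /* Requires headers that define PACKET_FANOUT_CBPF and
         * PACKET_FANOUT_DATA (introduced by this commit). */
        static int join_cbpf_fanout(int fd, uint16_t group_id)
        {
            /* Toy program: steer on one payload byte (offset 42 assumes
             * Ethernet + IPv4 + UDP), picking socket 0 or 1. */
            struct sock_filter code[] = {
                { BPF_LD  | BPF_B   | BPF_ABS, 0, 0, 42 },
                { BPF_ALU | BPF_AND | BPF_K,   0, 0, 1  },
                { BPF_RET | BPF_A,             0, 0, 0  },
            };
            struct sock_fprog prog = {
                .len = sizeof(code) / sizeof(code[0]),
                .filter = code,
            };
            int fanout_arg = group_id | (PACKET_FANOUT_CBPF << 16);

            if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                           &fanout_arg, sizeof(fanout_arg)))
                return -1;
            /* The program can also be swapped later via the same option. */
            return setsockopt(fd, SOL_PACKET, PACKET_FANOUT_DATA,
                              &prog, sizeof(prog));
        }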
     

22 Jun, 2015

1 commit

  • Destruction of po->rollover must be delayed until no packets that
    can still access it are in flight. The field is currently destroyed
    in packet_release, before synchronize_net. Delay the free using RCU.

    Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
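    A generic kernel-style sketch of the rcu-delayed free pattern the fix
    relies on; the structure and function names here are illustrative,
    not the actual patch:

        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        /* Illustrative stand-in for the per-socket rollover state. */
        struct rollover_state {
            unsigned long last_used;    /* example payload */
            struct rcu_head rcu;        /* needed by kfree_rcu() */
        };

        /* Readers only dereference the state inside rcu_read_lock()
         * sections, so reclaim the memory after a grace period instead
         * of freeing it immediately in the release path. */
        static void rollover_state_release(struct rollover_state *ro)
        {
            kfree_rcu(ro, rcu);
        }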
     

14 May, 2015

4 commits

  • Rollover indicates exceptional conditions. Export a counter to inform
    socket owners of this state.

    If no socket with sufficient room is found, rollover fails. Also count
    these events.

    Finally, also count when flows are rolled over early thanks to huge
    flow detection, to validate its correctness.

    Tested:
    Read the counters in bench_rollover while running the other tests
    in the patchset.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
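    A sketch of reading the new counters from userspace, assuming the
    PACKET_ROLLOVER_STATS getsockopt and struct tpacket_rollover_stats
    from this series are available in the installed headers:

        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/if_packet.h>

        static void print_rollover_stats(int fd)
        {
            struct tpacket_rollover_stats rstats;
            socklen_t len = sizeof(rstats);

            if (getsockopt(fd, SOL_PACKET, PACKET_ROLLOVER_STATS,
                           &rstats, &len))
                perror("PACKET_ROLLOVER_STATS");
            else
                printf("rollover=%llu huge=%llu failed=%llu\n",
                       (unsigned long long)rstats.tp_all,
                       (unsigned long long)rstats.tp_huge,
                       (unsigned long long)rstats.tp_failed);
        }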
     
  • Migrate flows from one socket to another socket in the fanout group
    not only when the socket is full. Start migrating huge flows early,
    to divert possible 4-tuple attacks without affecting normal traffic.

    Introduce fanout_flow_is_huge(). This detects huge flows, which are
    defined as taking up more than half the load. It does so cheaply, by
    storing the rxhashes of the N most recent packets. If over half of
    these match the rxhash of the current packet, the flow is considered
    huge and is rolled over early. This only protects against 4-tuple
    attacks. N is chosen to fit all data in a single cache line.

    Tested:
    Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
    cpu         rx       rx.k   drop.k   rollover     r.huge   r.failed
      0         14         14        0          0          0          0
      1         20         20        0          0          0          0
      2         16         16        0          0          0          0
      3    6168824    6168824        0    4867721    4867721          0
      4    4867741    4867741        0          0          0          0
      5         12         12        0          0          0          0
      6         15         15        0          0          0          0
      7         17         17        0          0          0          0

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
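    An illustrative sketch of the detection idea (not the kernel's
    fanout_flow_is_huge()): remember the rxhashes of the N most recent
    packets and flag a flow as huge once more than half of that history
    matches. The history length below is an assumption:

        #include <stdbool.h>
        #include <stdint.h>

        #define HIST_LEN 16    /* illustrative; sized to fit a cache line */

        struct flow_history {
            uint32_t hist[HIST_LEN];
            unsigned int idx;
        };

        /* Count how often rxhash appears in the recent history, then
         * record it; a flow is "huge" once it exceeds half the window. */
        static bool flow_is_huge(struct flow_history *h, uint32_t rxhash)
        {
            unsigned int i, count = 0;

            for (i = 0; i < HIST_LEN; i++)
                if (h->hist[i] == rxhash)
                    count++;

            h->hist[h->idx] = rxhash;
            if (++h->idx == HIST_LEN)
                h->idx = 0;

            return count > HIST_LEN / 2;
        }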
     
  • Rollover has to call packet_rcv_has_room on sockets in the fanout
    group to find a socket to migrate to. This operation is expensive
    especially if the packet sockets use rings, when a lock has to be
    acquired.

    Avoid all sockets pounding on the lock by temporarily marking a
    socket as "under memory pressure" when such pressure is detected.
    While the flag is set, only the socket owner may call
    packet_rcv_has_room on the socket. Once the owner detects normal
    conditions, it clears the flag. The socket is not used as a victim
    by any other socket in the meantime.

    Under reasonably balanced load, each socket writer frequently calls
    packet_rcv_has_room and clears its own pressure field. As a backup
    for when the socket is rarely written to, also clear the flag on
    reading (packet_recvmsg, packet_poll) if this can be done cheaply
    (i.e., without calling packet_rcv_has_room). This is only for
    edge cases.

    Tested:
    Ran bench_rollover: a process with 8 sockets in a single fanout
    group, each pinned to a single cpu that receives one nic recv
    interrupt. RPS and RFS are disabled. The benchmark uses packet
    rx_ring, which has to take a lock when determining whether a
    socket has room.

    Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
    uniformly across the packet sockets (and inserted an iptables
    rule to drop in PREROUTING to avoid protocol stack processing).

    Without this patch, all sockets try to migrate traffic to
    neighbors, causing lock contention when searching for a non-empty
    neighbor. The lock accounts for the top 9 entries in the profile.

    perf record -a -g sleep 5

    - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
             + 81.77% packet_rcv_has_room.isra.41
             + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock
    + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock
    + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock
    + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41

    On net-next with this patch, this lock contention is no longer a
    top entry. Most time is spent in the actual read function. Next up
    are other locks:

    + 15.52% bench_rollover bench_rollover [.] reader
    + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms
    + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51
    + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms
    + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv
    + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq

    Looking closer at the remaining _raw_spin_lock, the cost of probing
    in rollover is now comparable to the cost of taking the lock later
    in tpacket_rcv.

    - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
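    A simplified sketch of the pressure-flag idea with illustrative
    names, not the actual af_packet code: the owner publishes its own
    ring state cheaply, and the rollover victim search merely peeks at
    the flag instead of taking the ring lock:

        #include <linux/compiler.h>
        #include <linux/types.h>

        struct psock {
            unsigned int pressure;    /* set while the ring had no room */
            /* ... rings, locks, ... */
        };

        /* Only the socket owner updates its own flag, from paths that
         * already know the ring occupancy (rcv, recvmsg, poll). */
        static void psock_update_pressure(struct psock *po, bool has_room)
        {
            WRITE_ONCE(po->pressure, !has_room);
        }

        /* Other sockets searching for a rollover victim only read the
         * flag, so they never contend on a pressured socket's ring lock. */
        static bool psock_maybe_has_room(const struct psock *po)
        {
            return !READ_ONCE(po->pressure);
        }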
     
  • Replace rollover state per fanout group with state per socket. Future
    patches will add fields to the new structure.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

13 Mar, 2015

1 commit

  • Having to say
    > #ifdef CONFIG_NET_NS
    > struct net *net;
    > #endif

    in structures is a little bit wordy and a little bit error prone.

    Instead it is possible to say:
    > typedef struct {
    > #ifdef CONFIG_NET_NS
    > struct net *net;
    > #endif
    > } possible_net_t;

    And then in a header say:

    > possible_net_t net;

    This is cleaner and easier to use and to test, as possible_net_t is
    always there no matter what the compile options are.

    Further, this allows read_pnet and write_pnet to be functions in all
    cases, which is better at catching typos.

    This change adds possible_net_t, updates the definitions of read_pnet
    and write_pnet, updates the optional struct net * variables that
    write_pnet operates on to have the type possible_net_t, and finally
    fixes up the b0rked users of read_pnet and write_pnet.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric W. Biederman
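    For reference, the resulting helpers look roughly like the sketch
    below (see include/net/net_namespace.h for the authoritative
    version):

        typedef struct {
        #ifdef CONFIG_NET_NS
            struct net *net;
        #endif
        } possible_net_t;

        static inline void write_pnet(possible_net_t *pnet, struct net *net)
        {
        #ifdef CONFIG_NET_NS
            pnet->net = net;
        #endif
        }

        static inline struct net *read_pnet(const possible_net_t *pnet)
        {
        #ifdef CONFIG_NET_NS
            return pnet->net;
        #else
            return &init_net;
        #endif
        }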
     

22 Aug, 2014

1 commit

  • af_packet can currently overwrite kernel memory through out-of-bounds
    accesses, because it assumed a [new] block can always hold one frame.

    This is not generally the case, even if most existing tools do it
    right.

    This patch clamps too-long frames as the API permits, and issues a
    one-time error on syslog.

    [ 394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82

    In this example, the packet header tp_snaplen was set to 3966,
    and tp_len was set to 5042 (skb->len).

    Signed-off-by: Eric Dumazet
    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Acked-by: Daniel Borkmann
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
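    A simplified sketch of the clamping idea in the receive path
    (illustrative, not the exact patch): cap the copy length to what the
    ring can actually hold at the computed offset, and warn once:

        #include <linux/printk.h>

        static unsigned int clamp_snaplen(unsigned int snaplen,
                                          unsigned int macoff,
                                          unsigned int max_frame_len)
        {
            if (macoff + snaplen > max_frame_len) {
                unsigned int nval = max_frame_len - macoff;

                pr_err_once("packet too big, clamped from %u to %u. macoff=%u\n",
                            snaplen, nval, macoff);
                snaplen = nval;
            }
            return snaplen;
        }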
     

17 Jan, 2014

1 commit

  • In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
    and one atomic_dec() call in the skb destructor and instead use a
    percpu reference count to determine if packets are still pending to
    be sent out. Micro-benchmark with [1], slightly modified (that is,
    protocol = 0 in socket(2) and bind(2)), on a rather crappy testing
    machine; I expect it to scale and have even better results on bigger
    machines:

    ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

    With patch: 4,022,015 cyc
    Without patch: 4,812,994 cyc

    time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

    With patch:
    real 1m32.241s
    user 0m0.287s
    sys 1m29.316s

    Without patch:
    real 1m38.386s
    user 0m0.265s
    sys 1m35.572s

    In tpacket_snd(), it is okay to use packet_read_pending(), since on
    the fast path we already short-circuit the condition with ph != NULL
    when we have further frames to process. In case we have MSG_DONTWAIT,
    we also do not execute this path, as need_wait is false there anyway;
    and without MSG_DONTWAIT it is okay to call packet_read_pending(),
    because whenever we reach that path we are done processing outgoing
    frames and only check whether there are skbs still outstanding to be
    orphaned. We can stay lockless on this percpu counter, since it is
    acceptable for the sum to be transiently imprecise when we reach this
    path; it levels out at 0 once all pending frames have eventually
    reached the skb destructor through tx reclaim. When people pin a tx
    process to particular CPUs, we expect overflows to happen in the
    reference counter: on one CPU we expect a heavy increase, and a
    decrease distributed through ksoftirqd across all CPUs, for example.
    As David Laight points out, since the C language doesn't define the
    result of signed int overflow (i.e. rather than wrap, it is allowed
    to saturate as a possible outcome), we have to use unsigned int as
    the reference count. The sum over all CPUs when tx is complete will
    then be 0 again.

    The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
    can _only_ be set from inside tpacket_snd() path and we made sure
    to increase tx_ring.pending in any case before we called po->xmit(skb).
    So testing for tx_ring.pending == 0 is not too useful. Instead, it
    would rather have been useful to test if lower layers didn't orphan
    the skb so that we're missing ring slots being put back to
    TP_STATUS_AVAILABLE. But such a bug will be caught in user space
    already as we end up realizing that we do not have any
    TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

    Btw, in the case of the RX_RING path, we do not make use of the
    pending member, so we also don't need to use up any percpu memory
    there. Also note that __alloc_percpu() already returns a zero-filled
    percpu area, so initialization is already done.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
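    A sketch of the per-cpu pending scheme described above, with
    illustrative names that roughly mirror the helpers this patch
    introduces:

        #include <linux/cpumask.h>
        #include <linux/errno.h>
        #include <linux/percpu.h>

        struct tx_pending {
            unsigned int __percpu *refcnt;
        };

        /* alloc_percpu() returns a zeroed area, so no further init. */
        static int tx_pending_init(struct tx_pending *p)
        {
            p->refcnt = alloc_percpu(unsigned int);
            return p->refcnt ? 0 : -ENOMEM;
        }

        static void tx_pending_inc(struct tx_pending *p)
        {
            this_cpu_inc(*p->refcnt);
        }

        static void tx_pending_dec(struct tx_pending *p)
        {
            this_cpu_dec(*p->refcnt);
        }

        /* Lockless sum over all CPUs: transiently imprecise, but the
         * unsigned counters level out at 0 once all skbs are destructed. */
        static unsigned int tx_pending_read(const struct tx_pending *p)
        {
            unsigned int refcnt = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                refcnt += *per_cpu_ptr(p->refcnt, cpu);

            return refcnt;
        }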
     

10 Dec, 2013

1 commit

  • This patch introduces a PACKET_QDISC_BYPASS socket option that
    allows using an xmit() function similar to pktgen's instead of
    taking the dev_queue_xmit() path. This can be very useful when
    PF_PACKET applications need to be used in a pktgen-like scenario,
    but with a full, flexible packet payload that needs to be provided,
    for example.

    By default, nothing changes in behaviour for normal PF_PACKET TX
    users, so everything stays as is for applications. New users,
    however, can now set PACKET_QDISC_BYPASS if needed, to i) prevent
    their own packets from reentering packet_rcv() and ii) push frames
    directly to the driver.

    In doing so we can increase pps (here 64 byte packets) for
    PF_PACKET a bit:

    # CPUs  --   QDISC_BYPASS  --  qdisc path  --  qdisc path[**]
     1 CPU  ==   1,509,628 pps --   1,208,708  --    1,247,436
     2 CPUs ==   3,198,659 pps --   2,536,012  --    1,605,779
     3 CPUs ==   4,787,992 pps --   3,788,740  --    1,735,610
     4 CPUs ==   6,173,956 pps --   4,907,799  --    1,909,114
     5 CPUs ==   7,495,676 pps --   5,956,499  --    2,014,422
     6 CPUs ==   9,001,496 pps --   7,145,064  --    2,155,261
     7 CPUs ==  10,229,776 pps --   8,190,596  --    2,220,619
     8 CPUs ==  11,040,732 pps --   9,188,544  --    2,241,879
     9 CPUs ==  12,009,076 pps --  10,275,936  --    2,068,447
    10 CPUs ==  11,380,052 pps --  11,265,337  --    1,578,689
    11 CPUs ==  11,672,676 pps --  11,845,344  --    1,297,412
    [...]
    20 CPUs ==  11,363,192 pps --  11,014,933  --    1,245,081

    [**]: qdisc path with packet_rcv(), which is probably how most
    people use it (hopefully not anymore where not needed)

    The test was done using a modified trafgen, sending a simple
    static 64 byte packet on all CPUs. The trick in the fast
    "qdisc path" case is to avoid reentering packet_rcv() by
    setting the RAW socket protocol to zero, like:
    socket(PF_PACKET, SOCK_RAW, 0);

    Tradeoffs are documented in this patch as well: clearly, if queues
    are busy, we will drop more packets; tc disciplines are ignored; and
    these packets are no longer visible to taps. For a pktgen-like
    scenario, we argue that this is acceptable.

    The pointer to the xmit function has been placed in a packet socket
    structure hole between cached_dev and prot_hook, which is hot anyway,
    as we're working on cached_dev in each send path.

    Done in joint work together with Jesper Dangaard Brouer.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
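    A minimal userspace sketch of opting in to the bypass (error handling
    elided; assumes headers that define PACKET_QDISC_BYPASS):

        #include <sys/socket.h>
        #include <linux/if_packet.h>

        static int open_bypass_socket(void)
        {
            /* protocol 0: our own tx frames do not reenter packet_rcv() */
            int fd = socket(PF_PACKET, SOCK_RAW, 0);
            int one = 1;

            setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
                       &one, sizeof(one));
            return fd;
        }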
     

22 Nov, 2013

1 commit

  • Salam reported a use-after-free bug in PF_PACKET that occurs when
    we're sending out frames on a socket-bound device and the net device
    is suddenly unregistered. It appears that commit 827d9780
    introduced a possible race condition between {t,}packet_snd() and
    packet_notifier(). In the case of a bound socket, packet_notifier()
    can drop the last reference to the net_device and {t,}packet_snd()
    might end up suddenly sending a packet over a freed net_device.

    To avoid reverting 827d9780 and thus introducing a performance
    regression compared to the current state of things, we decided to
    hold a cached RCU protected pointer to the net device and maintain
    it on write side via bind spin_lock protected register_prot_hook()
    and __unregister_prot_hook() calls.

    In {t,}packet_snd() path, we access this pointer under rcu_read_lock
    through packet_cached_dev_get() that holds reference to the device
    to prevent it from being freed through packet_notifier() while
    we're in the send path. This is okay to do, as dev_put()/dev_hold()
    are per-cpu counters, so it should not be a performance issue. Also,
    the code simplifies a bit, as the need_rls_dev flag is no longer
    needed.

    Fixes: 827d978037d7 ("af-packet: Use existing netdev reference for bound sockets.")
    Reported-by: Salam Noureddine
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Salam Noureddine
    Cc: Ben Greear
    Cc: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
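    The send-path access pattern looks roughly like the sketch below
    (struct packet_sock is the private af_packet socket state, with
    cached_dev assumed to be annotated __rcu; treat this as a sketch,
    not necessarily the verbatim patch):

        #include <linux/netdevice.h>
        #include <linux/rcupdate.h>

        static struct net_device *packet_cached_dev_get(struct packet_sock *po)
        {
            struct net_device *dev;

            rcu_read_lock();
            dev = rcu_dereference(po->cached_dev);
            if (likely(dev))
                dev_hold(dev);    /* per-cpu refcount, cheap in the hot path */
            rcu_read_unlock();

            return dev;
        }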
     

25 Apr, 2013

2 commits

  • Currently, packet_sock has a struct tpacket_stats stats member for
    TPACKET_V1 and TPACKET_V2 statistics accounting. With TPACKET_V3,
    ``union tpacket_stats_u stats_u'' was introduced, but it holds only
    TPACKET_V3 statistics, and when copying to user space, TPACKET_V3
    does some hackery and also accesses tpacket_stats' stats, although
    everything could have been done within the union itself.

    Unify accounting within the tpacket_stats_u union so that we can
    remove 8 unnecessary bytes from packet_sock. Note that even if we
    switch to TPACKET_V3 and use the non-mmap(2)'ed path, this still
    works, since the union members share the same types and offsets that
    are exposed to user space.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
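    For reference, the unification relies on the v1/v2 and v3 statistics
    sharing their leading fields within the uapi union, roughly as
    sketched below (see include/uapi/linux/if_packet.h for the
    authoritative layout):

        struct tpacket_stats {
            unsigned int tp_packets;
            unsigned int tp_drops;
        };

        struct tpacket_stats_v3 {
            unsigned int tp_packets;
            unsigned int tp_drops;
            unsigned int tp_freeze_q_cnt;
        };

        /* Both layouts start with tp_packets/tp_drops at the same
         * offsets, so one union member can back all ring versions. */
        union tpacket_stats_u {
            struct tpacket_stats    stats1;
            struct tpacket_stats_v3 stats3;
        };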
     
  • There's a 4 byte hole in the packet_ring_buffer structure before
    prb_bdqc that can be filled with the 'pending' member, reducing the
    overall structure size from 224 bytes to 216 bytes. This also has
    the side effect that in struct packet_sock the two 4 byte holes
    after the embedded packet_ring_buffer members disappear, and
    overall, packet_sock shrinks by one cacheline:

    Before: size: 1344, cachelines: 21, members: 24
    After:  size: 1280, cachelines: 20, members: 24

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Mar, 2013

1 commit

  • Changes:
      v3->v2: rebase (no other changes)
              passes selftest
      v2->v1: read f->num_members only once
              fix bug: test rollover mode + flag

    Minimize packet drop in a fanout group. If one socket is full,
    roll over packets to another from the group. Maintain flow
    affinity during normal load using an rxhash fanout policy, while
    dispersing unexpected traffic storms that hit a single cpu, such
    as spoofed-source DoS flows. Rollover breaks affinity for flows
    arriving at saturated sockets during those conditions.

    The patch adds a fanout policy ROLLOVER that rotates between sockets,
    filling each socket before moving to the next. It also adds a fanout
    flag ROLLOVER. If passed along with any other fanout policy, the
    primary policy is applied until the chosen socket is full. Then,
    rollover selects another socket, to delay packet drop until the
    entire system is saturated.

    Probing sockets is not free. Selecting the last used socket, as
    rollover does, is a greedy approach that maximizes chance of
    success, at the cost of extreme load imbalance. In practice, with
    sufficiently long queues to absorb bursts, sockets are drained in
    parallel and load balance looks uniform in `top`.

    To avoid contention, the patch scales counters with the number of
    sockets and accesses them lock-free. Values are bounds-checked to
    ensure correctness.

    Tested using an application with 9 threads pinned to CPUs, one socket
    per thread, and sufficient busywork per packet operation to limit
    each thread to handling 32 Kpps. When sent a single 500 Kpps UDP
    stream, a FANOUT_CPU setup processes 32 Kpps in total without this
    patch and 270 Kpps with it. Tested with read() and with a packet
    ring (V1).

    Also, passes psock_fanout.c unit test added to selftests.

    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
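    A minimal sketch of joining such a fanout group from userspace (the
    group id is arbitrary; error handling elided):

        #include <stdint.h>
        #include <sys/socket.h>
        #include <linux/if_packet.h>

        /* Hash policy for flow affinity, plus rollover to a non-full
         * socket once the hashed socket saturates. */
        static int join_rollover_fanout(int fd, uint16_t group_id)
        {
            int fanout_arg = group_id |
                ((PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_ROLLOVER) << 16);

            return setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                              &fanout_arg, sizeof(fanout_arg));
        }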
     

08 Nov, 2012

1 commit

  • The tx data offset of the packet mmap tx ring used to be:
    (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))

    The problem is that, with a SOCK_RAW socket, the payload (14 bytes
    after the beginning of the user data) is misaligned.

    This patch lets the user give an offset for the tx data if desired.

    Set the socket option PACKET_TX_HAS_OFF to 1, then specify the offset
    in each frame of your tx ring: tp_net for SOCK_DGRAM, or tp_mac for
    SOCK_RAW.

    Signed-off-by: Paul Chavent
    Signed-off-by: David S. Miller

    Paul Chavent
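    A sketch of how a tx ring user opts in (abbreviated; the ring setup
    itself is omitted and the offset value is an assumption):

        #include <sys/socket.h>
        #include <linux/if_packet.h>

        /* Enable the option, then record the chosen data offset in each
         * frame header: tp_mac for SOCK_RAW, tp_net for SOCK_DGRAM. */
        static void enable_tx_has_off(int fd, struct tpacket2_hdr *hdr,
                                      unsigned short data_off)
        {
            int one = 1;

            setsockopt(fd, SOL_PACKET, PACKET_TX_HAS_OFF,
                       &one, sizeof(one));
            hdr->tp_mac = data_off;    /* SOCK_RAW case */
        }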
     

20 Aug, 2012

1 commit

  • The reported value is the same as the one reported by the FANOUT
    getsockopt option, but unlike it, an absent fanout setup results in
    an absent nlattr rather than an nlattr with a zero value. This is
    done because a zero fanout report could mean either no fanout, or a
    fanout with both id and type zero.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

15 Aug, 2012

1 commit