23 Dec, 2014

1 commit

  • Make TPACKET_V3 signal poll when a block is closed rather than for every
    packet. A side effect is that poll will also be signaled when the block
    retire timer expires, which didn't previously happen. The issue was
    visible when sending packets at a very low frequency, such that all
    blocks were retired before packets were received by TPACKET_V3, causing
    avoidable packet loss. The fix ensures that the signal is sent when
    blocks are closed, which covers both the normal path where a block fills
    up and the path where the timer expires. The case where a block is
    filled without moving to the next block (i.e. all blocks are full) will
    still cause poll to be signaled.
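
    A minimal userspace sketch of the scenario this affects: a TPACKET_V3
    RX ring with a retire timer, where the reader blocks in poll(). Ring
    sizes and the timeout below are illustrative, and error handling is
    omitted.

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int version = TPACKET_V3;
        struct tpacket_req3 req = {
            .tp_block_size     = 1 << 16,
            .tp_block_nr       = 4,
            .tp_frame_size     = 1 << 11,
            .tp_frame_nr       = ((1 << 16) / (1 << 11)) * 4,
            .tp_retire_blk_tov = 60,        /* block retire timer, in ms */
        };
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* With the fix, this wakes up whenever a block is closed, whether
         * it filled up or the retire timer expired; previously a block
         * retired by the timer did not wake the reader. */
        poll(&pfd, 1, -1);

        close(fd);
        return 0;
    }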

    Signed-off-by: Dan Collins
    Signed-off-by: David S. Miller

    Dan Collins
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals,
    including but not limited to: Scott Feldman, Jiri Pirko, Thomas Graf,
    John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli,
    and Roopa Prabhu.

    2) Start making the networking stack operate on IOV iterators instead
    of modifying iov objects in situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule() that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do GRO flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to reconcile the conflicting goals of handling
    bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU upon which a packet was
    received/steered, via SO_INCOMING_CPU (see the sketch after the
    commit list below). From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...
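
    For item 17, a small example of reading SO_INCOMING_CPU on a socket.
    The option name comes from the series above; the surrounding code and
    the fallback define are illustrative.

    #include <sys/socket.h>

    #ifndef SO_INCOMING_CPU
    #define SO_INCOMING_CPU 49      /* asm-generic value; illustrative */
    #endif

    /* Ask which CPU packets for this socket were last received/steered on. */
    static int incoming_cpu(int fd)
    {
        int cpu = -1;
        socklen_t len = sizeof(cpu);

        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
            return -1;
        return cpu;
    }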

    Linus Torvalds
     

10 Dec, 2014

1 commit

  • Note that the code _using_ ->msg_iter at that point will be very
    unhappy with anything other than unshifted iovec-backed iov_iter.
    We still need to convert users to proper primitives.

    Signed-off-by: Al Viro

    Al Viro
     

09 Dec, 2014

1 commit


30 Nov, 2014

1 commit


25 Nov, 2014

1 commit

  • af_packet produces lots of these:
    net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers)
    net/packet/af_packet.c:384:39: expected struct page [pure] *
    net/packet/af_packet.c:384:39: got struct page *

    This seems to be because sparse does not realize that __pure
    refers to the function, not the returned pointer.

    Tweak the code slightly to avoid the warning.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     

24 Nov, 2014

3 commits


22 Nov, 2014

1 commit

  • When sending packets out with PF_PACKET, SOCK_RAW, ensure that the
    packet is at least as long as the device's expected link layer header.
    This check already exists in tpacket_snd, but not in packet_snd.
    Also rate limit the warning in tpacket_snd.
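
    A hedged sketch of the kind of check this adds to packet_snd() (kernel
    fragment, simplified; not necessarily the literal patch):

    /* Refuse SOCK_RAW sends shorter than the device's link layer header;
     * tpacket_snd already had an equivalent check. */
    if (sock->type == SOCK_RAW && len < dev->hard_header_len) {
        err = -EINVAL;
        goto out_free;
    }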

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

06 Nov, 2014

1 commit

  • This encapsulates all of the skb_copy_datagram_iovec() callers
    with call argument signature "skb, offset, msghdr->msg_iov, length".

    When we move to iov_iters in the networking, the iov_iter object will
    sit in the msghdr.

    Having a helper like this means there will be fewer places to touch
    during that transformation.
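
    The helper is essentially a thin wrapper; roughly:

    static inline int skb_copy_datagram_msg(const struct sk_buff *from,
                                            int offset, struct msghdr *msg,
                                            int size)
    {
        return skb_copy_datagram_iovec(from, offset, msg->msg_iov, size);
    }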

    Based upon descriptions and patch from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Sep, 2014

2 commits


30 Aug, 2014

1 commit


25 Aug, 2014

1 commit


22 Aug, 2014

1 commit

  • af_packet can currently overwrite kernel memory via out-of-bounds
    accesses, because it assumed a [new] block can always hold one frame.

    This is not generally the case, even if most existing tools do it right.

    This patch clamps overly long frames as the API permits, and issues a
    one-time error to syslog.

    [ 394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82

    In this example, packet header tp_snaplen was set to 3966,
    and tp_len was set to 5042 (skb->len)
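
    A hedged sketch of the clamping in the TPACKET_V3 receive path, where
    max_frame_len stands for the per-block limit that the real code reads
    from the ring's block descriptor queue:

    if (unlikely(macoff + snaplen > max_frame_len)) {
        u32 nval = max_frame_len - macoff;

        pr_err_once("packet too big, clamped from %u to %u. macoff=%u\n",
                    snaplen, nval, macoff);
        snaplen = nval;
        if (unlikely((int)snaplen < 0)) {
            snaplen = 0;
            macoff = max_frame_len;
        }
    }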

    Signed-off-by: Eric Dumazet
    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Acked-by: Daniel Borkmann
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Jul, 2014

1 commit

  • No device driver will ever return an skb_shared_info structure with
    syststamp non-zero, so remove the branch that tests for this and
    optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.

    Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
    may refer to it.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

16 Jul, 2014

1 commit


25 Apr, 2014

2 commits

  • By passing a netlink socket to a more privileged executable and then
    fooling that executable into writing data that happens to be a valid
    netlink message to the socket, it is possible to make the privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they also verify
    that the opener of the socket had the desired permissions.
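
    A hedged example of the shape of the replacement; the helper names are
    from this series, the surrounding context is illustrative:

    /* Before: checks the capabilities of the current process only. */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;

    /* After: also verifies that the opener of the socket had the
     * capability, so a privileged writer cannot be tricked through a
     * passed-in netlink fd. */
    if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
        return -EPERM;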

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The permission check in sock_diag_put_filterinfo is wrong, and it is so
    far removed from its sources that it is not clear why it is wrong. Move
    the computation into packet_diag_dump and pass a bool with the result
    into sock_diag_filterinfo.

    This does not yet correct the capability check, but simply moves it to
    make it clear what is going on.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

23 Apr, 2014

1 commit


12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    to freed up memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback uses the
    length argument. Since nobody actually cared about its value, lots of
    call sites pass in arbitrary values such as '0' and even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.
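
    The change boils down to dropping the length argument; roughly:

    /* Before: the length may refer to an skb that was already consumed
     * and freed by the time the callback runs. */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    /* After: the callback takes only the socket. */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk);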

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Apr, 2014

2 commits

  • Currently, in packet_direct_xmit() we test the assigned netdevice queue
    for netif_xmit_frozen_or_stopped() before doing an ndo_start_xmit().

    This can have the side-effect that BQL enabled drivers which make use
    of netdev_tx_sent_queue() internally, set __QUEUE_STATE_STACK_XOFF from
    within the stack and would not fully fill the device's TX ring from
    packet sockets with PACKET_QDISC_BYPASS enabled.

    Instead, use a test without the BQL bit so that bursts can be absorbed
    into the NIC's TX ring. Fix and code suggested by Eric Dumazet, thanks!
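
    A hedged sketch of the resulting test in packet_direct_xmit(); the
    helper exists in the kernel, the context is simplified:

    /* Check only the driver/frozen state, not the BQL
     * __QUEUE_STATE_STACK_XOFF bit, before handing the skb down. */
    if (!netif_xmit_frozen_or_drv_stopped(txq))
        ret = ops->ndo_start_xmit(skb, dev);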

    Signed-off-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since commit 015f0688f57c ("net: net: add a core netdev->tx_dropped
    counter"), we can now account for TX drops from within the core
    stack instead of drivers.

    Therefore, fix packet_direct_xmit() to increase the drop count when we
    encounter a problem before the driver's xmit function is called (we do
    not want to account for the drop twice).
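
    Roughly, the error path becomes (sketch):

    drop:
        atomic_long_inc(&dev->tx_dropped);
        kfree_skb(skb);
        return NET_XMIT_DROP;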

    Suggested-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

29 Mar, 2014

1 commit

  • Quite often it can be useful to test with dummy or similar
    devices as a blackhole sink for skbs. Such devices are only
    equipped with a single txq, but marked as NETIF_F_LLTX as
    they do not require locking their internal queues on xmit
    (or implement locking themselves). Therefore, use the
    HARD_TX_{UN,}LOCK API instead, so that NETIF_F_LLTX will be respected.
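
    A hedged sketch of the locking pattern in the direct xmit path;
    HARD_TX_{UN,}LOCK skips the txq lock for NETIF_F_LLTX devices:

    local_bh_disable();

    HARD_TX_LOCK(dev, txq, smp_processor_id());
    if (!netif_xmit_frozen_or_drv_stopped(txq))
        ret = ops->ndo_start_xmit(skb, dev);
    HARD_TX_UNLOCK(dev, txq);

    local_bh_enable();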

    trafgen mmap/TX_RING example against dummy device with config
    foo: { fill(0xff, 64) } results in the following performance
    improvements for such scenarios on an ordinary Core i7/2.80GHz:

    Before:

    Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

    160,975,944,159 instructions:k # 0.55 insns per cycle ( +- 0.09% )
    293,319,390,278 cycles:k # 0.000 GHz ( +- 0.35% )
    192,501,104 branch-misses:k ( +- 1.63% )
    831 context-switches:k ( +- 9.18% )
    7 cpu-migrations:k ( +- 7.40% )
    69,382 cache-misses:k # 0.010 % of all cache refs ( +- 2.18% )
    671,552,021 cache-references:k ( +- 1.29% )

    22.856401569 seconds time elapsed ( +- 0.33% )

    After:

    Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

    133,788,739,692 instructions:k # 0.92 insns per cycle ( +- 0.06% )
    145,853,213,256 cycles:k # 0.000 GHz ( +- 0.17% )
    59,867,100 branch-misses:k ( +- 4.72% )
    384 context-switches:k ( +- 3.76% )
    6 cpu-migrations:k ( +- 6.28% )
    70,304 cache-misses:k # 0.077 % of all cache refs ( +- 1.73% )
    90,879,408 cache-references:k ( +- 1.35% )

    11.719372413 seconds time elapsed ( +- 0.24% )

    Signed-off-by: Daniel Borkmann
    Cc: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

27 Mar, 2014

1 commit

  • The packet hash can be considered a property of the packet itself, not
    just something for the RX path.

    This patch renames the rxhash and l4_rxhash skbuff fields to hash and
    l4_hash respectively. This includes changing uses of the fields in
    code which doesn't call the access functions.

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Cc: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Mar, 2014

1 commit

  • Commit 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes
    for VLAN packets.") added the possibility for non-mmaped frames to
    send an extra 4 bytes for the VLAN header, so the MTU increases from
    1500 to 1504 bytes, for example.

    Commit cbd89acb9eb2 ("af_packet: fix for sending VLAN frames via
    packet_mmap") attempted to fix that for the mmap part but was
    reverted as it caused regressions while using eth_type_trans()
    on output path.

    Let's just act analogously to 57f89bfa2140 and add similar logic
    to TX_RING. We presume size_max to be overcharged by +4 bytes and,
    later on, after the skb has been built by tpacket_fill_skb(), check
    for an ETH_P_8021Q header on packets larger than the normal MTU. This
    can easily be reproduced with a slightly modified trafgen in mmap(2)
    mode; test cases:

    { fill(0xff, 12) const16(0x8100) fill(0xff, ) }
    { fill(0xff, 12) const16(0x0806) fill(0xff, ) }

    Note that we need to do the test right after tpacket_fill_skb(), as
    sockets can have PACKET_LOSS set, in which case we would not fail but
    instead just continue to traverse the ring.
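
    A hedged sketch of the post-fill check described above (simplified):

    if (likely(tp_len >= 0) &&
        tp_len > dev->mtu + dev->hard_header_len) {
        struct ethhdr *ehdr;

        /* size_max was overcharged by +4 bytes; only frames that really
         * carry an 802.1Q header may use that slack. */
        skb_reset_mac_header(skb);
        ehdr = eth_hdr(skb);
        if (ehdr->h_proto != htons(ETH_P_8021Q))
            tp_len = -EMSGSIZE;
    }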

    Reported-by: Mathias Kretschmer
    Signed-off-by: Daniel Borkmann
    Cc: Ben Greear
    Cc: Phil Sutter
    Tested-by: Mathias Kretschmer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 Feb, 2014

1 commit


17 Feb, 2014

1 commit

  • Mathias reported that on an AMD Geode LX embedded board (ALiX)
    with the ath9k driver, PACKET_QDISC_BYPASS, introduced in commit
    d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket
    option"), triggers a WARN_ON() coming from the driver itself
    via 066dae93bdf ("ath9k: rework tx queue selection and fix
    queue stopping/waking").

    The reason this happened is that the ndo_select_queue() callback
    is not invoked from the direct xmit path, i.e. for the ieee80211
    subsystem, which sets the queue and TID (similar to an 802.1d tag)
    that are put into the frame through 802.11e (WMM, QoS). If that is
    not set, the pending frame counter for e.g. ath9k can get messed up.

    So the WARN_ON() in ath9k is absolutely legitimate. Generally,
    the hw queue selection in ieee80211 depends on the type of
    traffic, and priorities are set according to ieee80211_ac_numbers
    mapping; working in a similar way to DiffServ, only on a lower
    layer, so that the AP can favour frames that have "real-time"
    requirements like voice or video data frames.

    Therefore, check for presence of ndo_select_queue() in netdev
    ops and, if available, invoke it with a fallback handler to
    __packet_pick_tx_queue(), so that drivers such as bnx2x, ixgbe,
    or mlx4 can still select a hw queue for transmission in relation
    to the current CPU, while e.g. the ieee80211 subsystem can make
    its own choices.
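
    Roughly, the queue selection becomes the following; the fallback
    argument matches the ndo_select_queue() signature of that time:

    static u16 packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        u16 queue_index;

        if (ops->ndo_select_queue) {
            /* Let e.g. ieee80211 pick the queue/TID itself, with the
             * CPU-based default as fallback for other drivers. */
            queue_index = ops->ndo_select_queue(dev, skb, NULL,
                                                __packet_pick_tx_queue);
            queue_index = netdev_cap_txqueue(dev, queue_index);
        } else {
            queue_index = __packet_pick_tx_queue(dev, skb);
        }

        return queue_index;
    }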

    Reported-by: Mathias Kretschmer
    Signed-off-by: Daniel Borkmann
    Cc: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 Jan, 2014

1 commit

  • This patch adds a queue mapping mode to the fanout operation of af_packet
    sockets. This allows user space af_packet users to better filter on flows
    ingressing and egressing via a specific hardware queue, and avoids the potential
    packet reordering that can occur when FANOUT_CPU is being used and irq affinity
    varies.

    Tested successfully by myself. Applies to net-next.
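
    A hedged userspace sketch of joining such a fanout group; the fanout
    argument packs the group id in the low 16 bits and the mode in the
    high 16 bits:

    #include <linux/if_packet.h>
    #include <sys/socket.h>

    /* Spread packets across group members by hardware receive queue. */
    static int join_qm_fanout(int fd, unsigned int group_id)
    {
        unsigned int arg = (group_id & 0xffff) | (PACKET_FANOUT_QM << 16);

        return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
    }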

    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

22 Jan, 2014

3 commits

  • As David Laight suggests, we shouldn't necessarily call this
    reciprocal_divide() when users didn't request a reciprocal_value();
    let's keep the basic idea and call it reciprocal_scale(). More
    background information on this topic can be found in [1].

    Joint work with Hannes Frederic Sowa.

    [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

    Suggested-by: David Laight
    Cc: Jakub Zawadzki
    Cc: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Many functions have open coded a function that returns a random
    number in range [0,N-1]. Under the assumption that we have a PRNG
    such as taus113 that is well distributed in [0, ~0U] space,
    we can implement such a function as u32 t = (n*m')>>32, where
    m' is a random number obtained from the PRNG, n the right open
    interval border and t our resulting random number, with n, m', t
    in the u32 universe.

    Let's go with Joe and simply call it prandom_u32_max(), although
    technically we have a right open interval endpoint, but that we
    have documented. Other users can be migrated to the new
    prandom_u32_max() function later on; for now, we need to make sure
    to migrate the reciprocal_divide() users for the reciprocal_divide()
    follow-up fixup since their function signatures are going to change.
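
    The helper amounts to the formula above; roughly:

    static inline u32 prandom_u32_max(u32 ep_ro)
    {
        return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
    }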

    Joint work with Hannes Frederic Sowa.

    Cc: Jakub Zawadzki
    Cc: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Doesn't bring much, but also doesn't hurt us to fix 'em:

    1) In tpacket_rcv()'s flush dcache page code, we can restrict the
    scope of start and end and remove one layer of indentation.

    2) In tpacket_destruct_skb() we can restrict the scope of ph.

    3) In alloc_one_pg_vec_page() we can remove the NULL assignment
    and change spacing a bit.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 Jan, 2014

1 commit

  • This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
    handler msg_name and msg_namelen logic").

    DECLARE_SOCKADDR validates that the structure we use for writing the
    name information to is not larger than the buffer which is reserved
    for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
    consistently in sendmsg code paths.
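
    A hedged example of the usage pattern in a recvmsg handler, here with
    af_packet's sockaddr_ll as the target type (context simplified):

    /* The macro checks at build time that the target type fits into the
     * 128-byte (sockaddr_storage sized) msg_name buffer. */
    DECLARE_SOCKADDR(struct sockaddr_ll *, sll, msg->msg_name);

    sll->sll_family   = AF_PACKET;
    sll->sll_protocol = skb->protocol;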

    Signed-off-by: Steffen Hurrle
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Steffen Hurrle
     

17 Jan, 2014

3 commits

  • In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
    and one atomic_dec() call in the skb destructor and use a percpu
    reference count instead in order to determine if packets are
    still pending to be sent out. Micro-benchmark with [1] that has
    been slightly modified (that is, protocol = 0 in socket(2) and
    bind(2)), example on a rather crappy testing machine; I expect
    it to scale and have even better results on bigger machines:

    ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

    With patch: 4,022,015 cyc
    Without patch: 4,812,994 cyc

    time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

    With patch:
    real 1m32.241s
    user 0m0.287s
    sys 1m29.316s

    Without patch:
    real 1m38.386s
    user 0m0.265s
    sys 1m35.572s

    In function tpacket_snd(), it is okay to use packet_read_pending()
    since in the fast path we short-circuit the condition already with
    ph != NULL, as we have further frames to process. In case we have
    MSG_DONTWAIT, we also do not execute this path as need_wait is
    false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
    okay to call packet_read_pending(), because whenever we reach
    that path, we're done processing outgoing frames anyway and only
    check whether there are skbs still outstanding to be orphaned. We
    can stay lockless in this percpu counter since it's acceptable for
    the sum to be imprecise when we reach this path, but we'll level
    out at 0 after all pending frames have reached the skb destructor
    eventually through tx reclaim. When people pin a tx process to
    particular CPUs, we expect overflows to happen in the per-CPU
    reference counters, since one CPU sees mostly increments while
    decrements are distributed across all CPUs through ksoftirqd, for
    example. As David Laight points out, since the C language doesn't
    define the result of signed int overflow (i.e. rather than wrap,
    it is allowed to saturate as a possible outcome), we have to use
    an unsigned int as the reference count. The sum over all CPUs when
    tx is complete will result in 0 again.

    We can remove the BUG_ON() in tpacket_destruct_skb() as well. It
    can _only_ be set from inside the tpacket_snd() path, and we made
    sure to increase tx_ring.pending in any case before we called
    po->xmit(skb). So testing for tx_ring.pending == 0 is not too
    useful. Instead, it would have been more useful to test whether
    lower layers failed to orphan the skb, so that we're missing ring
    slots being put back to
    TP_STATUS_AVAILABLE. But such a bug will be caught in user space
    already as we end up realizing that we do not have any
    TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

    Btw, in case of RX_RING path, we do not make use of the pending
    member, therefore we also don't need to use up any percpu memory
    here. Also note that __alloc_percpu() already returns a zero-filled
    percpu area, so initialization is done already.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
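
    A trimmed sketch of the resulting helpers; the real code hangs
    pending_refcnt off the ring buffer and leaves it NULL for RX rings:

    static void packet_inc_pending(struct packet_ring_buffer *rb)
    {
        this_cpu_inc(*rb->pending_refcnt);
    }

    static void packet_dec_pending(struct packet_ring_buffer *rb)
    {
        this_cpu_dec(*rb->pending_refcnt);
    }

    static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
    {
        unsigned int refcnt = 0;
        int cpu;

        if (rb->pending_refcnt == NULL)         /* RX ring: unused */
            return 0;

        for_each_possible_cpu(cpu)
            refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);

        return refcnt;
    }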

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In tpacket_snd(), when we've discovered a first frame that is
    not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
    we exit the send routine in case of MSG_DONTWAIT, since we've
    finished traversing the mmaped send ring buffer and don't care
    about pending frames.

    While doing so, we still unconditionally call an expensive
    schedule() in the packet_current_frame() "error" path, which
    is unnecessary in this case since it's enough to just quit
    the function.

    Also, in case MSG_DONTWAIT is not set, we should rather test
    for need_resched() first and call schedule() only if necessary,
    since meanwhile pending frames could already have finished
    processing and called the skb destructor.
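
    The resulting loop body, roughly (sketch):

    if (unlikely(ph == NULL)) {
        /* With MSG_DONTWAIT (need_wait == false) just move on; otherwise
         * only schedule() when a reschedule is actually due. */
        if (need_wait && need_resched())
            schedule();
        continue;
    }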

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Most people acquire PF_PACKET sockets with a protocol argument in
    the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
    all its sockets. Most likely, at some point in time a subsequent
    bind() call will follow, e.g. in libpcap with ...

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);

    ... as arguments. What happens in the kernel is that already
    in socket() syscall, we install a proto hook via register_prot_hook()
    if our protocol argument is != 0. Yet, in bind() we do almost the
    same work again: an unregister_prot_hook() with an expensive
    synchronize_net() call in case the proto was != 0 during socket(),
    plus a follow-up register_prot_hook(), this time with a device bound
    to it, in order to limit the traffic we get.

    In the case where the protocol and the user-supplied device index
    (== 0) do not change from socket() to bind(), we can spare ourselves
    doing the same work twice. Similarly for re-binding to the same device
    and protocol. For these scenarios, we can decrease create/bind
    latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
    with this patch.

    Alternatively, for the first case, if people care, they should
    simply create their sockets with proto == 0 argument and define
    the protocol during bind() as this saves a call to synchronize_net()
    as well (sock-bind-3 case).

    In all other cases, we're tied to user space behaviour we must not
    change, also since a bind() is not strictly required. Thus, we need
    the synchronize_net() to make sure no asynchronous packet processing
    paths still refer to the previous elements of po->prot_hook.

    In case of mmap()ed sockets, the workflow that includes bind() is
    socket() -> setsockopt() -> bind(). In that case, a pair of
    {__unregister, register}_prot_hook is being called from setsockopt()
    in order to install the new protocol receive handler. Thus, when
    we call bind and can skip a re-hook, we have already previously
    installed the new handler. For fanout, this is handled entirely
    differently, so we should be good.

    Timings on an i7-3520M machine:

    * sock-bind-1: 89 us
    * sock-bind-2: 7447 us
    * sock-bind-3: 75 us

    sock-bind-1:
    socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

    sock-bind-2:
    socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

    sock-bind-3:
    socket(PF_PACKET, SOCK_RAW, 0) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
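
    A hedged sketch of the idea in packet_do_bind(); variable names are
    illustrative, and the real code also handles fanout and the running
    state:

    bool need_rehook = po->prot_hook.type != proto ||
                       po->prot_hook.dev != dev;

    if (need_rehook) {
        unregister_prot_hook(sk, true);  /* implies synchronize_net() */
        po->num = proto;
        po->prot_hook.type = proto;
        po->prot_hook.dev = dev;
        register_prot_hook(sk);
    }
    /* Otherwise: same proto and device as before, nothing to re-hook. */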

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Jan, 2014

1 commit


18 Dec, 2013

2 commits