18 Oct, 2018

1 commit

  • [ Upstream commit 7e823644b60555f70f241274b8d0120dd919269a ]

    Commit 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    turned static inline __skb_recv_udp() from being a trivial helper around
    __skb_recv_datagram() into a UDP-specific implementation, making it
    EXPORT_SYMBOL_GPL() at the same time.

    There are external modules that got broken by __skb_recv_udp() not being
    visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

    One rationale (of several) for why this is actually the "technically
    correct" thing to do: __skb_recv_udp() used to be an inline wrapper around
    __skb_recv_datagram(), which itself (still, and correctly so, I believe)
    is EXPORT_SYMBOL().

    Cc: Paolo Abeni
    Cc: Eric Dumazet
    Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

29 Sep, 2018

1 commit

  • [ Upstream commit 2b5a921740a55c00223a797d075b9c77c42cb171 ]

    commit 2abb7cdc0dc8 ("udp: Add support for doing checksum
    unnecessary conversion") left out the early demux path for
    connected sockets. As a result, IP_CMSG_CHECKSUM gives wrong
    values for such sockets when GRO is not enabled/available.

    This change addresses the issue by moving the csum conversion to a
    common helper and using such helper in both the default and the
    early demux rx path.

    Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

26 Jun, 2018

1 commit

  • [ Upstream commit 6c206b20092a3623184cff9470dba75d21507874 ]

    After commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
    the sk_rmem_alloc field no longer exactly measures the
    receive queue length, because we batch the rmem release. The issue
    becomes really apparent only after commit 0d4a6608f68c ("udp: do rmem bulk
    free even if the rx sk queue is empty"): user space can easily
    observe an empty socket with a non-zero queue length reported by the 'ss'
    tool or the procfs interface.

    We need to use a custom UDP helper to report the correct queue length,
    taking into account the forward allocation deficit.

    Reported-by: trevor.francis@46labs.com
    Fixes: 6b229cf77d68 ("UDP: add batching to udp_rmem_release()")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
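    The accounting described above can be sketched as a toy userspace model
    (all names and the batch size here are hypothetical, not the kernel's):

```python
class UdpSock:
    """Toy model of batched rmem release; names and sizes are made up."""
    BATCH = 4096  # uncharge accumulated forward-alloc only in batches

    def __init__(self):
        self.rmem_alloc = 0  # charged receive memory, uncharged lazily
        self.fwd_alloc = 0   # dequeued but not yet uncharged (the deficit)
        self.queue = []

    def enqueue(self, size):
        self.queue.append(size)
        self.rmem_alloc += size

    def dequeue(self):
        size = self.queue.pop(0)
        # batched release: rmem_alloc stays inflated until BATCH is reached
        self.fwd_alloc += size
        if self.fwd_alloc >= self.BATCH:
            self.rmem_alloc -= self.fwd_alloc
            self.fwd_alloc = 0
        return size

    def queue_length(self):
        # the fix: subtract the forward allocation deficit when reporting
        return self.rmem_alloc - self.fwd_alloc
```

    With this, an empty socket reports length 0 even while the raw
    rmem_alloc counter is still inflated by the pending batch.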
     

19 May, 2018

2 commits

  • [ Upstream commit 69678bcd4d2dedbc3e8fcd6d7d99f283d83c531a ]

    Damir reported a breakage of SO_BINDTODEVICE for UDP sockets.
    In absence of VRF devices, after commit fb74c27735f0 ("net:
    ipv4: add second dif to udp socket lookups") the dif mismatch
    isn't fatal anymore for UDP socket lookup with non null
    sk_bound_dev_if, breaking SO_BINDTODEVICE semantics.

    This changeset addresses the issue making the dif match mandatory
    again in the above scenario.

    Reported-by: Damir Mansurov
    Fixes: fb74c27735f0 ("net: ipv4: add second dif to udp socket lookups")
    Fixes: 1801b570dd2a ("net: ipv6: add second dif to udp socket lookups")
    Signed-off-by: Paolo Abeni
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
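    The restored semantics can be modelled with a small predicate (a sketch
    with a hypothetical name, not the kernel's actual scoring code):

```python
def udp_sock_matches(sk_bound_dev_if: int, dif: int) -> bool:
    """After the fix: a non-zero sk_bound_dev_if makes the ingress
    device index (dif) match mandatory again, restoring the
    SO_BINDTODEVICE contract in the absence of VRF devices."""
    if sk_bound_dev_if == 0:
        return True  # unbound socket: any ingress device is acceptable
    return dif == sk_bound_dev_if
```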
     
  • [ Upstream commit 1b97013bfb11d66f041de691de6f0fec748ce016 ]

    Fix more memory leaks in ip_cmsg_send() callers. Some of them were fixed
    earlier in 919483096bfe.

    * The udp_sendmsg() one has been there since the beginning, when the linux
    sources were first added to git;
    * the ping_v4_sendmsg() one was copy/pasted in c319b4d76b9e.

    Whenever udp_sendmsg() or ping_v4_sendmsg() returns, the IP options
    have to be freed if they were previously allocated.

    Add a label so that future code paths (if any) can use it instead of an
    easy-to-forget kfree() before return.

    Fixes: c319b4d76b9e (net: ipv4: add IPPROTO_ICMP socket kind)
    Signed-off-by: Andrey Ignatov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ignatov
     

09 Mar, 2018

1 commit

  • [ Upstream commit 15f35d49c93f4fa9875235e7bf3e3783d2dd7a1b ]

    Since UDP-Lite is always using checksum, the following path is
    triggered when calculating pseudo header for it:

    udp4_csum_init() or udp6_csum_init()
    skb_checksum_init_zero_check()
    __skb_checksum_validate_complete()

    The problem can appear if skb->len is less than CHECKSUM_BREAK. In
    this particular case __skb_checksum_validate_complete() also invokes
    __skb_checksum_complete(skb). If UDP-Lite is using partial checksum
    that covers only part of a packet, the function will return bad
    checksum and the packet will be dropped.

    It can be fixed if we skip skb_checksum_init_zero_check() and only
    set the required pseudo header checksum for UDP-Lite with partial
    checksum before the udp4_csum_init()/udp6_csum_init() functions return.

    Fixes: ed70fcfcee95 ("net: Call skb_checksum_init in IPv4")
    Fixes: e4f45b7f40bd ("net: Call skb_checksum_init in IPv6")
    Signed-off-by: Alexey Kodanev
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kodanev
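    The pseudo header checksum mentioned above is the RFC 1071
    one's-complement sum over source address, destination address, protocol
    and length. A userspace sketch of that computation (the kernel uses
    csum_tcpudp_nofold() and friends, not this code):

```python
import struct

def csum16(data: bytes) -> int:
    """RFC 1071 Internet checksum: 16-bit one's-complement sum, folded."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udplite_pseudo_csum(saddr: bytes, daddr: bytes, length: int) -> int:
    """IPv4 pseudo header checksum for UDP-Lite (IPPROTO_UDPLITE == 136)."""
    pseudo = saddr + daddr + struct.pack("!BBH", 0, 136, length)
    return csum16(pseudo)
```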
     

22 Oct, 2017

1 commit

  • Syzkaller stumbled upon a way to trigger
    WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
    reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

    There are two initialization paths for the sock_reuseport structure in a
    socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through
    SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation
    assumed that the socket lock protected both of these paths when it actually
    only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered this
    double allocation by running these paths concurrently.

    This patch moves the check for double allocation into the reuseport_alloc
    function which is protected by a global spin lock.

    Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
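    The race and its fix can be modelled in userspace: the existing-group
    check moves inside the same (global) lock that protects the allocation.
    A sketch with hypothetical names:

```python
import threading

_reuseport_lock = threading.Lock()  # stands in for the global spinlock

class Sock:
    def __init__(self):
        self.reuseport_cb = None  # per-group state, allocated at most once

def reuseport_alloc(sk):
    """Check for an existing group inside the lock, so the bind() and
    SO_ATTACH_REUSEPORT_*BPF paths cannot both allocate one."""
    with _reuseport_lock:
        if sk.reuseport_cb is None:
            sk.reuseport_cb = {"socks": [sk]}
        return sk.reuseport_cb
```

    Two threads racing through reuseport_alloc() now always end up sharing
    the same group object instead of double-allocating it.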
     

21 Oct, 2017

1 commit


10 Oct, 2017

1 commit

  • The commit bc044e8db796 ("udp: perform source validation for
    mcast early demux") does not take into account that broadcast packets
    land in the same code path and need different checks for the
    source address - notably, a zero source address is valid for bcast
    and invalid for mcast.

    As a result, the 2nd and later broadcast packets with a 0 source address
    landing on the same socket are dropped. This breaks DHCP servers.

    Since we don't have stringent performance requirements for ingress
    broadcast traffic, fix it by disabling UDP early demux for such traffic.

    Reported-by: Hannes Frederic Sowa
    Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
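    The different source rules, and the chosen fix, can be sketched as
    follows (hypothetical helper names, not kernel functions):

```python
def source_addr_ok(saddr: str, dst_is_broadcast: bool) -> bool:
    """0.0.0.0 is a legal source for broadcast (e.g. a DHCP DISCOVER
    from a not-yet-configured client) but never for multicast."""
    if saddr == "0.0.0.0":
        return dst_is_broadcast
    return True

def use_early_demux(dst_is_broadcast: bool) -> bool:
    """The fix: ingress broadcast simply skips early demux, so the
    slow path applies the correct broadcast source checks."""
    return not dst_is_broadcast
```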
     

01 Oct, 2017

2 commits

  • The UDP early demux can leverage the rx dst cache even for
    multicast unconnected sockets.

    In such a scenario the ipv4 source address is validated only on
    the first packet in the given flow. After that, when we fetch
    the dst entry from the socket rx cache, we stop enforcing
    the rp_filter and we even start accepting any kind of martian
    addresses.

    Disabling the dst cache for unconnected multicast sockets would
    cause a large performance regression, nearly halving the
    max ingress throughput.

    Instead we factor out a route helper to completely validate an
    skb source address for multicast packets and we call it from
    the UDP early demux for mcast packets landing on unconnected
    sockets, after successfully fetching the related cached dst entry.

    This still gives a measurable, but limited performance
    regression:

                     rp_filter = 0    rp_filter = 1
    edmux disabled:    1182 Kpps        1127 Kpps
    edmux before:      2238 Kpps        2238 Kpps
    edmux after:       2037 Kpps        2019 Kpps

    The above figures are on top of the current net tree.
    Applying the net-next commit 6e617de84e87 ("net: avoid a full
    fib lookup when rp_filter is disabled.") the delta with
    rp_filter == 0 will decrease even more.

    Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Currently no error is emitted, but this infrastructure will be
    used by the next patch to allow source address validation
    for mcast sockets.
    Since early demux can do a route lookup and an ipv4 route
    lookup can return an error code, this is consistent with the
    current ipv4 route infrastructure.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Sep, 2017

1 commit

  • After commit 0ddf3fb2c43d ("udp: preserve skb->dst if required
    for IP options processing") we clear the skb head state as soon
    as the skb carrying them is first processed.

    Since the same skb can be processed several times when MSG_PEEK
    is used, we can end up lacking the required head states, and
    eventually oopsing.

    Fix this by clearing the skb head state only when processing the
    last skb reference.

    Reported-by: Eric Dumazet
    Fixes: 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

07 Sep, 2017

1 commit

  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

04 Sep, 2017

1 commit


02 Sep, 2017

2 commits


26 Aug, 2017

1 commit

  • Currently, in the udp6 code, the dst cookie is not initialized/updated
    concurrently with the RX dst used by early demux.

    As a result, the dst_check() in the early_demux path always fails,
    the rx dst cache is always invalidated, and we can't really
    gain anything significant from the demux lookup.

    Fix it by adding a udp6-specific variant of sk_rx_dst_set() and using
    it to set the dst cookie when the dst entry really changes.

    The issue is there since the introduction of early demux for ipv6.

    Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast")
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
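    The cookie mechanism can be modelled as a cache entry that is valid only
    while the recorded cookie matches; the udp6 fix is the set() step below
    actually recording the cookie (a toy model with hypothetical names):

```python
class RxDstCache:
    """Toy model of sk->sk_rx_dst plus its validity cookie."""
    def __init__(self):
        self.dst = None
        self.cookie = None  # before the fix this was never kept in sync

    def set(self, dst, current_cookie):
        # record the cookie whenever the cached dst really changes
        self.dst = dst
        self.cookie = current_cookie

    def check(self, current_cookie):
        # models dst_check(): a stale cookie invalidates the cache
        if self.dst is None or self.cookie != current_cookie:
            return None  # caller must redo the route lookup
        return self.dst
```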
     

25 Aug, 2017

1 commit


23 Aug, 2017

1 commit

  • Remove two references to ufo in the udp send path that are no longer
    reachable now that ufo has been removed.

    Commit 85f1bd9a7b5a ("udp: consistently apply ufo or fragmentation")
    is a fix to ufo. It is safe to revert what remains of it.

    Also, no skb can enter ip_append_page with skb_is_gso true now that
    skb_shinfo(skb)->gso_type is no longer set in ip_append_page/_data.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
    headers from UDP packets before queueing"), when udp packets are being
    peeked the requested extra offset is always 0 as there is no need to skip
    the udp header. However, when the offset is 0 and the next skb is
    of length 0, it is only returned once. The behaviour can be seen with
    the following python script:

    from socket import *;
    f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    f.bind(('::', 0));
    addr=('::1', f.getsockname()[1]);
    g.sendto(b'', addr)
    g.sendto(b'b', addr)
    print(f.recvfrom(10, MSG_PEEK));
    print(f.recvfrom(10, MSG_PEEK));

    Where the expected output should be the empty string twice.

    Instead, make sk_peek_offset return negative values, and pass those values
    to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset
    to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
    __skb_try_recv_from_queue will then ensure the offset is reset back to 0
    if a peek is requested without an offset, unless no packets are found.

    Also simplify the if condition in __skb_try_recv_from_queue. If _off is
    greater than 0, and off is greater than or equal to skb->len, then
    (_off || skb->len) must always be true, assuming skb->len >= 0 is always
    true.

    Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
    as it double checked if MSG_PEEK was set in the flags.

    V2:
    - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
    redundant checks
    - Fix peeking in udp{,v6}_recvmsg to report the right value when the
    offset is 0

    V3:
    - Marked new branch in __skb_try_recv_from_queue as unlikely.

    Signed-off-by: Matthew Dawson
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Matthew Dawson
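    The fixed skb-selection logic can be sketched as follows, where a
    negative offset means "no peek offset requested" (a toy model of
    __skb_try_recv_from_queue, with the queue given as payload lengths):

```python
def try_recv_from_queue(lengths, off):
    """Return (index, remaining_off) of the skb to deliver, or
    (None, 0) if the requested offset runs past the queue.
    A negative off never skips a packet, so zero-length datagrams
    are returned on every peek."""
    _off = off
    for i, length in enumerate(lengths):
        # skip only when a non-negative offset covers this whole skb
        if _off >= 0 and _off >= length and (_off or length):
            _off -= length
            continue
        return i, max(_off, 0)
    return None, 0
```

    With the queue [0, 1] from the reproducer above, repeated peeks with a
    negative offset keep returning the empty datagram, as expected.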
     

11 Aug, 2017

3 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Mainline had UFO fixes, but UFO is removed in net-next so we
    take the HEAD hunks.

    Minor context conflict in bcmsysport statistics bug fix.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When iteratively building a UDP datagram with MSG_MORE and that
    datagram exceeds MTU, consistently choose UFO or fragmentation.

    Once skb_is_gso, always apply ufo. Conversely, once a datagram is
    split across multiple skbs, do not consider ufo.

    Sendpage already maintains the first invariant, only add the second.
    IPv6 does not have a sendpage implementation to modify.

    A gso skb must have a partial checksum, do not follow sk_no_check_tx
    in udp_send_skb.

    Found by syzkaller.

    Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
    Reported-by: Andrey Konovalov
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

10 Aug, 2017

1 commit

  • Any use of key->enabled (that is static_key_enabled and static_key_count)
    outside jump_label_lock should handle its own serialization. The only
    two that are not doing so are the UDP encapsulation static keys. Change
    them to use static_key_enable, which now correctly tests key->enabled under
    the jump label lock.

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eric Dumazet
    Cc: Jason Baron
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1501601046-35683-3-git-send-email-pbonzini@redhat.com
    Signed-off-by: Ingo Molnar

    Paolo Bonzini
     

08 Aug, 2017

3 commits

  • Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Add a second device index, sdif, to inet socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
    ingress index is obtained from IPCB using inet_sdif and after the cb move
    in tcp_v4_rcv the tcp_v4_sdif helper is used.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Add a second device index, sdif, to udp socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    Early demux lookups are handled in the next patch as part of INET_MATCH
    changes.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
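    The effect of sdif on socket lookup can be sketched in shape only (the
    real compute_score() weighs more fields than this toy version):

```python
def compute_score(sk, dport, dif, sdif):
    """-1 means no match; a device-bound socket now also matches when
    its ifindex equals sdif, the l3mdev slave the packet arrived on."""
    if sk["port"] != dport:
        return -1
    score = 0
    if sk["bound_dev_if"]:
        if sk["bound_dev_if"] not in (dif, sdif):
            return -1
        score += 4  # device-bound match outranks a wildcard socket
    return score

def lookup(socks, dport, dif, sdif=0):
    best, best_score = None, -1
    for sk in socks:
        s = compute_score(sk, dport, dif, sdif)
        if s > best_score:
            best, best_score = sk, s
    return best
```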
     

07 Aug, 2017

1 commit

  • __ip_options_echo() no longer needs skb->dst, so we can
    avoid explicitly preserving it for its own sake.

    This is almost a revert of commit 0ddf3fb2c43d ("udp: preserve
    skb->dst if required for IP options processing") plus some
    lifting to fit later changes.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

30 Jul, 2017

1 commit

  • When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
    sk reference is retrieved and used, but the relevant reference
    count is leaked and the socket destructor is never called.
    Beyond leaking the sk memory, if there are pending UDP packets
    in the receive queue, even the related accounted memory is leaked.

    In the long run, this will cause persistent forward allocation errors
    and no UDP skbs (both ipv4 and ipv6) will be able to reach
    user space.

    Fix this by explicitly accessing the early demux reference before
    the lookup, and properly decreasing the socket reference count
    after usage.

    Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
    the now-obsolete comment about the "socket cache".

    The newly added code is derived from the current ipv4 code for the
    similar path.

    v1 -> v2:
    fixed the __udp6_lib_rcv() return code for resubmission,
    as suggested by Eric

    Reported-by: Sam Edwards
    Reported-by: Marc Haber
    Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast")
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

27 Jul, 2017

1 commit

  • We must use a pre-processor conditional block or suitable accessors to
    manipulate skb->sp, otherwise builds lacking CONFIG_XFRM will break.

    Fixes: dce4551cb2ad ("udp: preserve head state for IP_CMSG_PASSSEC")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

26 Jul, 2017

1 commit

  • Paul Moore reported a SELinux/IP_PASSSEC regression
    caused by missing skb->sp at recvmsg() time. We need to
    preserve the skb head state to process the IP_CMSG_PASSSEC
    cmsg.

    With this commit we avoid releasing the skb head state in the
    BH when a secpath is attached to the current skb, and store
    the skb status (with/without head states) in the scratch area,
    so that we can access it at skb deallocation time without
    incurring cache-miss penalties.

    This also avoids misusing the skb CB for ipv6 packets,
    as introduced by the commit 0ddf3fb2c43d ("udp: preserve
    skb->dst if required for IP options processing").

    Also clean up the scratch area helpers a bit, to reduce the
    code differences between 32- and 64-bit builds.

    Reported-by: Paul Moore
    Fixes: 0a463c78d25b ("udp: avoid a cache miss on dequeue")
    Fixes: 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing")
    Signed-off-by: Paolo Abeni
    Tested-by: Paul Moore
    Signed-off-by: David S. Miller

    Paolo Abeni
     

19 Jul, 2017

1 commit

  • Eric noticed that in udp_recvmsg() we still need to access
    skb->dst while processing the IP options.
    Since commit 0a463c78d25b ("udp: avoid a cache miss on dequeue")
    skb->dst is no longer available at recvmsg() time and bad things
    will happen if we enter the relevant code path.

    This commit addresses the issue by avoiding clearing skb->dst if
    any IP options are present in the relevant skb.
    Since the IP CB is contained in the first skb cacheline, we can
    test it to decide whether to leverage the consume_stateless_skb()
    optimization, without measurable additional cost in the faster
    path.

    v1 -> v2: updated commit message tags

    Fixes: 0a463c78d25b ("udp: avoid a cache miss on dequeue")
    Reported-by: Andrey Konovalov
    Reported-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
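    The semantics that make refcount_inc_not_zero() safe can be modelled in
    a few lines (a sketch only; the kernel's refcount_t additionally
    saturates on overflow rather than asserting):

```python
class RefCount:
    """Toy refcount_t semantics: inc_not_zero() refuses to resurrect
    an object whose last reference is already gone."""
    def __init__(self, value=1):
        self.value = value

    def inc_not_zero(self):
        if self.value == 0:
            return False  # object is being freed: taking a ref is a bug
        self.value += 1
        return True

    def dec_and_test(self):
        assert self.value > 0, "underflow would mean a use-after-free"
        self.value -= 1
        return self.value == 0  # True: caller must free the object
```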
     

28 Jun, 2017

1 commit


23 Jun, 2017

1 commit

  • Michael reported a UDP breakage caused by the commit b65ac44674dd
    ("udp: try to avoid 2 cache miss on dequeue").
    The function __first_packet_length() can update the checksum bits
    of the pending skb, making the scratch area out-of-sync, and
    setting skb->csum, if the skb was previously in need of checksum
    validation.

    On a later recvmsg() for such skb, checksum validation will be
    invoked again - due to the wrong udp_skb_csum_unnecessary()
    value - and will fail, causing the valid skb to be dropped.

    This change addresses the issue by refreshing the scratch area in
    __first_packet_length() after the possible checksum update.

    Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue")
    Reported-by: Michael Ellerman
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

21 Jun, 2017

1 commit

  • When processing UDP packets, if the BH is the bottleneck, it
    always sees a cache miss while updating rmem_alloc; try to
    avoid it by prefetching the value as soon as we have the socket
    available.

    Performance under flood with multiple NIC rx queues in use is
    unaffected, but when a single NIC rx queue is in use, this
    gives a ~10% performance improvement.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

18 Jun, 2017

1 commit

  • In the udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
    dst with the DST_NOCACHE flag. This is because later, in the
    udp_sk_rx_dst_set() function, we will try to cache this dst in the sk
    for the connected case.
    However, a better way to achieve this is to not try to hold dst in
    early_demux(), but to call dst_hold_safe() in udp_sk_rx_dst_set(). This
    approach is also more consistent with how tcp handles it, and it
    will make later changes simpler.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     

12 Jun, 2017

2 commits

  • When udp_recvmsg() is executed, on x86_64 and other archs, most skb
    fields are on cold cachelines.
    If the skb is linear and the kernel doesn't need to compute the udp
    csum, only a handful of skb fields are required by udp_recvmsg().
    Since we already use skb->dev_scratch to cache hot data, and
    there are 32 unused bits on 64-bit archs, use that field to cache
    as much data as we can, and try to prefetch on dequeue the relevant
    fields that are left out.

    This can save up to 2 cache misses per packet.

    v1 -> v2:
    - changed udp_dev_scratch field types to the u{32,16} variants,
    replaced bitfield with bool

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
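    The idea of packing a handful of hot fields into the spare 64 bits can
    be illustrated with struct packing (this layout is illustrative only,
    not the kernel's actual udp_dev_scratch):

```python
import struct

def pack_scratch(truesize, length, csum_unnecessary, is_linear):
    """Pack four hot fields into 8 bytes: u32 truesize, u16 length,
    and two boolean flags in the final u16 (illustrative layout)."""
    flags = (int(csum_unnecessary) << 1) | int(is_linear)
    return struct.pack("<IHH", truesize, length, flags)

def unpack_scratch(scratch):
    truesize, length, flags = struct.unpack("<IHH", scratch)
    return truesize, length, bool(flags & 2), bool(flags & 1)
```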
     
  • Since UDP no longer uses sk->destructor, we can completely clear
    the skb head state before enqueuing. Amend and use
    skb_release_head_state() for that.

    All head states share a single cacheline, which is not
    normally used/accessed on dequeue. We can avoid accessing
    that cacheline entirely by implementing and using, in the UDP code,
    a specialized skb free helper which ignores the skb head state.

    This saves a cacheline miss at skb deallocation time.

    v1 -> v2:
    replaced secpath_reset() with skb_release_head_state()

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni