08 Feb, 2017

3 commits

  • The conflict was an interaction between a bug fix in the
    netvsc driver in 'net' and an optimization of the RX path
    in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.

    The datagram protocols can use MSG_CONFIRM to confirm the
    neighbour. When used with MSG_PROBE we do not reach the
    code where the neighbour is confirmed, so we have to do the
    same slow lookup by using the dst_confirm_neigh() helper.
    When MSG_PROBE is not used, ip_append_data/ip6_append_data
    will set the skb flag dst_pending_confirm.

    Reported-by: YueHaibing
    Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
    Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
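
    A minimal user-space sketch of the path this fix covers: probing a
    neighbour with MSG_CONFIRM | MSG_PROBE so that no payload is sent.
    This is illustrative only; the destination is a placeholder, and
    MSG_PROBE may need to be defined by hand since it comes from the
    kernel headers.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        #ifndef MSG_PROBE
        #define MSG_PROBE 0x10   /* linux/socket.h: probe path, send nothing */
        #endif

        int main(void)
        {
            struct sockaddr_in dst;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(9);              /* placeholder port */
            inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);

            /* MSG_PROBE sends no payload; MSG_CONFIRM asks the stack to
             * confirm the neighbour entry -- the combination fixed here. */
            if (sendto(fd, NULL, 0, MSG_CONFIRM | MSG_PROBE,
                       (const struct sockaddr *)&dst, sizeof(dst)) < 0)
                perror("sendto");
            close(fd);
            return 0;
        }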
     
  • Dmitry reported that UDP sockets being destroyed would trigger the
    WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct()

    It turns out we do not properly destroy skb(s) that have a wrong UDP
    checksum.

    Thanks again to syzkaller team.

    Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Hannes Frederic Sowa
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jan, 2017

1 commit

  • Packets arriving in a VRF are currently delivered to UDP sockets that
    aren't bound to any interface. TCP defaults to not delivering packets
    arriving in a VRF to unbound sockets. IP route lookup and socket
    transmit both assume that unbound means using the default table, and
    UDP applications that haven't been changed to be aware of VRFs may not
    function correctly in this case, since they may not be able to handle
    overlapping IP address ranges, or be able to send packets back to the
    original sender if required.

    So add a sysctl, udp_l3mdev_accept, to control this behaviour,
    analogous to the existing tcp_l3mdev_accept, namely to allow a
    process to have a VRF-global listen socket. Have this default to off
    as this is the behaviour that users will expect, given that there is
    no explicit mechanism to move unmodified VRF-unaware applications into
    a default VRF.

    Signed-off-by: Robert Shearman
    Acked-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
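
    A hedged sketch of the application side of this: a UDP socket made
    VRF-aware by binding it to a VRF device with SO_BINDTODEVICE (the
    device name "vrf-blue" is hypothetical; binding needs CAP_NET_RAW).
    With net.ipv4.udp_l3mdev_accept=1, an unbound socket would receive
    VRF traffic as well.

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        int main(void)
        {
            const char vrf[] = "vrf-blue";     /* hypothetical VRF device */
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            /* Scope the socket to the VRF's L3 domain. */
            if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                           vrf, sizeof(vrf)) < 0)
                perror("SO_BINDTODEVICE");

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(5000);       /* placeholder port */
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0)
                perror("bind");
            /* ... recvfrom() loop would go here ... */
            close(fd);
            return 0;
        }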
     

19 Jan, 2017

1 commit

  • We pass these per-protocol equal functions around in various places, but
    we can just have one function that checks sk->sk_family and then calls
    the right comparison function. I've also changed the ipv4 version to
    not cast to inet_sock since it is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
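
    A toy stand-in (not the kernel's types or signatures) for the single
    dispatch function the patch describes; the real helper is
    inet_rcv_saddr_equal(), but the fields below are simplified:

        #include <stdbool.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        /* Toy stand-in for struct sock, just enough for the dispatch. */
        struct toy_sock {
            unsigned short  sk_family;
            struct in_addr  saddr4;
            struct in6_addr saddr6;
        };

        static bool ipv4_saddr_equal(const struct toy_sock *a,
                                     const struct toy_sock *b)
        {
            /* no cast to an inet_sock equivalent needed, per the patch */
            return a->saddr4.s_addr == b->saddr4.s_addr;
        }

        static bool ipv6_saddr_equal(const struct toy_sock *a,
                                     const struct toy_sock *b)
        {
            return memcmp(&a->saddr6, &b->saddr6, sizeof(a->saddr6)) == 0;
        }

        /* One entry point that picks the comparison from sk_family. */
        bool rcv_saddr_equal(const struct toy_sock *a, const struct toy_sock *b)
        {
            if (a->sk_family == AF_INET6)
                return ipv6_saddr_equal(a, b);
            return ipv4_saddr_equal(a, b);
        }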
     

07 Jan, 2017

1 commit

  • UDP lib inuse checks will walk the entire hash bucket to check if the
    portaddr is in use. In the case of reuseport we can stop searching when
    we find a matching reuseport.

    On a 16-core VM a test program that spawns 16 threads that each bind to
    1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.

    Also add a cond_resched() when the port is not specified.

    Signed-off-by: Eric Garver
    Signed-off-by: David S. Miller

    Eric Garver
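
    The timing numbers refer to a bind-heavy workload; a minimal sketch
    of one such bind (hypothetical helper, error handling trimmed):

        #include <stdio.h>
        #include <string.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        static int bind_reuseport(unsigned short port)
        {
            struct sockaddr_in addr;
            int one = 1;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            if (fd < 0)
                return -1;
            /* All sockets sharing the port opt into SO_REUSEPORT, so the
             * inuse check can stop at the first reuseport match. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("bind");
                return -1;
            }
            return fd;
        }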
     

10 Dec, 2016

4 commits

  • In flood situations, keeping sk_rmem_alloc at a high value
    prevents producers from touching the socket.

    It makes sense to lower sk_rmem_alloc only at the end
    of udp_rmem_release(), after the thread draining the receive
    queue in udp_recvmsg() has finished its writes to sk_forward_alloc.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If udp_recvmsg() constantly releases sk_rmem_alloc
    for every packet read, it gives producers the opportunity
    to immediately grab spinlocks and desperately
    try adding another packet, causing false sharing.

    We can add a simple heuristic to give the signal
    in batches of ~25 % of the queue capacity.

    This patch considerably increases performance under
    flood, by about 50 %, since the thread draining the queue
    is no longer slowed by false sharing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
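
    A toy user-space model of the batching heuristic (not kernel code;
    in the kernel the deficit lives in udp_sock and the shared counter
    is sk_rmem_alloc):

        #include <stdatomic.h>

        struct rx_accounting {
            atomic_int rmem_alloc;   /* shared with producers (hot) */
            int        rcvbuf;       /* queue capacity in bytes */
            int        deficit;      /* consumer-private accumulator */
        };

        static void rmem_release_batched(struct rx_accounting *a, int size)
        {
            a->deficit += size;
            /* Defer the atomic write producers watch until ~25% of the
             * capacity has been drained, avoiding false sharing. */
            if (a->deficit < (a->rcvbuf >> 2))
                return;
            atomic_fetch_sub(&a->rmem_alloc, a->deficit);
            a->deficit = 0;
        }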
     
  • In the UDP RX handler, we currently clear skb->dev before the skb
    is added to the receive queue, because the device pointer is no longer
    available once we exit from the RCU section.

    Since this first cache line is always hot, let's reuse this space
    to store skb->truesize, and thus avoid a cache line miss at
    udp_recvmsg()/udp_skb_destructor time while the receive queue
    spinlock is held.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The idea of busylocks is to let producers grab an extra spinlock
    to relieve pressure on the receive_queue spinlock shared with the
    consumer.

    This behavior is engaged only once the socket receive queue is above
    half occupancy.

    Under flood, this means that only one producer can be in line
    trying to acquire the receive_queue spinlock.

    These busylocks can be allocated in a per-cpu manner, instead of
    per socket (which would consume a cache line per socket).

    This patch considerably improves UDP behavior under stress,
    depending on number of NIC RX queues and/or RPS spread.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
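
    A toy pthread model of the busylock pattern (the kernel uses a small
    hashed array of per-cpu spinlocks rather than one global mutex):

        #include <pthread.h>

        static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t busylock   = PTHREAD_MUTEX_INITIALIZER;

        void producer_enqueue(void (*do_enqueue)(void), int above_half)
        {
            /* Extra serialization only once the queue is above half
             * occupancy, mirroring the patch's trigger condition. */
            if (above_half)
                pthread_mutex_lock(&busylock);

            pthread_mutex_lock(&queue_lock);   /* the hot, shared lock */
            do_enqueue();                      /* append to receive queue */
            pthread_mutex_unlock(&queue_lock);

            if (above_half)
                pthread_mutex_unlock(&busylock);
        }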
     

09 Dec, 2016

1 commit

  • Under UDP flood, many softirq producers try to add packets to
    UDP receive queue, and one user thread is burning one cpu trying
    to dequeue packets as fast as possible.

    Two parts of the per-packet cost are:
    - copying payload from kernel space to user space,
    - freeing memory pieces associated with skb.

    If the socket is under pressure, softirq handler(s) can try to pull
    the packet's payload into skb->head if it fits.

    That way, the softirq handler(s) can free/reuse the page fragment
    immediately, instead of letting udp_recvmsg() do this hundreds of usec
    later, possibly from another node.

    Additional gains:
    - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
    - We avoid cache line misses at copyout() time and consume_skb() time,
    and avoid one put_page() with potential alien freeing on NUMA hosts.

    This comes at the cost of a copy, bounded to available tail room, which
    is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
    than necessary)

    This patch gave me about 5 % increase in throughput in my tests.

    The skb_condense() helper could probably be used in other contexts.

    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
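
    A toy user-space model of the idea behind skb_condense() (the real
    code is kernel-internal and operates on sk_buff page fragments):

        #include <stdlib.h>
        #include <string.h>

        struct toy_skb {
            unsigned char *head;      /* linear buffer */
            size_t len, cap;          /* bytes used / capacity of head */
            unsigned char *frag;      /* paged payload, may be NULL */
            size_t frag_len;
            size_t truesize;          /* memory charged to the socket */
        };

        static void toy_condense(struct toy_skb *skb)
        {
            if (!skb->frag || skb->frag_len > skb->cap - skb->len)
                return;                    /* does not fit, leave as is */
            /* Pull the fragment into the tail room... */
            memcpy(skb->head + skb->len, skb->frag, skb->frag_len);
            skb->len += skb->frag_len;
            /* ...so its memory can be freed right away (in softirq
             * context in the kernel), shrinking truesize. */
            skb->truesize -= skb->frag_len;
            free(skb->frag);
            skb->frag = NULL;
            skb->frag_len = 0;
        }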
     

04 Dec, 2016

1 commit

  • Before commit 850cbaddb52d ("udp: use it's own memory accounting
    schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
    the rcvbuf by the whole current packet's truesize. After said commit
    we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
    is empty. As reported by Jesper, this caused a performance regression
    for some (small) values of rcvbuf.

    This commit is intended to fix the regression by restoring the old
    handling of the rcvbuf limit.

    Reported-by: Jesper Dangaard Brouer
    Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

25 Nov, 2016

1 commit

  • In commits 93821778def10 ("udp: Fix rcv socket locking") and
    f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
    __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
    was forgotten.

    This leads to crashes if the UDPlite header is pulled twice, which
    happens starting from commit e6afc8ace6dd ("udp: remove headers from
    UDP packets before queueing").

    Bug found by syzkaller team, thanks a lot guys !

    Note that backlog use in UDP/UDPlite is scheduled to be removed starting
    from linux-4.10, so this patch is only needed up to linux-4.9

    Fixes: 93821778def1 ("udp: Fix rcv socket locking")
    Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Cc: Benjamin LaHaise
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2016

1 commit

  • All conflicts were simple overlapping changes except perhaps
    for the Thunder driver.

    That driver has a change_mtu method explicitly for sending
    a message to the hardware. If that fails it returns an
    error.

    Normally a driver doesn't need an ndo_change_mtu method because those
    are usually just range changes, which are now handled generically.
    But since this extra operation is needed in the Thunder driver, it has
    to stay.

    However, if the message send fails we have to restore the original
    MTU, because the entire call chain expects that if an error is thrown
    by ndo_change_mtu then the MTU did not change. Therefore code is added
    to nicvf_change_mtu to remember the original MTU, and to restore it
    upon nicvf_update_hw_max_frs() failure.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Nov, 2016

1 commit

  • UDP_SKB_CB(skb)->partial_cov is located at offset 66 in the skb,
    requiring a cold cache line to be read into the cpu cache.

    We can avoid this cache line miss for UDP sockets,
    as partial_cov has a meaning only for UDPLite.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Nov, 2016

1 commit

  • UDP busy polling is restricted to connected UDP sockets.

    This is because sk_busy_loop() only takes care of one NAPI context.

    There are cases where it could be extended.

    1) Some hosts receive traffic on a single NIC, with one RX queue.

    2) Some applications use SO_REUSEPORT and associated BPF filter
    to split the incoming traffic on one UDP socket per RX
    queue/thread/cpu

    3) Some UDP sockets are used to send/receive traffic for one flow, but
    they do not bother with connect()

    This patch records the napi_id of the first received skb, giving
    busy polling more reach.

    Tested:

    lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
    lpaa24:~# echo 70 >/proc/sys/net/core/busy_read

    lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done

    Before patch :
    27867 28870 37324 41060 41215
    36764 36838 44455 41282 43843
    After patch :
    73920 73213 70147 74845 71697
    68315 68028 75219 70082 73707

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
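
    The busy_read sysctl used in the test has a per-socket equivalent; a
    minimal sketch (SO_BUSY_POLL takes the busy-poll budget in usec):

        #include <stdio.h>
        #include <sys/socket.h>

        #ifndef SO_BUSY_POLL
        #define SO_BUSY_POLL 46      /* from asm-generic/socket.h */
        #endif

        int enable_busy_poll(int fd)
        {
            int usec = 70;           /* same value as the test above */

            if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                           &usec, sizeof(usec)) < 0) {
                perror("SO_BUSY_POLL");
                return -1;
            }
            return 0;
        }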
     

16 Nov, 2016

2 commits

  • Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(),
    otherwise udplite broadcast/multicast use the wrong table and it breaks.

    Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.")
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • The commit 850cbaddb52d ("udp: use it's own memory accounting schema")
    assumes that the socket proto has memory accounting enabled,
    but this is not the case for UDPLITE.
    Fix it by enabling memory accounting for UDPLITE and performing
    fwd allocated memory reclaiming on socket shutdown.
    UDP and UDPLITE now share the same memory accounting limits.
    Also drop the backlog receive operation, since it is no longer needed.

    Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
    Reported-by: Andrei Vagin
    Suggested-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

14 Nov, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains a second batch of Netfilter updates for
    your net-next tree. This includes a rework of the core hook
    infrastructure that improves Netfilter performance by ~15% according to
    synthetic benchmarks. Then, a large batch with ipset updates, including
    a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
    couple of assorted updates.

    Regarding the core hook infrastructure rework to improve performance,
    using this simple drop-all packets ruleset from ingress:

    nft add table netdev x
    nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
    nft add rule netdev x y drop

    And generating traffic through Jesper Brouer's
    samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i
    option. perf report shows nf_tables calls in its top 10:

    17.30% kpktgend_0 [nf_tables] [k] nft_do_chain
    15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
    10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev

    I'm measuring an improvement of ~15% in performance with this
    patchset here, i.e. an extra +2.5 Mpps. I have used my old laptop, an
    Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 4 cores.

    This rework contains more specifically, in strict order, these patches:

    1) Remove compile-time debugging from core.

    2) Remove obsolete comments that predate the rcu era. These days it is
    well known that a Netfilter hook always runs under rcu_read_lock().

    3) Remove threshold handling, as this is only used by br_netfilter.
    We already have specific code to handle this from br_netfilter,
    so remove this code from the core path.

    4) Deprecate NF_STOP, as this is only used by br_netfilter.

    5) Place the nf_hook_state pointer into the xt_action_param structure,
    so this structure fits into one single cacheline according to pahole.
    This also implicitly affects nftables, since it also relies on the
    xt_action_param structure.

    6) Move state->hook_entries into nf_queue entry. The hook_entries
    pointer is only required by nf_queue(), so we can store this in the
    queue entry instead.

    7) Use a switch() statement to handle verdict cases.

    8) Remove hook_entries field from nf_hook_state structure, this is only
    required by nf_queue, so store it in nf_queue_entry structure.

    9) Merge nf_iterate() into nf_hook_slow() that results in a much more
    simple and readable function.

    10) Handle NF_REPEAT away from the core. So far the only client is
    nf_conntrack_in(), and we can restart the packet processing using a
    simple goto to jump back there when TCP requires it.
    This update required a second pass to fix fallout, with a fix from
    Arnd Bergmann.

    11) Set random seed from nft_hash when no seed is specified from
    userspace.

    12) Simplify nf_tables expression registration, in a much smarter way
    to save lots of boilerplate code, by Liping Zhang.

    13) Simplify layer 4 protocol conntrack tracker registration, from
    Davide Caratti.

    14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
    to recent generalization of the socket infrastructure, from Arnd
    Bergmann.

    15) Then, the ipset batch from Jozsef, he describes it as it follows:

    * Cleanup: Remove extra whitespaces in ip_set.h
    * Cleanup: Mark some of the helpers arguments as const in ip_set.h
    * Cleanup: Group counter helper functions together in ip_set.h
    * struct ip_set_skbinfo is introduced instead of open coded fields
    in skbinfo get/init helper functions.
    * Use kmalloc() in comment extension helper instead of kzalloc()
    because it is unnecessary to zero out the area just before
    explicit initialization.
    * Cleanup: Split extensions into separate files.
    * Cleanup: Separate memsize calculation code into dedicated function.
    * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
    together.
    * Add element count to hash headers by Eric B Munson.
    * Add element count to all set types header for uniform output
    across all set types.
    * Count non-static extension memory into memsize calculation for
    userspace.
    * Cleanup: Remove redundant mtype_expire() arguments, because
    they can be derived from other parameters.
    * Cleanup: Simplify mtype_expire() for hash types by removing
    one level of indentation.
    * Make NLEN compile time constant for hash types.
    * Make sure element data size is a multiple of u32 for the hash set
    types.
    * Optimize hash creation routine, exit as early as possible.
    * Make struct htype per ipset family so nets array becomes fixed size
    and thus simplifies the struct htype allocation.
    * Collapse same condition body into a single one.
    * Fix reported memory size for hash:* types, base hash bucket structure
    was not taken into account.
    * hash:ipmac type support added to ipset by Tomasz Chilinski.
    * Use setup_timer() and mod_timer() instead of init_timer()
    by Muhammad Falak R Wani, individually for the set type families.

    16) Remove useless connlabel field in struct netns_ct, patch from
    Florian Westphal.

    17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
    {ip,ip6,arp}tables code that uses this.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Nov, 2016

1 commit

  • Since commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    the udp6_lib_lookup and udp4_lib_lookup functions are only
    provided when it is actually possible to call them.

    However, moving the callers now caused a link error:

    net/built-in.o: In function `nf_sk_lookup_slow_v6':
    (.text+0x131a39): undefined reference to `udp6_lib_lookup'
    net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
    nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'

    This extends the #ifdef so we also provide the functions when
    CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively,
    is set.

    Fixes: 8db4c5be88f6 ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     

08 Nov, 2016

2 commits

  • A new argument is added to __skb_recv_datagram to provide
    an explicit skb destructor, invoked under the receive queue
    lock.
    The UDP protocol uses this argument to perform memory
    reclaiming on dequeue, so that the UDP protocol no longer
    sets skb->destructor.
    Instead, explicit memory reclaiming is performed at close() time and
    when skbs are removed from the receive queue.
    In-kernel UDP protocol users now need to call a
    skb_recv_udp() variant instead of skb_recv_datagram() to
    properly perform memory accounting on dequeue.

    Overall, this allows the receive queue lock to be acquired
    only once per dequeue.

    Tested using pktgen with random src port, 64 bytes packet,
    wire-speed on a 10G link as sender and udp_sink as the receiver,
    using an l4 tuple rxhash to stress the contention, and one or more
    udp_sink instances with reuseport.

    nr sinks   vanilla   patched
           1       440       560
           3      2150      2300
           6      3650      3800
           9      4450      4600
          12      6250      6450

    v1 -> v2:
    - do rmem and allocated memory scheduling under the receive lock
    - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
    - avoid the typedef for the dequeue callback

    Suggested-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
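
    A toy model of the dequeue-side mechanism (not the kernel API): the
    destructor runs while the queue lock is still held, so unlink and
    memory release need a single lock acquisition.

        #include <pthread.h>
        #include <stddef.h>

        struct node  { struct node *next; int truesize; };
        struct queue {
            pthread_mutex_t lock;
            struct node *head;
            int rmem;                          /* charged receive memory */
        };

        typedef void (*dtor_t)(struct queue *q, struct node *n);

        /* intended use: dequeue(q, rmem_dtor) */
        static void rmem_dtor(struct queue *q, struct node *n)
        {
            q->rmem -= n->truesize;            /* runs under q->lock */
        }

        static struct node *dequeue(struct queue *q, dtor_t dtor)
        {
            struct node *n;

            pthread_mutex_lock(&q->lock);
            n = q->head;
            if (n) {
                q->head = n->next;
                if (dtor)
                    dtor(q, n);                /* accounting under the lock */
            }
            pthread_mutex_unlock(&q->lock);
            return n;
        }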
     
  • So that we can use it even after orphaning the skbuff.

    Suggested-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket (e.g., ping
    replies), use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespace by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

27 Oct, 2016

1 commit

  • The first bug was added in commit ad6f939ab193 ("ip: Add offset parameter
    to ip_cmsg_recv"): Tom missed that ipv4 udp messages can be received on
    an AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
    ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

    Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
    queueing") forgot to adjust the offsets now that UDP headers are pulled
    before skbs are put in the receive queue.

    Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Cc: Sam Kumar
    Cc: Willem de Bruijn
    Tested-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2016

2 commits

  • Completely avoid default sock memory accounting and replace it
    with udp-specific accounting.

    Since the new memory accounting model completely encapsulates
    the required locking, remove the socket lock on both enqueue and
    dequeue, and avoid using the backlog on enqueue.

    Be sure to clean up rx queue memory on socket destruction, using
    UDP's own sk_destruct.

    Tested using pktgen with random src port, 64 bytes packet,
    wire-speed on a 10G link as sender and udp_sink as the receiver,
    using an l4 tuple rxhash to stress the contention, and one or more
    udp_sink instances with reuseport.

    nr readers   Kpps (vanilla)   Kpps (patched)
             1              170              440
             3             1250             2150
             6             3000             3650
             9             4200             4450
            12             5700             6250

    v4 -> v5:
    - avoid unneeded test in first_packet_length

    v3 -> v4:
    - remove useless sk_rcvqueues_full() call

    v2 -> v3:
    - do not set the now unused backlog_rcv callback

    v1 -> v2:
    - add memory pressure support
    - fixed dropwatch accounting for ipv6

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Avoid using the generic helpers.
    Use the receive queue spin lock to protect the memory
    accounting operation, both on enqueue and on dequeue.

    On dequeue perform partial memory reclaiming, trying to
    leave a quantum of forward allocated memory.

    On enqueue use a custom helper, to allow some optimizations:
    - use a plain spin_lock() variant instead of the slightly
    costlier spin_lock_irqsave(),
    - avoid the dst_force check, since the calling code has already
    dropped the skb dst,
    - avoid orphaning the skb, since skb_steal_sock() already did
    the work for us.

    The above needs custom memory reclaiming on shutdown, provided
    by the udp_destruct_sock().

    v5 -> v6:
    - don't orphan the skb on enqueue

    v4 -> v5:
    - replace the mem_lock with the receive queue spin lock
    - ensure that the bh is always allowed to enqueue at least
    a skb, even if sk_rcvbuf is exceeded

    v3 -> v4:
    - reworked memory accounting, simplifying the schema
    - provide a helper for both memory scheduling and enqueuing

    v1 -> v2:
    - use a udp-specific destructor to perform memory reclaiming
    - remove a couple of helpers, unneeded after the above cleanup
    - do not reclaim memory on dequeue if not under memory
    pressure
    - reworked the fwd accounting schema to avoid potential
    integer overflow

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
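
    A toy model of the enqueue side (not kernel code): memory check,
    charge, and append happen under one queue lock, with no socket lock
    and no backlog.

        #include <pthread.h>
        #include <stddef.h>

        struct pkt { struct pkt *next; int truesize; };
        struct rxq {
            pthread_mutex_t lock;              /* the kernel uses a plain
                                                * spin_lock() here, cheaper
                                                * than spin_lock_irqsave() */
            struct pkt *head;
            int rmem;                          /* charged receive memory */
        };

        static int toy_enqueue(struct rxq *q, struct pkt *p, int rcvbuf)
        {
            int ret = 0;

            pthread_mutex_lock(&q->lock);
            if (q->rmem + p->truesize > rcvbuf) {
                ret = -1;                      /* over rcvbuf: drop */
            } else {
                q->rmem += p->truesize;        /* accounting under the lock */
                p->next = q->head;             /* toy list, LIFO for brevity */
                q->head = p;
            }
            pthread_mutex_unlock(&q->lock);
            return ret;
        }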
     

21 Oct, 2016

1 commit

  • Baozeng Ding reported KASAN traces showing uses after free in
    udp_lib_get_port() and other related UDP functions.

    A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

    I could write a reproducer with two threads doing :

    static int sock_fd;

    static void *thr1(void *arg)
    {
        for (;;) {
            connect(sock_fd, (const struct sockaddr *)arg,
                    sizeof(struct sockaddr_in));
        }
    }

    static void *thr2(void *arg)
    {
        struct sockaddr_in unspec;

        for (;;) {
            memset(&unspec, 0, sizeof(unspec));
            connect(sock_fd, (const struct sockaddr *)&unspec,
                    sizeof(unspec));
        }
    }

    Problem is that udp_disconnect() could run without holding socket lock,
    and this was causing list corruptions.

    Signed-off-by: Eric Dumazet
    Reported-by: Baozeng Ding
    Signed-off-by: David S. Miller

    Eric Dumazet
     


24 Aug, 2016

4 commits

  • Since we no longer use SLAB_DESTROY_BY_RCU for UDP,
    we do not need sk_prot_clear_portaddr_nulls() helper.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This implements SOCK_DESTROY for UDP sockets similar to what was done
    for TCP with commit c1e64e298b8ca ("net: diag: Support destroying TCP
    sockets.") A process with a UDP socket targeted for destroy is awakened
    and recvmsg fails with ECONNABORTED.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • After commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    we do not need this special allocation mode anymore, even if it is
    harmless.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Laura tracked a poll() [and friends] regression caused by commit
    e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

    udp_poll() needs to know if there is a valid packet in the receive
    queue, even if its payload length is 0.

    Change first_packet_length() to return a signed int, and use -1
    as the indication of an empty queue.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Laura Abbott
    Signed-off-by: Eric Dumazet
    Tested-by: Laura Abbott
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Aug, 2016

1 commit

  • Include ipv4_rcv_saddr_equal() definition to avoid this sparse error :

    net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not
    declared. Should it be static?

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jul, 2016

1 commit

  • After a612769774a3 ("udp: prevent bugcheck if filter truncates packet
    too much"), there followed various other fixes for similar cases such
    as f4979fcea7fd ("rose: limit sk_filter trim to payload").

    The latter introduced a new helper, sk_filter_trim_cap(), where we can
    pass the trim limit directly to the socket filter handling. Make use of
    it here as well, with sizeof(struct udphdr) as the lower cap limit, and
    drop the extra skb->len test in UDP's input path.

    Signed-off-by: Daniel Borkmann
    Cc: Willem de Bruijn
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
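
    For reference, a hedged sketch of the resulting call shape in the UDP
    input path (kernel-internal fragment, simplified):

        /* in udp_queue_rcv_skb()-like context: let the socket filter run,
         * but never let it trim the skb below the UDP header. */
        if (sk_filter_trim_cap(sk, skb, sizeof(struct udphdr)))
            goto drop;   /* filter dropped the packet */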
     

12 Jul, 2016

1 commit

  • If a socket filter truncates a UDP packet below the length of the UDP
    header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger
    a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash
    if the kernel is configured that way) can easily be triggered by an
    unprivileged user, which was reported as CVE-2016-6162. For a
    reproducer, see http://seclists.org/oss-sec/2016/q3/8

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Marco Grassi
    Signed-off-by: Michal Kubecek
    Acked-by: Eric Dumazet
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Michal Kubeček
     

15 Jun, 2016

1 commit

  • There is a corner case in which udp packets belonging to a same
    flow are hashed to different socket when hslot->count changes from 10
    to 11:

    1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
    and always passes 'daddr' to udp_ehashfn().

    2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
    but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
    INADDR_ANY instead of some specific addr.

    That means when hslot->count changes from 10 to 11, the hash calculated by
    udp_ehashfn() is also changed, and the udp packets belonging to a same
    flow will be hashed to different socket.

    This is easily reproduced:
    1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
    2) From the same host send udp packets to 127.0.0.1:40000, record the
    socket index which receives the packets.
    3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
    is 40000 + UDP_HASH_SIZE(4096); this makes the new socket go into the
    same hslot as the aforementioned 10 sockets, and makes the hslot->count
    change from 10 to 11.
    4) From the same host send udp packets to 127.0.0.1:40000, and the socket
    index which receives the packets will be different from the one received
    in step 2.
    This should not happen as the socket bound to 0.0.0.0:44096 should not
    change the behavior of the sockets bound to 0.0.0.0:40000.

    It's the same case for IPv6, and this patch also fixes that.

    Signed-off-by: Su, Xuemin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Su, Xuemin
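
    A minimal sketch of the reproducer described above (hypothetical
    program; ports follow the steps in the message):

        #include <stdio.h>
        #include <string.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        static int bind_udp(unsigned short port)
        {
            struct sockaddr_in a;
            int one = 1, fd = socket(AF_INET, SOCK_DGRAM, 0);

            /* SO_REUSEPORT lets many sockets share 0.0.0.0:40000 */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            memset(&a, 0, sizeof(a));
            a.sin_family = AF_INET;
            a.sin_port = htons(port);
            a.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(fd, (const struct sockaddr *)&a, sizeof(a)) < 0)
                perror("bind");
            return fd;
        }

        int main(void)
        {
            int i;

            for (i = 0; i < 10; i++)       /* step 1: hslot->count = 10 */
                bind_udp(40000);
            bind_udp(40000 + 4096);        /* step 3: same hslot, count 11 */
            /* step 4: observe which socket now receives packets sent to
             * 127.0.0.1:40000 */
            return 0;
        }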