15 Dec, 2015

4 commits

  • [ Upstream commit 45f6fad84cc305103b28d73482b344d7f5b76f39 ]

    This patch addresses multiple problems :

    UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
    while socket is not locked : Other threads can change np->opt
    concurrently. Dmitry posted a syzkaller
    (http://github.com/google/syzkaller) program demonstrating
    a use-after-free.

    Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
    and dccp_v6_request_recv_sock() also need to use RCU protection
    to dereference np->opt once (before calling ipv6_dup_options())

    This patch adds full RCU protection to np->opt

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 264640fc2c5f4f913db5c73fa3eb1ead2c45e9d7 ]

    If a fragmented multicast packet is received on an ethernet device which
    has an active macvlan on top of it, each fragment is duplicated and
    received both on the underlying device and the macvlan. If some
    fragments for macvlan are processed before the whole packet for the
    underlying device is reassembled, the "overlapping fragments" test in
    ip6_frag_queue() discards the whole fragment queue.

    To resolve this, add device ifindex to the search key and require it to
    match reassembling multicast packets and packets to link-local
    addresses.
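
    A minimal user-space sketch of the lookup-key idea described above, not
    the kernel code: the names FragKey and frag_key are hypothetical, and
    the use of 0 for "any interface" is an assumption of this sketch.

```python
import ipaddress
from typing import NamedTuple


class FragKey(NamedTuple):
    ident: int
    src: str
    dst: str
    ifindex: int  # 0 means "not scoped to a device" in this sketch


def frag_key(ident, src, dst, ifindex):
    """Build a reassembly key; include the device only when the
    destination is multicast or link-local."""
    addr = ipaddress.IPv6Address(dst)
    scoped = addr.is_multicast or addr.is_link_local
    return FragKey(ident, src, dst, ifindex if scoped else 0)
```

    With such a key, copies of the same multicast fragment seen on the
    underlying device and on the macvlan land in different queues, while
    global unicast traffic keeps a device-independent key.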

    Note: a similar patch was already submitted by Yoshifuji Hideaki in

    http://patchwork.ozlabs.org/patch/220979/

    but got lost and forgotten for some reason.

    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Kubeček
     
  • [ Upstream commit 4c6980462f32b4f282c5d8e5f7ea8070e2937725 ]

    Similar to ipv4, when destroying an mrt table the static mfc entries and
    the static devices are kept, which leads to devices that can never be
    destroyed (because of refcnt taken) and leaked memory. Make sure that
    everything is cleaned up on netns destruction.

    Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
    CC: Benjamin Thery
    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Aleksandrov
     
  • [ Upstream commit 41033f029e393a64e81966cbe34d66c6cf8a2e7e ]

    The OUTMCAST stat is double incremented, getting bumped once in the mcast code
    itself, and again in the common ip output path. Remove the mcast bump, as it's
    not needed.

    Validated by the reporter, with good results

    Signed-off-by: Neil Horman
    Reported-by: Claus Jensen
    CC: Claus Jensen
    CC: David Miller
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neil Horman
     

10 Dec, 2015

2 commits

  • [ Upstream commit 2a189f9e57650e9f310ddf4aad75d66c1233a064 ]

    In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up
    the dev_snmp6 entry that we have already registered for this device.
    Call snmp6_unregister_dev in this case.

    Fixes: a317a2f19da7d ("ipv6: fail early when creating netdev named all or default")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit 4ece9009774596ee3df0acba65a324b7ea79387c ]

    sit0 device allocates its percpu storage twice :
    - One time in ipip6_tunnel_init()
    - One time in ipip6_fb_tunnel_init()

    Thus we leak 48 bytes per possible cpu per network namespace dismantle.

    ipip6_fb_tunnel_init() can be much simpler and does not
    return an error, and should be called after register_netdev()

    Note that ipip6_tunnel_clone_6rd() also needs to be called
    after register_netdev() (calling ipip6_tunnel_init())

    Fixes: ebe084aafb7e ("sit: Use ipip6_tunnel_init as the ndo_init function.")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Steffen Klassert
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

03 Oct, 2015

4 commits

  • [ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ]

    Problem:
    The ECMP route replace support for IPv6 in the kernel deletes the
    existing ECMP route too early, i.e. when it installs the first nexthop.
    If there is an error in installing the subsequent nexthops, it is too late
    to recover the already deleted existing route, leaving the FIB
    in an inconsistent state.

    This patch reduces the possibility of this by doing the following:
    a) Changes the existing multipath route add code to a two stage process:
    build rt6_infos + insert them
    ip6_route_add rt6_info creation code is moved into
    ip6_route_info_create.
    b) This ensures that most errors are caught during building rt6_infos
    and we fail early
    c) Separates multipath add and del code. Because add needs the special
    two stage mode in a) and delete essentially does not care.
    d) In any event if the code fails during inserting a route again, a
    warning is printed (This should be unlikely)
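
    The two-stage idea in a) and b) can be sketched in user space as
    follows. This is an illustration only: replace_multipath and the dict
    used as a routing table are hypothetical, and stage 2 is reduced to a
    single assignment that cannot fail, unlike the real insert step in d).

```python
def replace_multipath(table, prefix, nexthops):
    """Stage 1 builds and validates every nexthop; stage 2 installs them.

    `table` maps a prefix to its list of nexthops (a stand-in for the FIB).
    """
    built = []
    for nh in nexthops:
        if nh in built:
            # Fail early: the table has not been modified yet.
            raise FileExistsError(f"duplicate nexthop {nh}")
        built.append(nh)
    # Only now is the existing route replaced.
    table[prefix] = built
    return table
```

    Because validation finishes before the old route is touched, a duplicate
    nexthop leaves the previous ECMP route in place.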

    Before the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    /* previously added ecmp route 3000:1000:1000:1000::2 disappears from
    * kernel */

    After the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Roopa Prabhu
    Reviewed-by: Nikolay Aleksandrov
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roopa Prabhu
     
  • [ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

    In the IPv6 multicast routing code the mrt_lock was not being released
    correctly in the MFC iterator, as a result adding or deleting a MIF would
    cause a hang because the mrt_lock could not be acquired.

    This fix is a copy of the code for the IPv4 case and ensures that the lock
    is released correctly.

    Signed-off-by: Richard Laing
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Richard Laing
     
  • [ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

    We previously registered IPPROTO_ROUTING offload under inet6_add_offload(),
    but in error path, we try to unregister it with inet_del_offload(). This
    doesn't seem correct, it should actually be inet6_del_offload(), also
    ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
    also uses rthdr_offload twice), but it got removed entirely later on.

    Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

    When a tunnel is deleted, the cached dst entry should be released.

    This problem may prevent the removal of a netns (seen with a x-netns IPv6
    gre tunnel):
    unregister_netdevice: waiting for lo to become free. Usage count = 3

    CC: Dmitry Kozlov
    Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
    Signed-off-by: huaibin Wang
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    huaibin Wang
     

30 Sep, 2015

4 commits

  • [ Upstream commit 3257d8b12f954c462d29de6201664a846328a522 ]

    In commit b357a364c57c9 ("inet: fix possible panic in
    reqsk_queue_unlink()"), I missed the fact that tcp_check_req()
    can return the listener socket in one case, and that we must
    release the request socket refcount or we leak it.

    Tested:

    Following packetdrill test template shows the issue

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 < S 0:0(0) win 2920
    +0 > S. 0:0(0) ack 1
    +.002 < . 1:1(0) ack 21 win 2920
    +0 > R 21:21(0)

    Fixes: b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit fdbf5b097bbd9693a86c0b8bfdd071a9a2117cfc ]

    This patch reverts 19424e052fb44da2f00d1a868cbb51f3e9f4bbb5 ("sit:
    Add gro callbacks to sit_offload") because it generates packets
    that cannot be handled even by our own GSO.

    Reported-by: Wolfgang Walter
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 03645a11a570d52e70631838cb786eb4253eb463 ]

    ip6_datagram_connect() is doing a lot of socket changes without
    socket being locked.

    This looks wrong, at least for udp_lib_rehash() which could corrupt
    lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.

    Signed-off-by: Eric Dumazet
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 4c938d22c88a9ddccc8c55a85e0430e9c62b1ac5 ]

    Before commit daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it
    from ip6_mc_input().") MLD packets were only processed locally. After the
    change, a copy of MLD packet goes through ip6_mr_input, causing
    MRT6MSG_NOCACHE message to be generated to user space.

    Make MLD packets be processed only locally.

    Fixes: daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().")
    Signed-off-by: Hermin Anggawijaya
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Angga
     

11 Jul, 2015

1 commit

  • [ Upstream commit 34b99df4e6256ddafb663c6de0711dceceddfe0e ]

    ICMP messages can trigger ICMP and local errors. In this case
    serr->port is 0 and starting from Linux 4.0 we do not return
    the original target address to the error queue readers.
    Add function to define which errors provide addr_offset.
    With this fix my ping command is not silent anymore.

    Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling")
    Signed-off-by: Julian Anastasov
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     

11 Jun, 2015

1 commit


09 Jun, 2015

2 commits

  • UDP encapsulation is broken on IPv6. This is because the logic to resubmit
    the nexthdr is inverted, checking for a ret value > 0 instead of < 0. Also,
    the resubmit label is in the wrong position since we already get the
    nexthdr value when performing decapsulation. In addition the skb pull is no
    longer necessary either.

    This changes the return value check to look for < 0, using it for the
    nexthdr on the next iteration, and moves the resubmit label to the proper
    location.
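
    A rough user-space model of the corrected control flow, assuming
    handlers are plain callables keyed by protocol number (the deliver
    function and the handlers dict are hypothetical, not kernel
    interfaces):

```python
def deliver(nexthdr, handlers, packet):
    """Dispatch to a protocol handler, resubmitting on a negative return.

    A handler returning ret < 0 signals "decapsulated; process the inner
    protocol -ret next". Any other return value ends processing.
    """
    while True:
        handler = handlers.get(nexthdr)
        if handler is None:
            return ("no_handler", nexthdr)
        ret = handler(packet)
        if ret < 0:
            # Use the handler's answer as nexthdr for the next iteration.
            nexthdr = -ret
            continue
        return ("delivered", nexthdr)
```

    Checking for ret > 0 instead would never take the resubmit branch for
    a handler that follows this convention, which is the inversion the
    message describes.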

    With these changes the v6 code now matches what we do in the v4 ip input
    code wrt resubmitting when decapsulating.

    Signed-off-by: Josh Hunt
    Acked-by: "Tom Herbert"
    Signed-off-by: David S. Miller

    Josh Hunt
     
  • The memory pointed to by idev->stats.icmpv6msgdev,
    idev->stats.icmpv6dev and idev->stats.ipv6 can each be used in an RCU
    read context without taking a reference on idev. For example, through
    IP6_*_STATS_* calls in ip6_rcv. These memory blocks are freed without
    waiting for an RCU grace period to elapse. This could lead to the
    memory being written to after it has been freed.

    Fix this by using call_rcu to free the memory used for stats, as well
    as idev after an RCU grace period has elapsed.

    Signed-off-by: Robert Shearman
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Robert Shearman
     

02 Jun, 2015

1 commit

    We currently rely on the PMTU discovery of xfrm.
    However, if a packet is locally sent, the PMTU mechanism
    of xfrm tries to do local socket notification, which
    might not work for applications like ping that don't
    check for this. So add PMTU handling to vti6_xmit to
    report MTU changes immediately.

    Signed-off-by: Steffen Klassert
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Steffen Klassert
     

01 Jun, 2015

1 commit

  • We have two problems in UDP stack related to bogus checksums :

    1) We return -EAGAIN to application even if receive queue is not empty.
    This breaks applications using edge trigger epoll()

    2) Under UDP flood, we can loop forever without yielding to other
    processes, potentially hanging the host, especially on non SMP.

    This patch is an attempt to make things better.

    We might in the future add extra support for rt applications
    wanting to better control time spent doing a recv() in a hostile
    environment. For example we could validate checksums before queuing
    packets in socket receive queue.

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 May, 2015

1 commit

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2015-05-28

    1) Fix a race in xfrm_state_lookup_byspi, we need to take
    the refcount before we release xfrm_state_lock.
    From Li RongQing.

    2) Fix IV generation on ESN state. We used just the
    low order sequence numbers for IV generation on
    ESN, as a result the IV can repeat on the same
    state. Fix this by using the high order sequence
    number bits too and make sure to always initialize
    the high order bits with zero. These patches are
    serious stable candidates. Fixes from Herbert Xu.

    3) Fix the skb->mark handling on vti. We don't
    reset skb->mark in skb_scrub_packet anymore,
    so vti must care to restore the original
    value back after it was used to lookup the
    vti policy and state. Fixes from Alexander Duyck.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 May, 2015

2 commits

  • The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified
    after completing the function. This resulted in the original skb->mark
    value being lost. Since we only need skb->mark to be set for
    xfrm_policy_check we can pull the assignment into the rcv_cb calls and then
    just restore the original mark after xfrm_policy_check has been completed.
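
    The save/set/restore pattern can be sketched as below. The skb is
    modelled as a dict and policy_check as a caller-supplied callable;
    both are assumptions of this sketch rather than kernel types.

```python
def rcv_cb(skb, tunnel_mark, policy_check):
    """Run the policy check with the tunnel's mark, then restore the
    packet's original mark regardless of the outcome."""
    orig_mark = skb["mark"]
    skb["mark"] = tunnel_mark
    try:
        return policy_check(skb)
    finally:
        skb["mark"] = orig_mark
```

    The mark is only visible to the check itself; callers further down the
    path see the value the packet arrived with.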

    Signed-off-by: Alexander Duyck
    Signed-off-by: Steffen Klassert

    Alexander Duyck
     
  • Instead of modifying skb->mark we can simply modify the flowi_mark that is
    generated as a result of the xfrm_decode_session. By doing this we don't
    need to actually touch the skb->mark and it can be preserved as it passes
    out through the tunnel.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Steffen Klassert

    Alexander Duyck
     

23 May, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for your net tree, they are:

    1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
    This problem is due to wrong order in the per-net registration and netlink
    socket events. Patch from Francesco Ruggeri.

    2) Make sure that counters that userspace passes us are higher than 0 in all the
    x_tables frontends. Discovered via Trinity, patch from Dave Jones.

    3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
    breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

21 May, 2015

2 commits

  • When replacing an IPv6 multipath route with "ip route replace", i.e.
    NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
    matching route without fixing its siblings, resulting in corrupted
    siblings linked list; removing one of the siblings can then end in an
    infinite loop.

    IPv6 ECMP implementation is a bit different from IPv4 so that route
    replacement cannot work in exactly the same way. This should be a
    reasonable approximation:

    1. If the new route is ECMP-able and there is a matching ECMP-able one
    already, replace it and all its siblings (if any).

    2. If the new route is ECMP-able and no matching ECMP-able route exists,
    replace first matching non-ECMP-able (if any) or just add the new one.

    3. If the new route is not ECMP-able, replace first matching
    non-ECMP-able route (if any) or add the new route.

    We also need to remove the NLM_F_REPLACE flag after replacing old
    route(s) by first nexthop of an ECMP route so that each subsequent
    nexthop does not replace previous one.

    Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
    Signed-off-by: Michal Kubecek
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • If adding a nexthop of an IPv6 multipath route fails, comment in
    ip6_route_multipath() says we are going to delete all nexthops already
    added. However, current implementation deletes even the routes it
    hasn't even tried to add yet. For example, running

    ip route add 1234:5678::/64 \
    nexthop via fe80::aa dev dummy1 \
    nexthop via fe80::bb dev dummy1 \
    nexthop via fe80::cc dev dummy1

    twice results in removing all routes the first command added.

    Limit the second (delete) run to nexthops that succeeded in the first
    (add) run.
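
    A small model of the corrected rollback, assuming the route table is a
    Python set of nexthop names and that adding an existing nexthop is the
    only failure mode (add_multipath is a hypothetical helper):

```python
def add_multipath(table, nexthops):
    """Add nexthops in order; on failure remove only those added by
    this call, never entries that were already present."""
    added = []
    for nh in nexthops:
        if nh in table:
            # Roll back just the nexthops this run managed to add.
            for done in added:
                table.remove(done)
            raise FileExistsError(nh)
        table.add(nh)
        added.append(nh)
```

    Running the same add twice now fails on the first nexthop with nothing
    to roll back, so the routes from the first run survive.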

    Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
    Signed-off-by: Michal Kubecek
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Michal Kubeček
     

20 May, 2015

2 commits

  • After improving setsockopt() coverage in trinity, I started triggering
    vmalloc failures pretty reliably from this code path:

    warn_alloc_failed+0xe9/0x140
    __vmalloc_node_range+0x1be/0x270
    vzalloc+0x4b/0x50
    __do_replace+0x52/0x260 [ip_tables]
    do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
    nf_setsockopt+0x65/0x90
    ip_setsockopt+0x61/0xa0
    raw_setsockopt+0x16/0x60
    sock_common_setsockopt+0x14/0x20
    SyS_setsockopt+0x71/0xd0

    It turns out we don't validate that the num_counters field in the
    struct we pass in from userspace is initialized.

    The same problem also exists in ebtables, arptables, ipv6, and the
    compat variants.
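
    As an illustration of the kind of check involved, the sketch below
    rejects a zero counter count before allocating. do_replace here is a
    hypothetical user-space stand-in, and returning a negative errno
    mirrors kernel convention only by analogy.

```python
import errno


def do_replace(num_counters):
    """Validate the user-supplied counter count before allocating."""
    if num_counters == 0:
        # Nothing sensible to allocate; refuse instead of trying.
        return -errno.EINVAL
    counters = [0] * num_counters
    return len(counters)
```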

    Signed-off-by: Dave Jones
    Signed-off-by: Pablo Neira Ayuso

    Dave Jones
     
    Commit 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
    simplified the filter for incoming IPv6 multicast but removed
    the check of the local socket address and the UDP destination
    address.

    This patch restores the filter to prevent sockets bound to an IPv6
    multicast address from receiving other UDP traffic, like unicast.

    Signed-off-by: Henning Rogge
    Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
    Cc: "David S. Miller"
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Henning Rogge
     

18 May, 2015

1 commit

  • commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages
    send from TIME_WAIT") added the flow label in the last TCP packets.
    Unfortunately, it was not cast properly.

    This patch replaces the buggy shift with be32_to_cpu/cpu_to_be32.
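
    What an explicit byte-order conversion buys can be shown in user space.
    The helpers below are hypothetical; they treat the first 32-bit word of
    the IPv6 header as four big-endian bytes and keep the low 20 bits as
    the flow label.

```python
FLOWLABEL_MASK = 0x000FFFFF  # low 20 bits of the first IPv6 header word


def flowlabel_from_wire(word_be: bytes) -> int:
    """Network-order word -> host integer flow label (like be32_to_cpu)."""
    return int.from_bytes(word_be, "big") & FLOWLABEL_MASK


def flowlabel_to_wire(label: int) -> bytes:
    """Host integer flow label -> network-order word (like cpu_to_be32)."""
    return (label & FLOWLABEL_MASK).to_bytes(4, "big")
```

    Converting explicitly gives the same label on little-endian and
    big-endian hosts, which a plain shift on the raw word does not.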

    Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages")
    Reported-by: Eric Dumazet
    Signed-off-by: Florent Fourcot
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florent Fourcot
     

15 May, 2015

1 commit

    It was reported that traceroute6 would cause
    the kernel to crash when trying to compute checksums
    on raw UDP packets. The cause was the check in
    __ip6_append_data that would attempt to use
    partial checksums on the packet. However,
    raw sockets do not initialize partial checksum
    fields so partial checksums can't be used.

    Solve this the same way IPv4 does it. raw sockets
    pass transhdrlen value of 0 to ip_append_data which
    causes the checksum to be computed in software. Use
    the same check in ip6_append_data (check transhdrlen).
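
    Reduced to its essence, the check looks like the sketch below. The
    function name and the single boolean for device capability are
    assumptions; the real condition involves further tests that are
    omitted here.

```python
def can_use_partial_csum(transhdrlen, dev_has_csum_offload):
    """Partial (offloaded) checksums need a transport header set up by
    the caller; raw sockets pass transhdrlen == 0 and therefore fall
    back to software checksumming."""
    return transhdrlen > 0 and dev_has_csum_offload
```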

    Reported-by: Wolfgang Walter
    CC: Wolfgang Walter
    CC: Eric Dumazet
    Signed-off-by: Vladislav Yasevich
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

13 May, 2015

1 commit


10 May, 2015

1 commit

  • If there are only IPv6 source specific default routes present, the
    host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail
    calls ip6_route_output first, and given source address any, it fails,
    and ip6_route_get_saddr is never called.

    The change is to use ip6_route_get_saddr even if the initial
    ip6_route_output fails, and then to do ip6_route_output _again_ after
    an appropriate source address is available.
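
    The retry can be modelled as follows, with route_output and get_saddr
    passed in as plain callables. Their signatures are invented for this
    sketch and do not match the kernel functions of similar name.

```python
import errno


def dst_lookup(route_output, get_saddr, daddr, saddr=None):
    """Look up a route; if no source address was given, pick one and
    retry the lookup when the first attempt found nothing."""
    route = route_output(daddr, saddr)
    if saddr is None:
        saddr = get_saddr(route, daddr)
        if route is None and saddr is not None:
            # Second lookup, now with a concrete source address.
            route = route_output(daddr, saddr)
    if route is None:
        raise OSError(errno.ENETUNREACH, "Network is unreachable")
    return route, saddr
```

    With only source-specific default routes installed, the first lookup
    (source "any") fails and the second one succeeds.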

    Note that this is a '99% fix' to the problem; a correct fix would be to
    do route lookups only within addrconf.c when picking a source address,
    and never call ip6_route_output before source address has been
    populated.

    Signed-off-by: Markus Stenberg
    Signed-off-by: David S. Miller

    Markus Stenberg
     

24 Apr, 2015

1 commit

  • [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
    0000000000000080
    [ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

    There is a race when reqsk_timer_handler() and tcp_check_req() call
    inet_csk_reqsk_queue_unlink() on the same req at the same time.

    Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
    timer"), listener spinlock was held and race could not happen.

    To solve this bug, we change reqsk_queue_unlink() to not assume req
    must be found, and we return a status, to conditionally release a
    refcount on the request sock.
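
    A single-threaded sketch of the "unlink reports whether it did
    anything" pattern. The list-based queue and the refcnt dict are
    stand-ins chosen for illustration; the real code also deals with
    locking, which is omitted.

```python
def reqsk_queue_unlink(queue, req):
    """Remove req if it is still queued; report whether this caller
    was the one that removed it."""
    if req in queue:
        queue.remove(req)
        return True
    return False


def reqsk_queue_drop(queue, req, refcnt):
    """Release the queue's reference only if we actually unlinked req."""
    if reqsk_queue_unlink(queue, req):
        refcnt[req] -= 1
```

    When two paths both try to drop the same request, only the first one
    finds it in the queue, so the reference is released exactly once.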

    This also means tcp_check_req() in the non-fastopen case might or might not
    consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
    to properly handle this.

    (Same remark for dccp_check_req() and its callers)

    inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
    called 4 times in tcp and 3 times in dccp.

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2015

1 commit


15 Apr, 2015

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    A final pull request, I know it's very late but this time I think it's worth a
    bit of rush.

    The following patchset contains Netfilter/nf_tables updates for net-next, more
    specifically concatenation support and dynamic stateful expression
    instantiation.

    This also comes with a couple of small patches. One to fix the ebtables.h
    userspace header and another to get rid of an obsolete example file in tree
    that describes a nf_tables expression.

    This time, I decided to paste the original descriptions. This will result in a
    rather large commit description, but I think these bytes are worth keeping.

    Patrick McHardy says:

    ====================
    netfilter: nf_tables: concatenation support

    The following patches add support for concatenations, which allow multi
    dimensional exact matches in O(1).

    The basic idea is to split the data registers, currently consisting of
    4 registers of 16 bytes each, into smaller units, 16 registers of 4
    bytes each, and making sure each register store always leaves the
    full 32 bit in a well defined state, meaning smaller stores will
    zero the remaining bits.

    Based on that, we can load multiple adjacent registers with different
    values, thereby building a concatenated bigger value, and use that
    value for set lookups.
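
    The register layout and concatenation can be illustrated with a toy
    model. The helpers store and concat_key are hypothetical; registers
    are modelled as a list of 16 four-byte values, and short stores are
    zero-padded to a full 32-bit word as described above.

```python
NUM_REGS = 16   # 16 registers of 4 bytes each
REG_SIZE = 4


def store(regs, idx, data: bytes):
    """Store data starting at register idx, zero-padding the last word.
    Returns the index of the next free register."""
    nwords = (len(data) + REG_SIZE - 1) // REG_SIZE
    if idx + nwords > len(regs):
        raise ValueError("store exceeds register file")
    padded = data.ljust(nwords * REG_SIZE, b"\0")
    for i in range(nwords):
        regs[idx + i] = padded[REG_SIZE * i:REG_SIZE * (i + 1)]
    return idx + nwords


def concat_key(regs, start, nwords):
    """Read adjacent registers as one concatenated lookup key."""
    return b"".join(regs[start:start + nwords])
```

    Because every store leaves whole words in a defined state, two values
    loaded into adjacent registers form a stable key that can be used for
    an exact-match set lookup.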

    Sets are changed to use variable sized extensions for their key and
    data values, removing the fixed limit of 16 bytes while saving memory
    if less space is needed.

    As a side effect, these patches will allow some nice optimizations in
    the future, like using jhash2 in nft_hash, removing the masking in
    nft_cmp_fast, optimized data comparison using 32 bit word size etc.
    These are not done so far however.

    The patches are split up as follows:

    * the first five patches add length validation to register loads and
    stores to make sure we stay within bounds and prepare the validation
    functions for the new addressing mode

    * the next patches prepare for changing to 32 bit addressing by
    introducing a struct nft_regs, which holds the verdict register as
    well as the data registers. The verdict members are moved to a new
    struct nft_verdict to allow to pull struct nft_data out of the stack.

    * the next patches contain preparatory conversions of expressions and
    sets to use 32 bit addressing

    * the next patch introduces so far unused register conversion helpers
    for parsing and dumping register numbers over netlink

    * following is the real conversion to 32 bit addressing, consisting of
    replacing struct nft_data in struct nft_regs by an array of u32s and
    actually translating and validating the new register numbers.

    * the final two patches add support for variable sized data items and
    variable sized keys / data in set elements

    The patches have been verified to work correctly with nft binaries using
    both old and new addressing.
    ====================

    Patrick McHardy says:

    ====================
    netfilter: nf_tables: dynamic stateful expression instantiation

    The following patches are the grand finale of my nf_tables set work,
    using all the building blocks put in place by the previous patches
    to support something like iptables hashlimit, but a lot more powerful.

    Sets are extended to allow attaching expressions to set elements.
    The dynset expression dynamically instantiates these expressions
    based on a template when creating new set elements and evaluates
    them for all new or updated set members.

    In combination with concatenations this effectively creates state
    tables for arbitrary combinations of keys, using the existing
    expression types to maintain that state. Regular set GC takes care
    of purging expired states.

    We currently support two different stateful expressions, counter
    and limit. Using limit as a template we can express the functionality
    of hashlimit, but completely unrestricted in the combination of keys.
    Using counter we can perform accounting for arbitrary flows.

    The following examples from patch 5/5 show some possibilities.
    Userspace syntax is still WIP, especially the listing of state
    tables will most likely be separated from normal set listings
    and use a more structured format:

    1. Limit the rate of new SSH connections per host, similar to iptables
    hashlimit:

    flow ip saddr timeout 60s \
    limit 10/second \
    accept

    2. Account network traffic between each set of /24 networks:

    flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
    counter

    3. Account traffic to each host per user:

    flow skuid . ip daddr \
    counter

    4. Account traffic for each combination of source address and TCP flags:

    flow ip saddr . tcp flags \
    counter

    The resulting set content after a Xmas-scan looks like this:

    {
    192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
    192.168.122.1 . ack : counter packets 74 bytes 3848,
    192.168.122.1 . psh | ack : counter packets 35 bytes 3144
    }

    In the future the "expressions attached to elements" will be extended
    to also support user created non-stateful expressions to allow
    efficient selection between a set of parameter sets, e.g. a set of log
    statements with different prefixes based on the interface, which currently
    require one rule each. This will most likely have to wait until the next
    kernel version though.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The dwmac-socfpga.c conflict was a case of a bug fix overlapping
    changes in net-next to handle an error pointer differently.

    Signed-off-by: David S. Miller

    David S. Miller
     

14 Apr, 2015

1 commit

  • Using a timer wheel for timewait sockets was nice ~15 years ago when
    memory was expensive and machines had a single processor.

    This does not scale, code is ugly and source of huge latencies
    (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

    We can afford to use an extra 64 bytes per timewait sock and spread
    timewait load to all cpus to have better behavior.

    Tested:

    On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
    on the target (lpaa24)

    Before patch :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    419594

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    437171

    While test is running, we can observe 25 or even 33 ms latencies.

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
    rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
    rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

    After patch :

    About 90% increase of throughput :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    810442

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    800992

    And latencies are kept to minimal values during this load, even
    if network utilization is 90% higher :

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
    rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2015

2 commits

  • Switch the nf_tables registers from 128 bit addressing to 32 bit
    addressing to support so called concatenations, where multiple values
    can be concatenated over multiple registers for O(1) exact matches of
    multiple dimensions using sets.

    The old register values are mapped to areas of 128 bits for compatibility.
    When dumping register numbers, values are expressed using the old values
    if they refer to the beginning of a 128 bit area for compatibility.

    To support concatenations, register loads of less than a full 32 bit
    value need to be padded. This mainly affects the payload and exthdr
    expressions, which both unconditionally zero the last word before
    copying the data.

    Userspace fully passes the testsuite using both old and new register
    addressing.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • Replace the array of registers passed to expressions by a struct nft_regs,
    containing the verdict as a separate member, which aliases to the
    NFT_REG_VERDICT register.

    This is needed to separate the verdict from the data registers completely,
    so their size can be changed.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     

10 Apr, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next tree.
    They are:

    * nf_tables set timeout infrastructure from Patrick McHardy.

    1) Add set timeout support.

    2) Add support for set element timeouts using the new set extension
    infrastructure.

    3) Add garbage collection helper functions to get rid of stale elements.
    Elements are accumulated in a batch that is asynchronously released
    via RCU when the batch is full.

    4) Add garbage collection synchronization helpers. This introduces a new
    element busy bit to address concurrent access from the netlink API and the
    garbage collector.

    5) Add timeout support for the nft_hash set implementation. The garbage
    collector periodically checks for stale elements from the workqueue.

    * iptables/nftables cgroup fixes:

    6) Ignore non full-socket objects from the input path, otherwise cgroup
    match may crash, from Daniel Borkmann.

    7) Fix cgroup in nf_tables.

    8) Save some cycles from xt_socket by skipping packet header parsing when
    skb->sk is already set because of early demux. Also from Daniel.

    * br_netfilter updates from Florian Westphal.

    9) Save frag_max_size and restore it from the forward path too.

    10) Use a per-cpu area to restore the original source MAC address when traffic
    is DNAT'ed.

    11) Add helper functions to access physical devices.

    12) Use these new physdev helper function from xt_physdev.

    13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
    state information.

    14) Annotate original layer 2 protocol number in nf_bridge info, instead of
    using kludgy flags.

    15) Also annotate the pkttype mangling when the packet travels back and forth
    from the IP to the bridge layer, instead of using a flag.

    * More nf_tables set enhancement from Patrick:

    16) Fix possible usage of set variant that doesn't support timeouts.

    17) Avoid spurious "set is full" errors from Netlink API when there are pending
    stale elements scheduled to be released.

    18) Restrict loop checks to set maps.

    19) Add support for dynamic set updates from the packet path.

    20) Add support to store optional user data (eg. comments) per set element.

    BTW, I have also pulled net-next into nf-next to anticipate the conflict
    resolution between your okfn() signature changes and Florian's br_netfilter
    updates.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller