18 Feb, 2017

7 commits

  • commit bf99b4ded5f8a4767dbb9d180626f06c51f9881f upstream.

    Otherwise, RST packets generated by the TCP stack for non-existing
    sockets always have mark 0.
    The mark from the original packet is assigned to the netns_ipv4/6
    socket used to send the response so that it can get copied into the
    response skb when the socket sends it.

    Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
    Cc: Lorenzo Colitti
    Signed-off-by: Pau Espin Pedrol
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Pau Espin Pedrol
     
  • [ Upstream commit 9c8bb163ae784be4f79ae504e78c862806087c54 ]

    In function igmpv3/mld_add_delrec() we allocate pmc and put it in
    idev->mc_tomb, so we should free it when we don't need it in del_delrec().
    But I incorrectly removed kfree(pmc) in the latest two patches. Fix that now.

    Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...")
    Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...")
    Reported-by: Daniel Borkmann
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 73d2c6678e6c3af7e7a42b1e78cd0211782ade32 ]

    Andrey reported a kernel crash:

    general protection fault: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 2 PID: 3880 Comm: syz-executor1 Not tainted 4.10.0-rc6+ #124
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff880060048040 task.stack: ffff880069be8000
    RIP: 0010:ping_v4_push_pending_frames net/ipv4/ping.c:647 [inline]
    RIP: 0010:ping_v4_sendmsg+0x1acd/0x23f0 net/ipv4/ping.c:837
    RSP: 0018:ffff880069bef8b8 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: ffff880069befb90 RCX: 0000000000000000
    RDX: 0000000000000018 RSI: ffff880069befa30 RDI: 00000000000000c2
    RBP: ffff880069befbb8 R08: 0000000000000008 R09: 0000000000000000
    R10: 0000000000000002 R11: 0000000000000000 R12: ffff880069befab0
    R13: ffff88006c624a80 R14: ffff880069befa70 R15: 0000000000000000
    FS: 00007f6f7c716700(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004a6f28 CR3: 000000003a134000 CR4: 00000000000006e0
    Call Trace:
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
    sock_sendmsg_nosec net/socket.c:635 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:645
    SYSC_sendto+0x660/0x810 net/socket.c:1687
    SyS_sendto+0x40/0x50 net/socket.c:1655
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    This is because we are missing a NULL pointer check on the result of
    skb_peek() when the queue is empty. Other places already have the same check.

    Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    WANG Cong
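
The missing check described above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `peek` and `push_pending_frames` are hypothetical stand-ins for `skb_peek()` and `ping_v4_push_pending_frames()`, and the write queue is modeled as a plain `deque`.

```python
from collections import deque


def peek(queue):
    """Return the first queued item without removing it, or None if empty."""
    return queue[0] if queue else None


def push_pending_frames(queue):
    """Return the size of the head buffer, or 0 when nothing is queued."""
    head = peek(queue)
    if head is None:
        # The guard the commit message describes: without it the caller
        # would go on to use a head buffer that does not exist.
        return 0
    return len(head)
```

The point of the sketch is only that the empty-queue case has to be handled before the head buffer is touched.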
     
  • [ Upstream commit ccf7abb93af09ad0868ae9033d1ca8108bdaec82 ]

    Splicing from TCP socket is vulnerable when a packet with URG flag is
    received and stored into receive queue.

    __tcp_splice_read() returns 0, and sk_wait_data() immediately
    returns since there is the problematic skb in queue.

    This is a nice way to burn cpu (aka infinite loop) and trigger
    soft lockups.

    Again, this gem was found by syzkaller tool.

    Fixes: 9c55e01c0cc8 ("[TCP]: Splice receive support.")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Willy Tarreau
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit d71b7896886345c53ef1d84bda2bc758554f5d61 ]

    syzkaller found another out of bound access in ip_options_compile(),
    or more exactly in cipso_v4_validate()

    Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
    Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Paul Moore
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 34b2cef20f19c87999fff3da4071e66937db9644 ]

    Andrey Konovalov got crashes in __ip_options_echo() when a NULL skb->dst
    is accessed.

    ipv4_pktinfo_prepare() should not drop the dst if (evil) IP options
    are present.

    We could refine the test to the presence of ts_needtime or srr,
    but IP options are not often used, so let's be conservative.

    Thanks to syzkaller team for finding this bug.

    Fixes: d826eb14ecef ("ipv4: PKTINFO doesnt need dst reference")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 06425c308b92eaf60767bc71d359f4cbc7a561f8 ]

    The syzkaller fuzzer was able to trigger a divide by zero when
    TCP window scaling is not enabled.

    SO_RCVBUF can be used not only to increase sk_rcvbuf, but also
    to decrease it below the current receive buffer utilization.

    If mss is negative or 0, just return a zero TCP window.

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
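
A rough model of the guard described above, assuming the window is rounded down to a multiple of the MSS when window scaling is off. The function name `select_window` and its parameters are hypothetical simplifications, not the kernel's actual code.

```python
def select_window(free_space, mss):
    """Round the advertised window down to a multiple of mss.

    Hypothetical simplification of the no-window-scaling path.
    """
    if mss <= 0:
        # Guard from the commit message: a shrunken receive buffer can
        # leave mss at zero or below, so advertise a zero window instead
        # of dividing by it.
        return 0
    return (free_space // mss) * mss
```

For example, `select_window(10000, 1460)` rounds down to six full segments, while a zero or negative `mss` yields a zero window.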
     

04 Feb, 2017

6 commits

  • [ Upstream commit 88ff7334f25909802140e690c0e16433e485b0a0 ]

    Modules implementing lwtunnel ops should not be allowed to unload
    while there is state alive using those ops, so specify the owning
    module for all lwtunnel ops.

    Signed-off-by: Robert Shearman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Robert Shearman
     
  • [ Upstream commit 0dbd7ff3ac5017a46033a9d0a87a8267d69119d9 ]

    Found that if we run LTP netstress test with large MSS (65K),
    the first attempt from server to send data comparable to this
    MSS on fastopen connection will be delayed by the probe timer.

    Here is an example:

    < S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
    > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
    < . ack 1 win 342 length 0

    Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
    as well as in 'size_goal'. As a result, the segment is not queued for
    transmission until all the data has been copied from the user buffer.
    Then, inside __tcp_push_pending_frames(), it breaks on the send window
    test and continues with the probe timer check.

    Fragmentation occurs in tcp_write_wakeup()...

    +0.2 > P. seq 1:43777 ack 1 win 342 length 43776
    < . ack 43777, win 1365 length 0
    > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
    ...

    This also contradicts the fact that we should bound to half of the
    window if it is large.

    Fix this flaw by correctly initializing max_window. Before that, it
    could have large values that affect further calculations of 'size_goal'.

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Alexey Kodanev
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kodanev
     
  • [ Upstream commit 9ed59592e3e379b2e9557dc1d9e9ec8fcbb33f16 ]

    Trying to add an mpls encap route when the MPLS modules are not loaded
    hangs. For example:

    CONFIG_MPLS=y
    CONFIG_NET_MPLS_GSO=m
    CONFIG_MPLS_ROUTING=m
    CONFIG_MPLS_IPTUNNEL=m

    $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

    The ip command hangs:
    root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

    $ cat /proc/880/stack
    [] call_usermodehelper_exec+0xd6/0x134
    [] __request_module+0x27b/0x30a
    [] lwtunnel_build_state+0xe4/0x178
    [] fib_create_info+0x47f/0xdd4
    [] fib_table_insert+0x90/0x41f
    [] inet_rtm_newroute+0x4b/0x52
    ...

    modprobe is trying to load rtnl-lwt-MPLS:

    root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

    and it hangs after loading mpls_router:

    $ cat /proc/881/stack
    [] rtnl_lock+0x12/0x14
    [] register_netdevice_notifier+0x16/0x179
    [] mpls_init+0x25/0x1000 [mpls_router]
    [] do_one_initcall+0x8e/0x13f
    [] do_init_module+0x5a/0x1e5
    [] load_module+0x13bd/0x17d6
    ...

    The problem is that lwtunnel_build_state is called with rtnl lock
    held preventing mpls_init from registering.

    Given the potential references held by the time lwtunnel_build_state is
    called, it cannot drop the rtnl lock to load the module. So, extract the
    module loading code from lwtunnel_build_state into a new function that
    validates the encap type. The new function is called while converting the
    user request into a fib_config, which is well before any table, device or
    fib entries are examined.

    Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 003c941057eaa868ca6fedd29a274c863167230d ]

    Fix up a data alignment issue on sparc by swapping the order
    of the cookie byte array field with the length field in
    struct tcp_fastopen_cookie, and making it a proper union
    to clean up the typecasting.

    This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

    Cc: Eric Dumazet
    Signed-off-by: Shannon Nelson
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Shannon Nelson
     
  • [ Upstream commit 8a430ed50bb1b19ca14a46661f3b1b35f2fb5c39 ]

    rtm_table is an 8-bit field while table ids are allowed up to u32. Commit
    709772e6e065 ("net: Fix routing tables with id > 255 for legacy software")
    added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the
    table id is > 255. The table id returned on get route requests should do
    the same.

    Fixes: c36ba6603a11 ("net: Allow user to get table id from route lookup")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
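
The 8-bit limitation described above can be sketched as follows. `rtm_table_for` is a hypothetical helper name; `RT_TABLE_COMPAT` is the constant from the rtnetlink UAPI header. In this model the full 32-bit id is assumed to be reported separately (the kernel uses the `RTA_TABLE` attribute for that).

```python
RT_TABLE_COMPAT = 252  # value from include/uapi/linux/rtnetlink.h


def rtm_table_for(table_id):
    """Value to place in the 8-bit rtm_table header field."""
    # Ids that do not fit in 8 bits are replaced by the compat marker.
    return table_id if table_id < 256 else RT_TABLE_COMPAT
```

Legacy software that only reads `rtm_table` then sees a well-defined marker instead of a truncated id.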
     
  • [ Upstream commit ea7a80858f57d8878b1499ea0f1b8a635cc48de7 ]

    Handle failure in lwtunnel_fill_encap adding attributes to skb.

    Fixes: 571e722676fe ("ipv4: support for fib route lwtunnel encap attributes")
    Fixes: 19e42e451506 ("ipv6: support for fib route lwtunnel encap attributes")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

15 Jan, 2017

6 commits

  • [ Upstream commit 7a18c5b9fb31a999afc62b0e60978aa896fc89e9 ]

    fib_select_path does not call fib_select_multipath if oif is set in the
    flow struct. For VRF use cases oif is always set, so multipath route
    selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
    check similar to what is done in fib_table_lookup.

    Add saddr and proto to the flow struct for the fib lookup done by the
    VRF driver to better match hash computation for a flow.

    Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 5350d54f6cd12eaff623e890744c79b700bd3f17 ]

    In the case of custom rules being present we need to handle the case of the
    LOCAL table being initialized after the new rule has been added. To address
    that I am adding a new check so that we can make certain we don't use an
    alias of MAIN for LOCAL when allocating a new table.

    Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
    Reported-by: Oliver Brunel
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexander Duyck
     
  • [ Upstream commit 7ababb782690e03b78657e27bd051e20163af2d6 ]

    5.2. Action on Reception of a Query

    When a system receives a Query, it does not respond immediately.
    Instead, it delays its response by a random amount of time, bounded
    by the Max Resp Time value derived from the Max Resp Code in the
    received Query message. A system may receive a variety of Queries on
    different interfaces and of different kinds (e.g., General Queries,
    Group-Specific Queries, and Group-and-Source-Specific Queries), each
    of which may require its own delayed response.

    Before scheduling a response to a Query, the system must first
    consider previously scheduled pending responses and in many cases
    schedule a combined response. Therefore, the system must be able to
    maintain the following state:

    o A timer per interface for scheduling responses to General Queries.

    o A per-group and interface timer for scheduling responses to Group-
    Specific and Group-and-Source-Specific Queries.

    o A per-group and interface list of sources to be reported in the
    response to a Group-and-Source-Specific Query.

    When a new Query with the Router-Alert option arrives on an
    interface, provided the system has state to report, a delay for a
    response is randomly selected in the range (0, [Max Resp Time]) where
    Max Resp Time is derived from Max Resp Code in the received Query
    message. The following rules are then used to determine if a Report
    needs to be scheduled and the type of Report to schedule. The rules
    are considered in order and only the first matching rule is applied.

    1. If there is a pending response to a previous General Query
    scheduled sooner than the selected delay, no additional response
    needs to be scheduled.

    2. If the received Query is a General Query, the interface timer is
    used to schedule a response to the General Query after the
    selected delay. Any previously pending response to a General
    Query is canceled.
    --8
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Tesar
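
Rules 1 and 2 quoted above can be sketched for the General Query case as follows. This is a minimal model under stated assumptions: `handle_general_query` is a hypothetical name, timers are plain numbers, and the randomly selected delay is passed in by the caller so the behaviour is deterministic.

```python
def handle_general_query(pending_expiry, now, delay):
    """Apply rules 1 and 2; return the interface timer expiry to use.

    pending_expiry is the expiry of an already scheduled General Query
    response, or None if nothing is pending.
    """
    selected = now + delay
    if pending_expiry is not None and pending_expiry < selected:
        # Rule 1: an earlier response is already pending, keep it.
        return pending_expiry
    # Rule 2: schedule the interface timer after the selected delay,
    # which cancels any previously pending (later) response.
    return selected
```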
     
  • [ Upstream commit f5a0aab84b74de68523599817569c057c7ac1622 ]

    IPv4 output routes already use l3mdev device instead of loopback for dst's
    if it is applicable. Change local input routes to do the same.

    This fixes icmp responses for unreachable UDP ports which are directed
    to the wrong table after commit 9d1a6c4ea43e4 because local_input
    routes use the loopback device. Moving from ingress device to loopback
    loses the L3 domain, causing responses based on the dst to get lost.

    Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to
    determine L3 domain")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit f0c16ba8933ed217c2688b277410b2a37ba81591 ]

    When we send a packet for our own local address on a non-loopback
    interface (e.g. eth0), due to the change introduced by
    commit 0b922b7a829c ("net: original ingress device index in PKTINFO"), the
    original ingress device index is set to the loopback interface.
    However, the packet should be treated as if it arrived via the
    sending interface (eth0); otherwise it breaks the expectations of
    userspace applications (e.g. the DHCPRELEASE message from the dhcp_release
    binary is ignored by the dnsmasq daemon, since it comes from lo, which
    is not the interface dnsmasq is bound to).

    Fixes: 0b922b7a829c ("net: original ingress device index in PKTINFO")
    Acked-by: David Ahern
    Signed-off-by: Wei Zhang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wei Zhang
     
  • [ Upstream commit 39b2dd765e0711e1efd1d1df089473a8dd93ad48 ]

    Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
    the packet. For sockets that have transport headers pulled, transport
    offset can be negative. Use signed comparison to avoid overflow.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Nisar Jagabar
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
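
The effect of the unsigned comparison can be reproduced with explicit 32-bit masking. This is an illustrative sketch only: the two function names are hypothetical, and the 4 bytes stand for the source and destination port pair that the check must find inside the packet.

```python
U32 = 0xFFFFFFFF


def ports_fit_unsigned(transport_offset, skb_len):
    """Buggy form: a negative offset is converted to unsigned first."""
    return ((transport_offset + 4) & U32) <= (skb_len & U32)


def ports_fit_signed(transport_offset, skb_len):
    """Fixed form: compare as signed integers."""
    return transport_offset + 4 <= skb_len
```

With a transport offset of -8 and a 100-byte packet, the unsigned form wraps to a huge value and wrongly reports that the ports do not fit, while the signed form accepts the packet.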
     

07 Dec, 2016

1 commit

  • There have been some reports lately about TCP connection stalls caused
    by NIC drivers that aren't setting gso_size on aggregated packets on rx
    path. This causes TCP to assume that the MSS is actually the size of the
    aggregated packet, which is invalid.

    Although the proper fix is to be done at each driver, it's often hard
    and cumbersome for one to debug, come to such root cause and report/fix
    it.

    This patch amends this situation in two ways. First, it adds a warning
    when this situation occurs, giving a hint to those trying to debug it.
    It also limits the maximum probed MSS to the advertised MSS, as it
    should never be any higher than that.

    The result is that the connection may not have the best performance ever
    but it shouldn't stall, and the admin will have a hint on what to look
    for.

    Tested with virtio by forcing gso_size to 0.

    v2: updated msg per David's suggestion
    v3: use skb_iif to find the interface and also log its name, per Eric
    Dumazet's suggestion. As the skb may be backlogged and the interface
    gone by then, we need to check if the number still has a meaning.
    v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
    David's suggestion

    Cc: Jonathan Maxwell
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
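
The two changes described above (cap the probed MSS, and flag the suspicious case) can be modeled roughly like this. `measure_rcv_mss` and its parameters are hypothetical names chosen for the sketch; the real code also handles smaller segments with extra heuristics that are omitted here.

```python
def measure_rcv_mss(rcv_mss, seg_len, advmss):
    """Return (new_rcv_mss, warn) for one received segment."""
    if seg_len < rcv_mss:
        # Smaller segments do not raise the estimate in this model.
        return rcv_mss, False
    capped = min(seg_len, advmss)
    # warn is True when the segment looked larger than the MSS we
    # advertised, which hints that gso_size was left unset on an
    # aggregated packet.
    return capped, capped != seg_len
```

A 65000-byte aggregated packet with an advertised MSS of 1460 therefore yields an estimate of 1460 plus a warning, rather than an estimate of 65000.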
     

06 Dec, 2016

3 commits

  • Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
    was no check that the iovec contained enough bytes for an ICMP header,
    and the read loop would walk across neighboring stack contents. Since the
    iov_iter conversion, bad arguments are noticed, but the returned error is
    EFAULT. Returning EINVAL is a clearer error and also solves the problem
    prior to v3.19.

    This was found using trinity with KASAN on v3.18:

    BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
    Read of size 8 by task trinity-c2/9623
    page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x0()
    page dumped because: kasan: bad access detected
    CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15
    Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
    Call trace:
    [] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
    [] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
    [< inline >] print_address_description mm/kasan/report.c:147
    [< inline >] kasan_report_error mm/kasan/report.c:236
    [] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
    [< inline >] check_memory_region mm/kasan/kasan.c:264
    [] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
    [] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
    [< inline >] memcpy_from_msg include/linux/skbuff.h:2667
    [] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
    [] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
    [] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
    [< inline >] __sock_sendmsg_nosec net/socket.c:624
    [< inline >] __sock_sendmsg net/socket.c:632
    [] sock_sendmsg+0x124/0x164 net/socket.c:643
    [< inline >] SYSC_sendto net/socket.c:1797
    [] SyS_sendto+0x178/0x1d8 net/socket.c:1761

    CVE-2016-8399

    Reported-by: Qidan He
    Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
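
A minimal sketch of the length check described above, assuming an 8-byte ICMP header and kernel-style negative errno return values. The Python function reuses the name `ping_common_sendmsg` purely for orientation; it is a model, not the kernel function.

```python
import errno

ICMP_HEADER_LEN = 8  # size of an ICMP echo header in bytes


def ping_common_sendmsg(payload):
    """Return 0 on success or a negative errno, kernel style."""
    if len(payload) < ICMP_HEADER_LEN:
        # Reject short buffers up front instead of reading past them.
        return -errno.EINVAL
    return 0
```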
     
  • It has been reported that update_suffix can be expensive when it is called
    on a large node in which most of the suffix lengths are the same. The time
    required to add 200K entries had increased from around 3 seconds to almost
    49 seconds.

    In order to address this we need to move the code for updating the suffix
    out of resize and instead just have it handled in the cases where we are
    pushing a node that increases the suffix length, or will decrease the
    suffix length.

    Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
    Reported-by: Robert Shearman
    Signed-off-by: Alexander Duyck
    Reviewed-by: Robert Shearman
    Tested-by: Robert Shearman
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • It wasn't necessary to pass a leaf in when doing the suffix updates so just
    drop it. Instead just pass the suffix and work with that.

    Since we dropped the leaf there is no need to include that in the name so
    the names are updated to node_push_suffix and node_pull_suffix.

    Finally I noticed that the logic for pulling the suffix length back
    actually had some issues. Specifically it would stop prematurely if there
    was a longer suffix, but it was not as long as the original suffix. I
    updated the code to address that in node_pull_suffix.

    Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
    Suggested-by: Robert Shearman
    Signed-off-by: Alexander Duyck
    Reviewed-by: Robert Shearman
    Tested-by: Robert Shearman
    Signed-off-by: David S. Miller

    Alexander Duyck
     

03 Dec, 2016

1 commit

  • When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

    where skb_gso_segment() relies on skb->protocol to function properly.

    This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
    fixing a bug where GSO packets sent through a sit tunnel are dropped
    when xfrm is involved.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eli Cooper
    Signed-off-by: David S. Miller

    Eli Cooper
     

02 Dec, 2016

2 commits

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2016-12-01

    1) Change the error value when someone tries to run 32bit
    userspace on a 64bit host from -ENOTSUPP to the userspace
    exported -EOPNOTSUPP. Fix from Yi Zhao.

    2) On inbound, ESN sequence numbers are already in network
    byte order. So don't try to convert it again, this fixes
    integrity verification for ESN. Fixes from Tobias Brunner.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    This is a large batch of Netfilter fixes for net, they are:

    1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
    structure that allows to have several objects with the same key.
    Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
    expecting a return value similar to memcmp(). Change location of
    the nat_bysource field in the nf_conn structure to avoid zeroing
    this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
    to crashes. From Florian Westphal.

    2) Don't allow malformed fragments to go through in IPv6; drop them,
    otherwise we hit a GPF. Patch from Florian Westphal.

    3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

    4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

    5) Two patches from David Ahern to fix netfilter interaction with vrf.
    From David Ahern.

    6) Fix element timeout calculation in nf_tables, we take milliseconds
    from userspace, but we use jiffies from kernelspace. Patch from
    Anders K. Pedersen.

    7) Missing validation length netlink attribute for nft_hash, from
    Laura Garcia.

    8) Fix nf_conntrack_helper documentation, we don't default to off
    anymore for a bit of time so let's get this in sync with the code.

    I know it is late, but I think these are important, specifically the NAT
    bits, as they are mostly addressing fallout from recent changes. I also
    read there are chances to have -rc8, if that is the case, that would
    also give us a bit more time to test this.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
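
Item 6 in the pull request above concerns a unit mismatch between milliseconds and jiffies. The conversion can be sketched as below; the tick rate `HZ = 250` is an assumption for illustration (the real value is a kernel build option), and this Python function only models the round-up behaviour of the kernel helper of the same name.

```python
HZ = 250  # assumed tick rate, for illustration only


def msecs_to_jiffies(ms, hz=HZ):
    """Convert milliseconds to ticks, rounding up to a whole tick."""
    return (ms * hz + 999) // 1000
```

Treating a millisecond value as if it were already in jiffies would, at this assumed tick rate, make a timeout four times longer than requested.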
     

01 Dec, 2016

1 commit

  • Since commit 09d9686047db ("netfilter: x_tables: do compat validation via
    translate_table"), the compatr structure is used to fill in the newinfo
    structure. In translate_compat_table of ip_tables.c and ip6_tables.c,
    compatr->hook_entry replaces info->hook_entry and compatr->underflow
    replaces info->underflow, but the same replacement was not made in
    arp_tables.c.

    As a result, invoking the 32-bit "arptables -P INPUT ACCEPT" fails on a
    64-bit kernel.
    --------------------------------------
    root@qemux86-64:~# arptables -P INPUT ACCEPT
    root@qemux86-64:~# arptables -P INPUT ACCEPT
    ERROR: Policy for `INPUT' offset 448 != underflow 0
    arptables: Incompatible with this kernel
    --------------------------------------

    Fixes: 09d9686047db ("netfilter: x_tables: do compat validation via translate_table")
    Signed-off-by: Hongxu Jia
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Hongxu Jia
     

30 Nov, 2016

2 commits


29 Nov, 2016

1 commit


25 Nov, 2016

1 commit

  • In commits 93821778def10 ("udp: Fix rcv socket locking") and
    f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
    __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
    was forgotten.

    This leads to crashes if UDPlite header is pulled twice, which happens
    starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets
    before queueing")

    Bug found by syzkaller team, thanks a lot guys !

    Note that backlog use in UDP/UDPlite is scheduled to be removed starting
    from linux-4.10, so this patch is only needed up to linux-4.9

    Fixes: 93821778def1 ("udp: Fix rcv socket locking")
    Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Cc: Benjamin LaHaise
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Nov, 2016

1 commit

  • ip_route_me_harder is not considering the L3 domain and sending lookups
    to the wrong table. For example consider the following output rule:

    iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset

    using perf to analyze lookups via the fib_table_lookup tracepoint shows:

    vrf-test 1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
    ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
    ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
    ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
    ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
    ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])

    and

    vrf-test 1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
    ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
    ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
    ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
    ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
    ffffffff81499758 __fib_lookup ([kernel.kallsyms])
    ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
    ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
    ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
    ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])

    In both cases the lookups are directed to table 255 rather than the
    table associated with the device via the L3 domain. Update both
    lookups to pull the L3 domain from the dst currently attached to the
    skb.

    Signed-off-by: David Ahern
    Signed-off-by: Pablo Neira Ayuso

    David Ahern
     

22 Nov, 2016

1 commit

  • We need to zero out the private data area when application switches
    connection to different algorithm (TCP_CONGESTION setsockopt).

    When congestion ops get assigned at connect time everything is already
    zeroed because sk_alloc uses GFP_ZERO flag. But in the setsockopt case
    this contains whatever previous cc placed there.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
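
The stale-state problem described above can be modeled with a byte buffer standing in for the per-socket private area. Everything here is hypothetical: the `Sock` class, the field names and the 64-byte size are assumptions made for the sketch.

```python
ICSK_CA_PRIV_SIZE = 64  # assumed size, for illustration only


class Sock:
    """Toy socket with a congestion-control name and private scratch area."""

    def __init__(self):
        self.ca_name = None
        # Freshly allocated sockets start out zeroed.
        self.ca_priv = bytearray(ICSK_CA_PRIV_SIZE)


def set_congestion_control(sk, name):
    """Switch algorithms and clear state left behind by the previous one."""
    sk.ca_name = name
    sk.ca_priv[:] = bytes(len(sk.ca_priv))
```

Without the clearing step, the new algorithm would start from whatever bytes the previous one had written.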
     

17 Nov, 2016

2 commits

  • Fix a small memory leak that can occur where we leak a fib_alias in the
    event of us not being able to insert it into the local table.

    Fixes: 0ddcf43d5d4a0 ("ipv4: FIB Local/MAIN table collapse")
    Reported-by: Eric Dumazet
    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • The patch that removed the FIB offload infrastructure was a bit too
    aggressive and also removed code needed to clean up after splitting the table
    if additional rules were added. Specifically the function
    fib_trie_flush_external was called at the end of a new rule being added to
    flush the foreign trie entries from the main trie.

    I updated the code so that we only call fib_trie_flush_external on the main
    table so that we flush the entries for local from main. This way we don't
    call it for every rule change which is what was happening previously.

    Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
    Reported-by: Eric Dumazet
    Cc: Jiri Pirko
    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

16 Nov, 2016

2 commits

  • Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(),
    otherwise udplite broadcast/multicast use the wrong table and it breaks.

    Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.")
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • In commit 24cf3af3fed5 ("igmp: call ip_mc_clear_src..."), we forgot to remove
    igmpv3_clear_delrec() in ip_mc_down(), which also calls ip_mc_clear_src().
    This makes us clear all IGMPv3 source filter info after NETDEV_DOWN.
    Move igmpv3_clear_delrec() to ip_mc_destroy_dev(); ip_mc_clear_src() is
    then no longer needed in ip_mc_destroy_dev().

    On the other hand, we should restore rather than free all source filter
    info in igmpv3_del_delrec(). Otherwise we will not be able to restore IGMPv3
    source filter info after NETDEV_UP and NETDEV_POST_TYPE_CHANGE.

    Fixes: 24cf3af3fed5 ("igmp: call ip_mc_clear_src() only when ...")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

14 Nov, 2016

2 commits

  • With syzkaller help, Marco Grassi found a bug in TCP stack,
    crashing in tcp_collapse()

    Root cause is that sk_filter() can truncate the incoming skb,
    but TCP stack was not really expecting this to happen.
    It probably was expecting a simple DROP or ACCEPT behavior.

    We first need to make sure no part of TCP header could be removed.
    Then we need to adjust TCP_SKB_CB(skb)->end_seq

    Many thanks to syzkaller team and Marco for giving us a reproducer.

    Signed-off-by: Eric Dumazet
    Reported-by: Marco Grassi
    Reported-by: Vladis Dronov
    Signed-off-by: David S. Miller

    Eric Dumazet
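
The two requirements above (never cut into the TCP header, and keep the end sequence number consistent with the trimmed length) can be sketched as follows. The function signature is hypothetical and sequence-number wraparound is ignored for brevity.

```python
def tcp_filter(seq, syn, fin, skb_len, header_len, trimmed_len):
    """Return (new_len, end_seq) after a socket filter trims a segment.

    trimmed_len is the length the filter asks for; it is clamped so the
    TCP header (header_len bytes) is never cut.
    """
    # End sequence number before filtering: payload plus SYN/FIN flags.
    end_seq = seq + syn + fin + skb_len - header_len
    new_len = max(min(trimmed_len, skb_len), header_len)
    eaten = skb_len - new_len
    # Bytes removed by the filter no longer count as sequence space.
    return new_len, end_seq - eaten
```

For a 120-byte segment with a 20-byte header starting at sequence 1000, trimming to 60 bytes leaves 40 bytes of payload, so the end sequence drops from 1100 to 1040.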
     
  • In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
    and then the state of the neigh for the new_gw is checked. If the state
    isn't valid then the redirected route is deleted. This behavior is
    maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
    is assigned to peer->redirect_learned.a4 before calling
    ipv4_neigh_lookup().

    After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in
    struct rtable again."), ipv4_neigh_lookup() is performed without the
    rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
    isn't zero, the function uses it as the key. The neigh is most likely
    valid since the old_gw is the one that sends the ICMP redirect message.
    Then the new_gw is assigned to fib_nh_exception. The problem is that the
    new_gw ARP may never get resolved, and the traffic is blackholed.

    So, use the new_gw for neigh lookup.

    Changes from v1:
    - use __ipv4_neigh_lookup instead (per Eric Dumazet).

    Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
    Signed-off-by: Stephen Suryaputra Lin
    Signed-off-by: David S. Miller

    Stephen Suryaputra Lin
     

11 Nov, 2016

1 commit