25 Jan, 2021

1 commit


17 Jan, 2021

1 commit

  • [ Upstream commit d8f5c29653c3f6995e8979be5623d263e92f6b86 ]

    Route removal is handled by two code paths. The main removal path is via
    fib6_del_route() which will handle purging any PMTU exceptions from the
    cache, removing all per-cpu copies of the DST entry used by the route, and
    releasing the fib6_info struct.

    The second removal location is in fib6_add_rt2node(), during a route
    replacement operation. This path also calls fib6_purge_rt() to handle
    cleaning up the per-cpu copies of the DST entries and releasing the
    fib6_info associated with the older route, but it does not flush any PMTU
    exceptions that the older route had. Since the older route is removed from
    the tree during the replacement, we lose any way of accessing it again.

    As these lingering DSTs and the fib6_info struct are holding references to
    the underlying netdevice struct as well, unregistering that device from the
    kernel can never complete.

    Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache")
    Signed-off-by: Sean Tranchetti
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Sean Tranchetti
     

13 Jan, 2021

1 commit

  • commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.

    This fixes the dereference to fetch the RCU pointer when holding
    the appropriate xtables lock.

    Reported-by: kernel test robot
    Fixes: cc00bcaa5899 ("netfilter: x_tables: Switch synchronization to RCU")
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Subash Abhinov Kasiviswanathan
     

11 Dec, 2020

1 commit


10 Dec, 2020

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
    from Subash Abhinov Kasiviswanathan.

    2) Fix netlink dump of dynset timeouts later than 23 days.

    3) Add comment for the indirect serialization of the nft commit mutex
    with rtnl_mutex.

    4) Remove bogus check for confirmed conntrack when matching on the
    conntrack ID, from Brett Mastbergen.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • For DCTCP, we have to retain the ECT bits set by the congestion control
    algorithm on the socket when reflecting syn TOS in syn-ack, in order to
    make ECN work properly.

    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Reported-by: Alexander Duyck
    Signed-off-by: Wei Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
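
A minimal sketch of the bit handling described above, assuming the usual layout in which the two low bits of the TOS byte carry ECN. The helper name synack_tos() and its parameters are hypothetical, not the kernel's API: the DSCP bits come from the SYN, while the ECT bits set by the congestion control algorithm on the socket are retained.

```c
#include <stdint.h>

#define ECN_MASK 0x03u  /* low two bits of the TOS / traffic class byte */

/* Hypothetical helper: TOS to use for the SYN-ACK.
 * syn_tos:     TOS byte received in the SYN
 * sk_tos:      TOS byte configured on the socket (ECT bits set by the CC)
 * reflect_tos: non-zero when SYN TOS reflection is enabled */
static uint8_t synack_tos(uint8_t syn_tos, uint8_t sk_tos, int reflect_tos)
{
    if (!reflect_tos)
        return sk_tos;
    /* DSCP bits from the SYN, ECN bits from the socket. */
    return (uint8_t)((syn_tos & ~ECN_MASK) | (sk_tos & ECN_MASK));
}
```

With reflection enabled, a SYN carrying 0xb9 and a socket TOS of 0x02 (ECT(0)) would give 0xba in this sketch.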
     

09 Dec, 2020

1 commit


08 Dec, 2020

1 commit

  • When running concurrent iptables rules replacement with data, the per CPU
    sequence count is checked after the assignment of the new information.
    The sequence count is used to synchronize with the packet path without the
    use of any explicit locking. If there are any packets in the packet path using
    the table information, the sequence count is incremented to an odd value and
    is incremented to an even value once packet processing completes.

    The new table value assignment is followed by a write memory barrier so every
    CPU should see the latest value. If the packet path has started with the old
    table information, the sequence counter will be odd and the iptables
    replacement will wait till the sequence count is even prior to freeing the
    old table info.

    However, this assumes that the new table information assignment and the memory
    barrier is actually executed prior to the counter check in the replacement
    thread. If the CPU decides to execute the assignment later, as there is no user
    of the table information prior to the sequence check, the packet path on another
    CPU may use the old table information. The replacement thread would then free
    the table information under it, leading to a use after free in the packet
    processing context:

    Unable to handle kernel NULL pointer dereference at virtual
    address 000000000000008e
    pc : ip6t_do_table+0x5d0/0x89c
    lr : ip6t_do_table+0x5b8/0x89c
    ip6t_do_table+0x5d0/0x89c
    ip6table_filter_hook+0x24/0x30
    nf_hook_slow+0x84/0x120
    ip6_input+0x74/0xe0
    ip6_rcv_finish+0x7c/0x128
    ipv6_rcv+0xac/0xe4
    __netif_receive_skb+0x84/0x17c
    process_backlog+0x15c/0x1b8
    napi_poll+0x88/0x284
    net_rx_action+0xbc/0x23c
    __do_softirq+0x20c/0x48c

    This could be fixed by forcing instruction order after the new table
    information assignment or by switching to RCU for the synchronization.

    Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
    Reported-by: Sean Tranchetti
    Reported-by: kernel test robot
    Suggested-by: Florian Westphal
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Pablo Neira Ayuso

    Subash Abhinov Kasiviswanathan
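
The odd/even convention described above can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: the names xt_seq, packet_path_enter(), packet_path_exit() and old_table_unused() are hypothetical, NR_CPUS is an arbitrary value, and no memory barriers are shown, so it illustrates the counting scheme only and not the reordering bug itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4  /* illustrative CPU count, not taken from the commit */

/* Per-CPU sequence counters: odd while that CPU is using the table. */
static unsigned int xt_seq[NR_CPUS];

static void packet_path_enter(int cpu) { xt_seq[cpu]++; } /* even -> odd */
static void packet_path_exit(int cpu)  { xt_seq[cpu]++; } /* odd -> even */

/* True once no CPU is inside the packet path. The replacement thread
 * may free the old table info only after this holds. */
static bool old_table_unused(void)
{
    for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
        if (xt_seq[cpu] & 1)
            return false;
    return true;
}
```

The check is only meaningful if the new table pointer is published before the counters are read, which is the ordering the commit message says cannot be taken for granted.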
     

03 Dec, 2020

1 commit

    syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
    could reproduce it by creating an NBMA ip6gre interface and forwarding
    traffic to it:

    skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:109!
    Call Trace:
    skb_push+0x10/0x10
    ip6gre_header+0x47/0x1b0
    neigh_connected_output+0xae/0xf0

    ip6gre tunnel provides its own header_ops->create, and sets it
    conditionally when initializing the tunnel in NBMA mode. When
    header_ops->create is used, dev->hard_header_len should reflect the
    length of the header created. Otherwise, when not used,
    dev->needed_headroom should be used.

    Fixes: eb95f52fc72d ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
    Cc: Maria Pasechnik
    Signed-off-by: Antoine Tenart
    Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
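
To illustrate why the reserved header space has to match the header that is actually pushed, here is a toy model of the headroom check. The struct toy_skb and toy_push() names are hypothetical; a false return stands in for the skb_under_panic seen in the trace above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy packet buffer: only the bookkeeping needed for the headroom check. */
struct toy_skb {
    size_t headroom; /* bytes still free in front of the packet data */
    size_t len;      /* bytes of packet data, including pushed headers */
};

/* Prepend hdr_len bytes of header. Returns false when the buffer was
 * allocated with too little headroom for the header being created. */
static bool toy_push(struct toy_skb *skb, size_t hdr_len)
{
    if (hdr_len > skb->headroom)
        return false;
    skb->headroom -= hdr_len;
    skb->len += hdr_len;
    return true;
}
```

In this model, a buffer with 16 bytes of headroom rejects a 44-byte header, while one with 64 bytes accepts it.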
     

28 Nov, 2020

1 commit


26 Nov, 2020

1 commit

  • kmemleak reports a memory leak as follows:

    unreferenced object 0xffff8880059c6a00 (size 64):
    comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
    hex dump (first 32 bytes):
    20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00 ...............
    1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00 ................
    backtrace:
    [] ip6addrlbl_add+0x90/0xbb0
    [] ip6addrlbl_net_init+0x109/0x170
    [] ops_init+0xa8/0x3c0
    [] setup_net+0x2de/0x7e0
    [] copy_net_ns+0x27d/0x530
    [] create_new_namespaces+0x382/0xa30
    [] unshare_nsproxy_namespaces+0xa1/0x1d0
    [] ksys_unshare+0x3a4/0x780
    [] __x64_sys_unshare+0x2d/0x40
    [] do_syscall_64+0x33/0x40
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    We should free all rules when we catch an error in ip6addrlbl_net_init(),
    otherwise a memory leak will occur.

    Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
    Signed-off-by: Jakub Kicinski

    Wang Hai
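
The general pattern, freeing everything added so far when an init routine fails part way through, can be sketched as follows. This is not the kernel code: struct label, labels_init(), free_all() and the fail_at parameter are hypothetical and exist only to make the failure path easy to exercise.

```c
#include <stdlib.h>

struct label {
    struct label *next;
    int id;
};

static int live_labels; /* entries allocated and not yet freed */

/* Free every entry on the list and leave the list empty. */
static void free_all(struct label **head)
{
    while (*head) {
        struct label *next = (*head)->next;
        free(*head);
        live_labels--;
        *head = next;
    }
}

/* Add `count` entries. A non-negative `fail_at` simulates an allocation
 * failure at that index. On failure nothing is left behind. */
static int labels_init(struct label **head, int count, int fail_at)
{
    for (int i = 0; i < count; i++) {
        struct label *l = (i == fail_at) ? NULL : malloc(sizeof(*l));
        if (!l) {
            free_all(head); /* undo the entries added before the error */
            return -1;
        }
        l->id = i;
        l->next = *head;
        *head = l;
        live_labels++;
    }
    return 0;
}
```

Without the free_all() call in the error path, the entries added before the failing index would stay allocated with nothing referencing them, which is the kind of leak kmemleak reports above.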
     

25 Nov, 2020

1 commit

  • When a BPF program is used to select between TCP congestion control
    algorithms that do or do not use ECN, there is a case where the synack
    was sent without the ECT0 bit set. A bit of research found that this
    was due to the final socket being configured to dctcp while the
    listener socket was staying in cubic.

    To reproduce it all that is needed is to monitor TCP traffic while running
    the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
    assuming tcp_dctcp module is loaded or compiled in and the traffic matches
    the rules in the sample file, is that for all frames with the exception of
    the synack the ECT0 bit is set.

    To address that it is necessary to make one additional call to
    tcp_bpf_ca_needs_ecn using the request socket and then use the output of
    that to set the ECT0 bit for the tos/tclass of the packet.

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Alexander Duyck
    Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     

24 Nov, 2020

1 commit

  • When the TCP stack is in SYN flood mode, the server child socket is
    created from the SYN cookie received in a TCP packet with the ACK flag
    set.

    The child socket is created when the server receives the first TCP
    packet with a valid SYN cookie from the client. Usually, this packet
    corresponds to the final step of the TCP 3-way handshake, the ACK
    packet. But it is also possible to receive a valid SYN cookie from the
    first TCP data packet sent by the client, and thus create a child socket
    from that SYN cookie.

    Since a client socket is ready to send data as soon as it receives the
    SYN+ACK packet from the server, the client can send the ACK packet (sent
    by the TCP stack code), and the first data packet (sent by the userspace
    program) almost at the same time, and thus the server will equally
    receive the two TCP packets with valid SYN cookies almost at the same
    instant.

    When such an event happens, the TCP stack code has a race condition that
    occurs between the moment a lookup is done in the established
    connections hashtable to check for the existence of a connection for the
    same client, and the moment that the child socket is added to the
    established connections hashtable. As a consequence, this race condition
    can lead to a situation where we add two child sockets to the
    established connections hashtable and deliver two sockets for the same
    client to the userspace program.

    This patch fixes the race condition by checking whether a child socket
    already exists for the same client when we are adding the second child
    socket to the established connections hashtable. If one exists, we drop
    the packet and discard the second child socket for the same client.

    Signed-off-by: Ricardo Dias
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
    Signed-off-by: Jakub Kicinski

    Ricardo Dias
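
The fix amounts to making "check for an existing entry" and "insert" a single step. A minimal sketch, assuming a chained hash table keyed by a client identifier: the names ehash, struct conn and insert_unless_exists() are hypothetical, and the locking that the real code needs around this step is not shown.

```c
#include <stddef.h>

#define BUCKETS 16

struct conn {
    unsigned int client; /* stands in for the connection 4-tuple */
    struct conn *next;
};

static struct conn *ehash[BUCKETS];

/* Insert `c` unless an entry for the same client already exists.
 * Returns the entry that ends up in the table: `c` itself on success,
 * or the earlier entry, in which case the caller discards `c`. */
static struct conn *insert_unless_exists(struct conn *c)
{
    struct conn **slot = &ehash[c->client % BUCKETS];

    for (struct conn *it = *slot; it; it = it->next)
        if (it->client == c->client)
            return it;

    c->next = *slot;
    *slot = c;
    return c;
}
```

A caller that gets back a pointer other than the one it passed in knows it lost the race, and can drop the packet and free its own child socket.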
     

21 Nov, 2020

1 commit

  • An issue was recently found where DCTCP SYN/ACK packets did not have the
    ECT bit set in the L3 header. A bit of code review found that the recent
    change referenced below had gone through and added a mask that prevented the
    ECN bits from being populated in the L3 header.

    This patch addresses that by rolling back the mask so that it is only
    applied to the flags coming from the incoming TCP request instead of
    applying it to the socket tos/tclass field. Doing this the ECT bits were
    restored in the SYN/ACK packets in my testing.

    One thing that is not addressed by this patch set is the fact that
    tcp_reflect_tos appears to be incompatible with ECN based congestion
    avoidance algorithms. At a minimum the feature should likely be documented,
    which it currently isn't.

    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Signed-off-by: Alexander Duyck
    Acked-by: Wei Wang
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     

20 Nov, 2020

2 commits

  • …nux/kernel/git/netdev/net") into android-mainline

    Steps on the way to 5.10-rc5

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: I00726ee0d08f08ae6ac5edd07c8fa502b41d4800

    Greg Kroah-Hartman
     
  • IPV6=m
    NF_DEFRAG_IPV6=y

    ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
    `nf_ct_frag6_gather':
    net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
    `ipv6_frag_thdr_truncated'

    Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
    dependency is forcing IPV6=y.

    Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
    is the same solution as used for a similar issue; see
    commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
    module")

    Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers")
    Reported-by: Randy Dunlap
    Reported-by: kernel test robot
    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Acked-by: Randy Dunlap # build-tested
    Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

19 Nov, 2020

1 commit

  • Fix to return a negative error code from the error handling
    case instead of 0, as done elsewhere in this function.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Changzhong
    Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.com
    Signed-off-by: Jakub Kicinski

    Zhang Changzhong
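
The class of bug fixed here is easy to show in isolation. The sketch below is generic and not the function touched by the commit; setup_two_steps() and its fail_second parameter are hypothetical. The point is that the error path must set err before jumping to the cleanup label.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical two-step setup. `fail_second` forces the second step to
 * fail so that the error path can be exercised. */
static int setup_two_steps(int fail_second)
{
    int err = 0;
    void *first = malloc(16);

    if (!first)
        return -ENOMEM;

    void *second = fail_second ? NULL : malloc(16);
    if (!second) {
        err = -ENOMEM; /* without this assignment the caller would see 0 */
        goto out_free_first;
    }
    free(second);

out_free_first:
    free(first);
    return err;
}
```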
     

17 Nov, 2020

1 commit

  • Packets are processed even though the first fragment doesn't include all
    headers through the upper layer header. This breaks TAHI IPv6 Core
    Conformance Test v6LC.1.3.6.

    Referring to RFC8200 SECTION 4.5: "If the first fragment does not include
    all headers through an Upper-Layer header, then that fragment should be
    discarded and an ICMP Parameter Problem, Code 3, message should be sent to
    the source of the fragment, with the Pointer field set to zero."

    The fragment needs to be validated the same way it is done in
    commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't
    include all headers") for ipv6. Wrap the validation into a common function,
    ipv6_frag_thdr_truncated() to check for truncation in the upper layer
    header. This validation does not fulfill all aspects of RFC 8200,
    section 4.5, but is at the moment sufficient to pass the mentioned TAHI test.

    In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
    let ipv6_frag_thdr_truncated() start its traversal from the fragment
    header.

    Return 0 to drop the fragment in the netfilter. This is the same behaviour
    as used on other protocol errors in this function, e.g. when
    nf_ct_frag6_queue() returns -EPROTO. The fragment will later be picked up
    by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
    appropriate ICMP Parameter Problem message back to the source.

    References commit 2efdaaaf883a ("IPv6: reply ICMP error if the first
    fragment don't include all headers")

    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

14 Nov, 2020

2 commits

  • genlmsg_cancel() needs to be called in the error path of
    inet6_fill_ifmcaddr and inet6_fill_ifacaddr to cancel
    the message.

    Fixes: 6ecf4c37eb3e ("ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Qilong
    Link: https://lore.kernel.org/r/20201112080950.1476302-1-zhangqilong3@huawei.com
    Signed-off-by: Jakub Kicinski

    Zhang Qilong
     
  • Commit 58956317c8de ("neighbor: Improve garbage collection")
    guarantees neighbour table entries a five-second lifetime. Processes
    which make heavy use of multicast can fill the neighbour table with
    multicast addresses in five seconds. At that point, neighbour entries
    can't be GC-ed because they aren't five seconds old yet, the kernel
    log starts to fill up with "neighbor table overflow!" messages, and
    sends start to fail.

    This patch allows multicast addresses to be thrown out before they've
    lived out their five seconds. This makes room for non-multicast
    addresses and makes messages to all addresses more reliable in these
    circumstances.

    Fixes: 58956317c8de ("neighbor: Improve garbage collection")
    Signed-off-by: Jeff Dike
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
    Signed-off-by: Jakub Kicinski

    Jeff Dike
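
The eviction rule described above can be written as a small predicate. This sketch assumes IPv6 addresses, where multicast is ff00::/8; struct toy_neigh, is_multicast() and may_evict() are hypothetical names, and the kernel's real GC considers more state than the age shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_AGE_SECS 5 /* lifetime guaranteed to non-multicast entries */

struct toy_neigh {
    uint8_t addr[16];      /* IPv6 address in network byte order */
    unsigned int age_secs; /* seconds since the entry was last updated */
};

/* IPv6 multicast addresses are those in ff00::/8. */
static bool is_multicast(const struct toy_neigh *n)
{
    return n->addr[0] == 0xff;
}

/* Multicast entries may be evicted at any age; other entries keep
 * their five-second guarantee. */
static bool may_evict(const struct toy_neigh *n)
{
    return is_multicast(n) || n->age_secs >= MIN_AGE_SECS;
}
```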
     

13 Nov, 2020

2 commits

  • …cm/fs/fscrypt/fscrypt") into android-mainline

    Steps on the way to 5.10-rc4

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: I8554ba37704bee02192ff6117d4909fde568fca2

    Greg Kroah-Hartman
     
  • udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
    packet. While it's probably OK for non-frag0 paths, these helpers
    will also point to junk on Fast/frag0 GRO when all headers are
    located in frags. As a result, sk/skb lookup may fail or give wrong
    results. To support both GRO modes, skb_gro_network_header() might
    be used. To not modify original functions, add private versions of
    udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

    Present since the introduction of "application-level" UDP GRO
    in 4.7-rc1.

    Misc: replace totally unneeded ternaries with plain ifs.

    Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket")
    Suggested-by: Willem de Bruijn
    Cc: Eric Dumazet
    Signed-off-by: Alexander Lobakin
    Acked-by: Willem de Bruijn
    Signed-off-by: Jakub Kicinski

    Alexander Lobakin
     

11 Nov, 2020

1 commit

  • When net.ipv4.tcp_syncookies=1 and a SYN flood occurs,
    cookie_v4_check or cookie_v6_check tries to redo what
    tcp_v4_send_synack or tcp_v6_send_synack did.
    rsk_window_clamp will be changed if SOCK_RCVBUF is set,
    which makes rcv_wscale different. The client still operates
    with the initial window scale and can overshoot the granted
    window: the client uses the initial scale, but the local
    server uses the new scale to advertise its window value, and
    the session works abnormally.

    Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
    Signed-off-by: Mao Wenan
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
    Signed-off-by: Jakub Kicinski

    Mao Wenan
     

10 Nov, 2020

1 commit

  • Due to the legacy usage of hard_header_len for SIT tunnels while
    already using infrastructure from net/ipv4/ip_tunnel.c the
    calculation of the path MTU in tnl_update_pmtu is incorrect.
    This leads to unnecessary creation of MTU exceptions for any
    flow going over a SIT tunnel.

    As SIT tunnels do not have a header themselves other than their
    transport (L3, L2) headers, we leave hard_header_len set to zero,
    as tnl_update_pmtu already takes care of the transport header
    sizes.

    This will also help avoid unnecessary IPv6 GC runs and the spinlock
    contention seen when using SIT tunnels with more than
    net.ipv6.route.gc_thresh flows.

    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Signed-off-by: Oliver Herms
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws
    Signed-off-by: Jakub Kicinski

    Oliver Herms
     

09 Nov, 2020

1 commit


05 Nov, 2020

1 commit


01 Nov, 2020

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Incorrect netlink report logic in flowtable and genID.

    2) Add a selftest to check that wireguard passes the right sk
    to ip_route_me_harder, from Jason A. Donenfeld.

    3) Pass the actual sk to ip_route_me_harder(), also from Jason.

    4) Missing expression validation of updates via nft --check.

    5) Update byte and packet counters regardless of whether they
    match, from Stefano Brivio.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Based on RFC 8200, Section 4.5 Fragment Header:

    - If the first fragment does not include all headers through an
    Upper-Layer header, then that fragment should be discarded and
    an ICMP Parameter Problem, Code 3, message should be sent to
    the source of the fragment, with the Pointer field set to zero.

    Checking each packet header in IPv6 fast path will have performance impact,
    so I put the checking in ipv6_frag_rcv().

    As the packet may be any kind of L4 protocol, I only checked the header
    length of some common protocols and handle the others with
    (offset + 1) > skb->len. Also use !(frag_off & htons(IP6_OFFSET)) to catch
    atomic fragments (fragmented packet with only one fragment).

    When sending an ICMP error message, if the first truncated fragment is an
    ICMP message, icmp6_send() will stop because is_ineligible() returns true.
    So I added a check in is_ineligible() to make it return false for a
    fragment with nexthdr ICMP but no ICMP header.

    Signed-off-by: Hangbin Liu
    Signed-off-by: Jakub Kicinski

    Hangbin Liu
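
The length check described above can be sketched in userspace as follows. The function name upper_header_truncated() and the enum names are hypothetical. The header sizes are assumptions: 20 bytes for the minimum TCP header, 8 for UDP, and 8 for the fixed part of an ICMPv6 header. Any other protocol only needs one byte beyond the offset, which is the (offset + 1) > len rule from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* IANA protocol numbers for the upper-layer headers checked here. */
enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_ICMPV6 = 58 };

/* offset:  where the upper-layer header starts in the first fragment
 * pkt_len: number of bytes actually present in the first fragment */
static bool upper_header_truncated(int nexthdr, size_t offset, size_t pkt_len)
{
    switch (nexthdr) {
    case PROTO_TCP:
        offset += 20; /* minimum TCP header */
        break;
    case PROTO_UDP:
        offset += 8;  /* UDP header */
        break;
    case PROTO_ICMPV6:
        offset += 8;  /* assumed size of the fixed ICMPv6 header */
        break;
    default:
        offset += 1;  /* at least one byte of the header must be present */
        break;
    }
    return offset > pkt_len;
}
```

A first fragment for which this returns true would be discarded and answered with an ICMP Parameter Problem, Code 3, message, as the quoted RFC text requires.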
     

30 Oct, 2020

2 commits

  • ip6_tnl_encap assigns to proto the transport protocol that
    encapsulates the inner packet, but we must pass the protocol of that
    inner packet to set_inner_ipproto.

    Calling set_inner_ipproto after ip6_tnl_encap might break gso.
    For example, when encapsulating an ipv6 packet in a fou6 packet, inner_ipproto
    would be set to IPPROTO_UDP instead of IPPROTO_IPV6. This would lead to
    incorrect calling sequence of gso functions:
    ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> udp6_ufo_fragment
    instead of:
    ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> ip6ip6_gso_segment

    Fixes: 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support")
    Signed-off-by: Alexander Ovechkin
    Link: https://lore.kernel.org/r/20201029171012.20904-1-ovov@yandex-team.ru
    Signed-off-by: Jakub Kicinski

    Alexander Ovechkin
     
  • If netfilter changes the packet mark when mangling, the packet is
    rerouted using the route_me_harder set of functions. Prior to this
    commit, there's one big difference between route_me_harder and the
    ordinary initial routing functions, described in the comment above
    __ip_queue_xmit():

    /* Note: skb->sk can be different from sk, in case of tunnels */
    int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,

    That function goes on to correctly make use of sk->sk_bound_dev_if,
    rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
    tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
    It will make some transformations to that packet, and then it will send
    the encapsulated packet out of a *new* socket. That new socket will
    basically always have a different sk_bound_dev_if (otherwise there'd be
    a routing loop). So for the purposes of routing the encapsulated packet,
    the routing information as it pertains to the socket should come from
    that socket's sk, rather than the packet's original skb->sk. For that
    reason __ip_queue_xmit() and related functions all do the right thing.

    One might argue that all tunnels should just call skb_orphan(skb) before
    transmitting the encapsulated packet into the new socket. But tunnels do
    *not* do this -- and this is wisely avoided in skb_scrub_packet() too --
    because features like TSQ rely on skb->destructor() being called when
    that buffer space is truly available again. Calling skb_orphan(skb) too
    early would result in buffers filling up unnecessarily and accounting
    info being all wrong. Instead, additional routing must take into account
    the new sk, just as __ip_queue_xmit() notes.

    So, this commit addresses the problem by fishing the correct sk out of
    state->sk -- it's already set properly in the call to nf_hook() in
    __ip_local_out(), which receives the sk as part of its normal
    functionality. So we make sure to plumb state->sk through the various
    route_me_harder functions, and then make correct use of it following the
    example of __ip_queue_xmit().

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Jason A. Donenfeld
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jason A. Donenfeld
     

27 Oct, 2020

1 commit


26 Oct, 2020

1 commit


21 Oct, 2020

1 commit


20 Oct, 2020

1 commit

  • Fragmented ndisc packets assembled in netfilter are not dropped as specified
    in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
    Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

    Fix this by setting the IP6SKB_FRAGMENTED flag during reassembly.

    References: commit b800c3b966bc ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
    Signed-off-by: Georg Kohmann
    Signed-off-by: Pablo Neira Ayuso

    Georg Kohmann
     

16 Oct, 2020

3 commits

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multipath TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     
  • Minor conflicts in net/mptcp/protocol.h and
    tools/testing/selftests/net/Makefile.

    In both cases code was added on both sides in the same place
    so just keep both.

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Commit 4fc427e05158 ("ipv6_route_seq_next should increase position index")
    tried to fix the issue where seq_file pos is not increased
    if a NULL element is returned with seq_ops->next(). See bug
    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    The commit effectively does:
    - increase pos for all seq_ops->start()
    - increase pos for all seq_ops->next()

    For ipv6_route, increasing pos for all seq_ops->next() is correct.
    But increasing pos for seq_ops->start() is not correct
    since pos is used to determine how many items to skip during
    seq_ops->start():
    iter->skip = *pos;
    seq_ops->start() just fetches the *current* pos item.
    The item can be skipped only after seq_ops->show() which essentially
    is the beginning of seq_ops->next().

    For example, I have 7 ipv6 route entries,
    root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
    00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
    fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001 eth0
    00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
    00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo
    fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
    ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001 eth0
    00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
    0+1 records in
    0+1 records out
    1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
    root@arch-fb-vm1:~/net-next

    In the above, I specify buffer size 4096, so all records can be returned
    to user space with a single trip to the kernel.

    If I use a buffer size of 128, each record (149 bytes) is larger than the
    user buffer, so the kernel's seq_read() reads the record into its internal
    buffer and returns it to user space over two read() syscalls. The following
    read() syscall then triggers the next seq_ops->start(). Since the current
    implementation increases pos even in seq_ops->start(), records #2, #4 and
    #6 are skipped, counting the first record as #1.

    root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
    00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
    00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
    fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
    00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
    4+1 records in
    4+1 records out
    600 bytes copied, 0.00127758 s, 470 kB/s

    To fix the problem, create a fake pos pointer so seq_ops->start()
    won't actually increase seq_file pos. With this fix, the
    above `dd` command with `bs=128` shows the correct result.
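
    The skipping pattern can be reproduced outside the kernel with a small
    user-space model. This is only a sketch, not the kernel code: every
    toy_* name is hypothetical, and it assumes exactly one record fits per
    read(), so each read() calls start() once, shows one record, and calls
    next() once.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_NUM_RECORDS 7

    /* Hypothetical stand-in for seq_ops->start(): returns the index of the
     * record at *pos, or -1 past the end. The buggy variant also advances
     * *pos, which is the behaviour the commit removes. */
    static int toy_start(long *pos, bool buggy)
    {
        long skip = *pos;           /* mirrors "iter->skip = *pos" */

        if (buggy)
            ++*pos;                 /* start() must not consume a position */
        return skip < TOY_NUM_RECORDS ? (int)skip : -1;
    }

    /* Hypothetical stand-in for seq_ops->next(): always advances *pos. */
    static void toy_next(long *pos)
    {
        ++*pos;
    }

    /* Model a sequence of read() calls that each return a single record.
     * Stores the 1-based numbers of the records shown and returns the count. */
    static size_t toy_read_all(bool buggy, int *out, size_t cap)
    {
        long pos = 0;
        size_t n = 0;

        assert(out != NULL);
        for (;;) {
            int rec = toy_start(&pos, buggy);

            if (rec < 0 || n == cap)
                break;
            out[n++] = rec + 1;     /* seq_ops->show() */
            toy_next(&pos);         /* end of this read() */
        }
        return n;
    }
    ```

    With buggy set to true the model shows records #1, #3, #5 and #7, the
    same four records as in the `bs=128` output above; with buggy set to
    false it shows all seven.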

    Fixes: 4fc427e05158 ("ipv6_route_seq_next should increase position index")
    Cc: Alexei Starovoitov
    Suggested-by: Vasily Averin
    Reviewed-by: Vasily Averin
    Signed-off-by: Yonghong Song
    Acked-by: Martin KaFai Lau
    Acked-by: Andrii Nakryiko
    Signed-off-by: Jakub Kicinski

    Yonghong Song
     

15 Oct, 2020

1 commit

  • As per RFC4443, the destination address field for ICMPv6 error messages
    is copied from the source address field of the invoking packet.

    In configurations with Virtual Routing and Forwarding tables, looking up
    which routing table to use for sending ICMPv6 error messages is
    currently done by using the destination net_device.

    If the source and destination interfaces are within separate VRFs, or
    one in the global routing table and the other in a VRF, looking up the
    source address of the invoking packet in the destination interface's
    routing table will fail if the destination interface's routing table
    contains no route to the invoking packet's source address.

    One observable effect of this issue is that traceroute6 does not work in
    the following cases:

    - Route leaking between global routing table and VRF
    - Route leaking between VRFs

    Use the source device routing table when sending ICMPv6 error
    messages.
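
    The failure can be sketched with a toy model. Everything below is
    hypothetical (the toy_* types, the table ids, the exact-match lookup
    used in place of a real longest-prefix match); it only illustrates
    which device's routing table is consulted for the invoking packet's
    source address.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical route entry: a destination address in one routing table. */
    struct toy_route {
        int table;
        const char *dst;
    };

    /* Hypothetical net device bound to one routing table (VRF or global). */
    struct toy_dev {
        const char *name;
        int table;
    };

    /* Exact-match lookup; a real FIB does longest-prefix matching. */
    static bool toy_lookup(const struct toy_route *routes, size_t n,
                           int table, const char *addr)
    {
        for (size_t i = 0; i < n; i++)
            if (routes[i].table == table && strcmp(routes[i].dst, addr) == 0)
                return true;
        return false;
    }

    /* The ICMPv6 error goes back to the invoking packet's source address.
     * use_source_dev selects the table of the device the packet arrived on
     * (new behaviour) instead of the destination device (old behaviour). */
    static bool toy_icmp6_error_routable(const struct toy_route *routes, size_t n,
                                         const struct toy_dev *in_dev,
                                         const struct toy_dev *out_dev,
                                         const char *invoking_src,
                                         bool use_source_dev)
    {
        const struct toy_dev *dev = use_source_dev ? in_dev : out_dev;

        assert(dev != NULL);
        return toy_lookup(routes, n, dev->table, invoking_src);
    }
    ```

    If the source host is only routed in the ingress device's table, the
    lookup fails with use_source_dev set to false and succeeds with it set
    to true.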

    [ In the context of ipv4, it has been pointed out that a similar issue
    may exist with ICMP errors triggered when forwarding between network
    namespaces. It would be worthwhile to investigate whether ipv6 has
    similar issues, but that is outside the scope of this investigation. ]

    [ Testing shows that similar issues exist with ipv6 unreachable /
    fragmentation needed messages. However, investigation of this
    additional failure mode is beyond this investigation's scope. ]

    Link: https://tools.ietf.org/html/rfc4443
    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Mathieu Desnoyers
     

14 Oct, 2020

2 commits

  • Replace commas with semicolons. Commas introduce unnecessary
    variability in the code structure and are hard to see. What is done
    is essentially described by the following Coccinelle semantic patch
    (http://coccinelle.lip6.fr/):

    //
    @@ expression e1,e2; @@
    e1
    -,
    +;
    e2
    ... when any
    //
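
    As a minimal illustration of what the transformation does and why the
    comma form is easy to misread, consider the sketch below. The toy_*
    names are hypothetical and are not taken from the patched files.

    ```c
    #include <assert.h>

    struct toy_cfg {
        int mtu;
        int flags;
    };

    /* Before: two assignments joined by the comma operator into a single
     * expression statement. */
    static void toy_init_comma(struct toy_cfg *c)
    {
        assert(c != 0);
        c->mtu = 1500,
        c->flags = 1;
    }

    /* After: the same assignments as two separate statements. */
    static void toy_init_semicolon(struct toy_cfg *c)
    {
        assert(c != 0);
        c->mtu = 1500;
        c->flags = 1;
    }

    /* Why the comma is hard to see: both assignments below belong to the
     * "if" body, because the comma keeps them in one statement. */
    static int toy_guarded_comma(int cond)
    {
        int a = 0, b = 0;   /* declarator commas, not the comma operator */

        if (cond)
            a = 1,
            b = 2;
        return a + b;
    }
    ```

    The two init functions behave identically. In toy_guarded_comma(),
    replacing the comma with a semicolon without adding braces would make
    `b = 2` unconditional, which is the kind of hidden structure the
    commit message refers to.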

    Signed-off-by: Julia Lawall
    Acked-by: Paul Moore
    Link: https://lore.kernel.org/r/1602412498-32025-5-git-send-email-Julia.Lawall@inria.fr
    Signed-off-by: Jakub Kicinski

    Julia Lawall
     
  • Dump the VLAN tag and protocol for the usual VLAN offload case if the
    NF_LOG_MACDECODE flag is set. Without this information the logging is
    misleading as there is no reference to the VLAN header.

    [12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
    [12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1
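
    The two added fields in the samples above can be produced by a
    formatter along the following lines. This is a user-space sketch, not
    the kernel implementation: toy_log_vlan() and TOY_VLAN_VID_MASK are
    hypothetical names, and masking the tag control information to its
    12-bit VLAN ID is an assumption.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TOY_VLAN_VID_MASK 0x0fff  /* low 12 bits of the TCI hold the VID */

    /* Format the VLAN protocol (host byte order) and VLAN ID in the same
     * "VPROTO=... VID=... " shape as the log samples. Returns the value of
     * snprintf(), i.e. the number of characters the full string needs. */
    static int toy_log_vlan(char *buf, size_t len, uint16_t vproto, uint16_t tci)
    {
        assert(buf != NULL && len > 0);
        return snprintf(buf, len, "VPROTO=%04x VID=%u ",
                        (unsigned int)vproto,
                        (unsigned int)(tci & TOY_VLAN_VID_MASK));
    }
    ```

    Calling toy_log_vlan() with vproto 0x8100 and a TCI of 10 yields the
    "VPROTO=8100 VID=10 " fragment seen in both sample lines.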

    Fixes: 83e96d443b37 ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso