17 Jan, 2021

5 commits

  • [ Upstream commit b19218b27f3477316d296e8bcf4446aaf017aa69 ]

    The function nh_check_attr_group() is called to validate nexthop groups.
    The intention of that code seems to have been to bounce all attributes
    above NHA_GROUP_TYPE except for NHA_FDB. However, instead it bounces all
    these attributes except when the NHA_FDB attribute is present, in which
    case it accepts all of them.

    NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
    bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
    back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

    But that still leaves NHA_GATEWAY as an attribute that would be accepted in
    FDB nexthop groups (with no meaning), so long as it keeps the address
    family as unspecified:

    # ip nexthop add id 1 fdb via 127.0.0.1
    # ip nexthop add id 10 fdb via default group 1

    The nexthop code is still relatively new and likely not used very broadly,
    and the FDB bits are newer still. Even though there is a reproducer out
    there, it relies on improbable gateway arguments "via default", "via
    all" or "via any". Given all this, I believe it is OK to reformulate the
    condition to do the right thing and bounce NHA_GATEWAY.

    Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Petr Machata
     
  • [ Upstream commit 7b01e53eee6dce7a8a6736e06b99b68cd0cc7a27 ]

    In case of error, remove the nexthop group entry from the list to which
    it was previously added.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Petr Machata
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 07e61a979ca4dddb3661f59328b3cd109f6b0070 ]

    A reference was not taken for the current nexthop entry, so do not try
    to put it in the error path.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Petr Machata
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit bb4cc1a18856a73f0ff5137df0c2a31f4c50f6cf ]

    Conntrack reassembly records the largest fragment size seen in IPCB.
    However, when this gets forwarded/transmitted, fragmentation will only
    be forced if one of the fragmented packets had the DF bit set.

    In that case, a flag in IPCB will force fragmentation even if the
    MTU is large enough.

    This should work fine, but this breaks with ip tunnels.
    Consider a client that sends a UDP datagram of size X to another host.

    The client fragments the datagram, so two packets, of size y and z, are
    sent. DF bit is not set on any of these packets.

    Middlebox netfilter reassembles those packets back to single size-X
    packet, before routing decision.

    packet-size-vs-mtu checks in ip_forward are irrelevant, because the DF
    bit isn't set. At output time, ip refragmentation is skipped as well
    because X is still smaller than the MTU of the output device.

    If the transmit device is an ip tunnel, the packet size increases to
    X+overhead.

    Also, tunnel might be configured to force DF bit on outer header.

    In this case, packet will be dropped (exceeds MTU) and an ICMP error is
    generated back to sender.

    But sender already respects the announced MTU, all the packets that
    it sent did fit the announced mtu.

    Force refragmentation as per original sizes unconditionally so ip tunnel
    will encapsulate the fragments instead.

    The only other solution I see is to place ip refragmentation in
    the ip_tunnel code to handle this case.
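    The size arithmetic in this scenario can be sketched with hypothetical
    numbers (the MTU, X and the tunnel overhead below are made up for
    illustration):

```shell
mtu=1500        # hypothetical output device MTU
x=1492          # hypothetical reassembled packet size X (fits the MTU)
overhead=20     # hypothetical ip tunnel encapsulation overhead

# X itself fits, so refragmentation is skipped at output time...
[ "$x" -le "$mtu" ] && echo "no refragmentation at output"
# ...but after tunnel encapsulation the packet no longer fits.
[ $(( x + overhead )) -gt "$mtu" ] && echo "exceeds MTU inside the tunnel"
```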

    Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet")
    Reported-by: Christian Perle
    Signed-off-by: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • [ Upstream commit 50c661670f6a3908c273503dfa206dfc7aa54c07 ]

    For some reason ip_tunnel insists on setting the DF bit anyway when the
    inner header has the DF bit set, EVEN if the tunnel was configured with
    'nopmtudisc'.

    This means that the script added in the previous commit
    cannot be made to work by adding the 'nopmtudisc' flag to the
    ip tunnel configuration. Doing so breaks connectivity even for the
    without-conntrack/netfilter scenario.

    When nopmtudisc is set, the tunnel will skip the mtu check, so no
    icmp error is sent to the client. Then, because the inner header has DF
    set, the outer header gets added with the DF bit set as well.

    IP stack then sends an error to itself because the packet exceeds
    the device MTU.

    Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.")
    Cc: Stefano Brivio
    Signed-off-by: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     

13 Jan, 2021

3 commits

  • commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.

    Fix the RCU pointer dereference so that it is performed while holding
    the appropriate xtables lock.

    Reported-by: kernel test robot
    Fixes: cc00bcaa5899 ("netfilter: x_tables: Switch synchronization to RCU")
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Subash Abhinov Kasiviswanathan
     
  • [ Upstream commit 085c7c4e1c0e50d90b7d90f61a12e12b317a91e2 ]

    Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
    have an erspan header. So the check in gre_parse_header() is wrong,
    we have to distinguish version 1 from version 0.

    We can just check the gre header length like is_erspan_type1().

    Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
    Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
    Cc: William Tu
    Cc: Lorenzo Bianconi
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit 21fdca22eb7df2a1e194b8adb812ce370748b733 ]

    RT_TOS() only clears one of the ECN bits. Therefore, when
    fib_compute_spec_dst() resorts to a fib lookup, it can return
    different results depending on the value of the second ECN bit.

    For example, ECT(0) and ECT(1) packets could be treated differently.

    $ ip netns add ns0
    $ ip netns add ns1
    $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
    $ ip -netns ns0 link set dev lo up
    $ ip -netns ns1 link set dev lo up
    $ ip -netns ns0 link set dev veth01 up
    $ ip -netns ns1 link set dev veth10 up

    $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
    $ ip -netns ns1 address add 192.0.2.11/24 dev veth10

    $ ip -netns ns1 address add 192.0.2.21/32 dev lo
    $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
    $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0

    With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
    (ping uses -Q to set all TOS and ECN bits):

    $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
    [...]
    64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms

    But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
    because the "tos 4" route isn't matched:

    $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
    [...]
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms

    After this patch the ECN bits don't affect the result anymore:

    $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
    [...]
    64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
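    The masking at the root of this can be sketched numerically; the mask
    values below mirror the kernel's IPTOS_TOS_MASK (0x1e) and IPTOS_RT_MASK
    (0x1c) definitions, and the TOS values match the ping commands above:

```shell
# RT_TOS() masks with IPTOS_TOS_MASK (0x1e): it clears only one ECN bit.
rt_tos() { echo $(( $1 & 0x1e )); }
# IPTOS_RT_MASK (0x1c) clears both ECN bits.
rt_mask() { echo $(( $1 & 0x1c )); }

rt_tos 5   # TOS 4 + ECT(1): 5 & 0x1e = 4 -> the "tos 4" route matches
rt_tos 6   # TOS 4 + ECT(0): 6 & 0x1e = 6 -> the "tos 4" route is missed
rt_mask 6  # TOS 4 + ECT(0): 6 & 0x1c = 4 -> both ECN bits ignored
```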

    Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.")
    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Guillaume Nault
     

26 Dec, 2020

1 commit

  • commit c9f64d1fc101c64ea2be1b2e562b4395127befc9 upstream.

    When dumping the name and NTP servers advertised by DHCP, a blank line
    is emitted if either of the lists is empty. This can lead to confusing
    issues such as the blank line getting flagged as a warning. This happens
    because the blank line is the result of pr_cont("\n") and that may see
    its level corrupted by some other driver concurrently writing to the
    console.

    Fix this by making sure that the terminating newline is only emitted
    if at least one entry in the lists was printed before.

    Reported-by: Jon Hunter
    Signed-off-by: Thierry Reding
    Link: https://lore.kernel.org/r/20201110073757.1284594-1-thierry.reding@gmail.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Thierry Reding
     

10 Dec, 2020

3 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
    from Subash Abhinov Kasiviswanathan.

    2) Fix netlink dump of dynset timeouts later than 23 days.

    3) Add comment for the indirect serialization of the nft commit mutex
    with rtnl_mutex.

    4) Remove bogus check for confirmed conntrack when matching on the
    conntrack ID, from Brett Mastbergen.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
    into persistent scenarios where we have the following sequence:

    (1) ACK for full-sized skb of N*MSS arrives
    -> tcp_write_xmit() transmit full-sized skb with N*MSS
    -> move pacing release time forward
    -> exit tcp_write_xmit() because pacing time is in the future

    (2) TSQ callback or TCP internal pacing timer fires
    -> try to transmit next skb, but TSO deferral finds remainder of
    available cwnd is not big enough to trigger an immediate send
    now, so we defer sending until the next ACK.

    (3) repeat...

    So we can get into a case where we never mark ourselves as
    cwnd-limited for many seconds at a time, even with
    bulk/infinite-backlog senders, because:

    o In case (1) above, every time in tcp_write_xmit() we have enough
    cwnd to send a full-sized skb, we are not fully using the cwnd
    (because cwnd is not a multiple of the TSO skb size). So every time we
    send data, we are not cwnd limited, and so in the cwnd-limited
    tracking code in tcp_cwnd_validate() we mark ourselves as not
    cwnd-limited.

    o In case (2) above, every time in tcp_write_xmit() that we try to
    transmit the "remainder" of the cwnd but defer, we set the local
    variable is_cwnd_limited to true, but we do not send any packets, so
    sent_pkts is zero, so we don't call the cwnd-limited logic to update
    tp->is_cwnd_limited.
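    The arithmetic behind this persistent pattern can be sketched with
    made-up numbers (the segment counts below are hypothetical):

```shell
n=44        # hypothetical segments per full-sized TSO skb
cwnd=100    # hypothetical cwnd, in segments; not a multiple of n

sent=$(( (cwnd / n) * n ))   # segments sent as full-sized skbs in case (1)
left=$(( cwnd % n ))         # remainder that TSO deferral holds back in case (2)
echo "sent=$sent left=$left" # the 'left' segments keep being deferred
```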

    Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
    Reported-by: Ingemar Johansson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Neal Cardwell
     
  • For DCTCP, we have to retain the ECT bits set by the congestion control
    algorithm on the socket when reflecting syn TOS in syn-ack, in order to
    make ECN work properly.

    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Reported-by: Alexander Duyck
    Signed-off-by: Wei Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

09 Dec, 2020

1 commit

  • Before commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.

    This is no longer the case, and Hazem Mohamed Abuelfotoh reported
    that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.

    Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd as the upper
    limit for the rcvq_space.space computation, while it can later select a
    smaller value for tp->rcv_ssthresh and tp->window_clamp.

    ss -temoi on the receiver would show:

    skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596

    This means that TCP can not increase its window in tcp_grow_window(),
    and that DRS can never kick in.

    Fix this by making sure that rcvq_space.space is not bigger than the
    number of bytes that can be held in the TCP receive queue.
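    Numerically, the clamp amounts to taking a minimum; the rcv_space value
    below reuses the figure from the ss output above, while the
    receive-queue capacity figure is hypothetical:

```shell
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

rcv_space=62496    # rcvq_space.space, as in the ss output above
queue_bytes=56596  # hypothetical number of bytes the receive queue can hold
min "$rcv_space" "$queue_bytes"   # clamped rcvq_space.space
```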

    People unable/unwilling to change their kernel can work around this
    issue by selecting a bigger tcp_rmem[1] value, as in:

    echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem

    Based on an initial report and patch from Hazem Mohamed Abuelfotoh
    https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/

    Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
    Reported-by: Hazem Mohamed Abuelfotoh
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Dec, 2020

1 commit

    When running concurrent iptables rules replacement with data, the
    per-CPU sequence count is checked after the assignment of the new
    information. The sequence count is used to synchronize with the packet
    path without the use of any explicit locking. If there are any packets
    in the packet path using the table information, the sequence count is
    incremented to an odd value and is incremented to an even value after
    packet processing completes.

    The new table value assignment is followed by a write memory barrier so every
    CPU should see the latest value. If the packet path has started with the old
    table information, the sequence counter will be odd and the iptables
    replacement will wait till the sequence count is even prior to freeing the
    old table info.

    However, this assumes that the new table information assignment and the
    memory barrier are actually executed prior to the counter check in the
    replacement thread. If the CPU decides to execute the assignment later,
    as there is no user of the table information prior to the sequence
    check, the packet path in another CPU may use the old table information.
    The replacement thread would then free the table information under it,
    leading to a use-after-free in the packet processing context:

    Unable to handle kernel NULL pointer dereference at virtual
    address 000000000000008e
    pc : ip6t_do_table+0x5d0/0x89c
    lr : ip6t_do_table+0x5b8/0x89c
    ip6t_do_table+0x5d0/0x89c
    ip6table_filter_hook+0x24/0x30
    nf_hook_slow+0x84/0x120
    ip6_input+0x74/0xe0
    ip6_rcv_finish+0x7c/0x128
    ipv6_rcv+0xac/0xe4
    __netif_receive_skb+0x84/0x17c
    process_backlog+0x15c/0x1b8
    napi_poll+0x88/0x284
    net_rx_action+0xbc/0x23c
    __do_softirq+0x20c/0x48c

    This could be fixed by forcing instruction order after the new table
    information assignment or by switching to RCU for the synchronization.
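    The odd/even seqcount protocol described above can be sketched as a
    single-threaded model (the real code uses per-CPU seqcounts, and the bug
    is precisely that the memory ordering this model takes for granted is
    not guaranteed):

```shell
seq=0
packet_enter() { seq=$(( seq + 1 )); }  # odd: packet path is using the table info
packet_exit()  { seq=$(( seq + 1 )); }  # even: packet path is done with it
may_free_old() { [ $(( seq % 2 )) -eq 0 ] && echo yes || echo no; }

packet_enter
may_free_old   # "no": the replacement thread must keep waiting
packet_exit
may_free_old   # "yes": the old table info may be freed
```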

    Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
    Reported-by: Sean Tranchetti
    Reported-by: kernel test robot
    Suggested-by: Florian Westphal
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Pablo Neira Ayuso

    Subash Abhinov Kasiviswanathan
     

07 Dec, 2020

1 commit

    Guillaume noticed that, for segments, udp_queue_rcv_one_skb() returns
    the proto, and it should pass "ret" unmodified to
    ip_protocol_deliver_rcu(). Otherwise, with a negative value passed, it
    will underflow inet_protos.

    This can be reproduced with IPIP FOU:

    # ip fou add port 5555 ipproto 4
    # ethtool -K eth1 rx-gro-list on

    Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
    Reported-by: Guillaume Nault
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

05 Dec, 2020

1 commit

  • Fix to return a negative error code from the error handling
    case instead of 0, as done elsewhere in this function.

    Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Changzhong
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
    Signed-off-by: Jakub Kicinski

    Zhang Changzhong
     

29 Nov, 2020

1 commit

  • When inet_rtm_getroute() was converted to use the RCU variants of
    ip_route_input() and ip_route_output_key(), the TOS parameters
    stopped being masked with IPTOS_RT_MASK before doing the route lookup.

    As a result, "ip route get" can return a different route than what
    would be used when sending real packets.

    For example:

    $ ip route add 192.0.2.11/32 dev eth0
    $ ip route add unreachable 192.0.2.11/32 tos 2
    $ ip route get 192.0.2.11 tos 2
    RTNETLINK answers: No route to host

    But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
    actually be routed using the first route:

    $ ping -c 1 -Q 2 192.0.2.11
    PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms

    --- 192.0.2.11 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms

    This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
    return results consistent with real route lookups.
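    The effect of re-applying the mask can be sketched numerically (0x1c
    mirrors the kernel's IPTOS_RT_MASK definition):

```shell
# inet_rtm_getroute() now masks the user-supplied TOS the same way real
# route lookups do, so "tos 2" (pure ECN bits) is looked up as TOS 0.
rt_mask() { echo $(( $1 & 0x1c )); }

rt_mask 2   # prints 0: the lookup now matches the first route above,
            # consistent with what real packets actually use
```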

    Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
    Signed-off-by: Guillaume Nault
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
    Signed-off-by: Jakub Kicinski

    Guillaume Nault
     

25 Nov, 2020

1 commit

    When a BPF program is used to select between TCP congestion control
    algorithms that either use ECN or not, there is a case where the
    synack for the flow was coming up without the ECT0 bit set. A bit of
    research found that this was due to the final socket being configured to
    dctcp while the listener socket was staying in cubic.

    To reproduce it, all that is needed is to monitor TCP traffic while
    running the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
    assuming tcp_dctcp module is loaded or compiled in and the traffic matches
    the rules in the sample file, is that for all frames with the exception of
    the synack the ECT0 bit is set.

    To address that it is necessary to make one additional call to
    tcp_bpf_ca_needs_ecn using the request socket and then use the output of
    that to set the ECT0 bit for the tos/tclass of the packet.

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Alexander Duyck
    Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     

24 Nov, 2020

1 commit

  • When the TCP stack is in SYN flood mode, the server child socket is
    created from the SYN cookie received in a TCP packet with the ACK flag
    set.

    The child socket is created when the server receives the first TCP
    packet with a valid SYN cookie from the client. Usually, this packet
    corresponds to the final step of the TCP 3-way handshake, the ACK
    packet. But it is also possible to receive a valid SYN cookie from the
    first TCP data packet sent by the client, and thus create a child socket
    from that SYN cookie.

    Since a client socket is ready to send data as soon as it receives the
    SYN+ACK packet from the server, the client can send the ACK packet (sent
    by the TCP stack code), and the first data packet (sent by the userspace
    program) almost at the same time, and thus the server will equally
    receive the two TCP packets with valid SYN cookies almost at the same
    instant.

    When such an event happens, the TCP stack code has a race condition that
    occurs between the moment a lookup is done in the established
    connections hashtable to check for the existence of a connection for the
    same client, and the moment that the child socket is added to the
    established connections hashtable. As a consequence, this race condition
    can lead to a situation where we add two child sockets to the
    established connections hashtable and deliver two sockets for the same
    client to the userspace program.

    This patch fixes the race condition by checking if an existing child
    socket exists for the same client when we are adding the second child
    socket to the established connections hashtable. If an existing child
    socket exists, we drop the packet and discard the second child socket.

    Signed-off-by: Ricardo Dias
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
    Signed-off-by: Jakub Kicinski

    Ricardo Dias
     

21 Nov, 2020

2 commits

  • When setting congestion control via a BPF program it is seen that the
    SYN/ACK for packets within a given flow will not include the ECT0 flag. A
    bit of simple printk debugging shows that when this is configured without
    BPF we will see the value INET_ECN_xmit value initialized in
    tcp_assign_congestion_control however when we configure this via BPF the
    socket is in the closed state and as such it isn't configured, and I do not
    see it being initialized when we transition the socket into the listen
    state. The result of this is that the ECT0 bit is configured based on
    whatever the default state is for the socket.

    An easy way to reproduce this is to monitor the following with tcpdump:
    tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca

    Without this patch the SYN/ACK will follow whatever the default is. If
    it is dctcp, all SYN/ACK packets will have the ECT0 bit set; if it is
    not, ECT0 will be cleared on all SYN/ACK packets. With this patch
    applied, the SYN/ACK bit matches the value seen on the other packets in
    the given stream.

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     
  • An issue was recently found where DCTCP SYN/ACK packets did not have the
    ECT bit set in the L3 header. A bit of code review found that the recent
    change referenced below had gone through and added a mask that prevented
    the ECN bits from being populated in the L3 header.

    This patch addresses that by rolling back the mask so that it is only
    applied to the flags coming from the incoming TCP request instead of
    applying it to the socket tos/tclass field. Doing this the ECT bits were
    restored in the SYN/ACK packets in my testing.

    One thing that is not addressed by this patch set is the fact that
    tcp_reflect_tos appears to be incompatible with ECN based congestion
    avoidance algorithms. At a minimum, the feature should likely be
    documented, which it currently isn't.

    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Signed-off-by: Alexander Duyck
    Acked-by: Wei Wang
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     

20 Nov, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    1) libbpf should not attempt to load unused subprogs, from Andrii.

    2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.

    3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.

    4) Fix several sockmap issues, from John.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
    fail_function: Remove a redundant mutex unlock
    selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
    lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
    libbpf: Fix VERSIONED_SYM_COUNT number parsing
    bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
    bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
    bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
    bpf, sockmap: Use truesize with sk_rmem_schedule()
    bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
    bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
    selftests/bpf: Fix error return code in run_getsockopt_test()
    bpf: Relax return code check for subprograms
    tools, bpftool: Add missing close before bpftool net attach exit
    MAINTAINERS/bpf: Update Andrii's entry.
    selftests/bpf: Fix unused attribute usage in subprogs_unused test
    bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
    bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
    libbpf: Don't attempt to load unused subprog as an entry-point BPF program
    ====================

    Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

18 Nov, 2020

5 commits

  • Checking for ifdef CONFIG_x fails if CONFIG_x=m.

    Use IS_ENABLED instead, which is true for both built-ins and modules.

    Otherwise, a
    > ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
    fails with the message "Error: IPv6 support not enabled in kernel." if
    CONFIG_IPV6 is `m`.

    In the spirit of b8127113d01e53adba15b41aefd37b90ed83d631.

    Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes")
    Cc: Kim Phillips
    Signed-off-by: Florian Klink
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.de
    Signed-off-by: Jakub Kicinski

    Florian Klink
     
  • nlmsg_cancel() needs to be called in the error path of
    inet_req_diag_fill to cancel the message.

    Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com
    Signed-off-by: Jakub Kicinski

    Wang Hai
     
  • Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
    allows users to tune SO_RCVBUF and sockmap will honor them.

    We can refactor the if(charge) case out in later patches. But, keep this
    fix to the point.

    Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Suggested-by: Jakub Sitnicki
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370

    John Fastabend
     
    If copy_page_to_iter() fails, or even partially completes but with
    fewer bytes copied than expected, we currently reset sg.start and
    return EFAULT. This proves problematic if we already copied data into
    the user buffer
    before we return an error. Because we leave the copied data in the user
    buffer and fail to unwind the scatterlist so kernel side believes data
    has been copied and user side believes data has _not_ been received.

    Expected behavior should be to return the number of bytes copied and
    then on the next read we need to return the error, assuming it's still
    there. This can happen if we have a copy length spanning multiple
    scatterlist elements and one or more complete before the error is hit.

    The error is rare enough that in my normal testing with server-side
    programs, such as nginx, httpd, envoy, etc., I have never seen it. The
    only reliable way to reproduce that I've found is to stream movies over
    my browser for a day or so and wait for it to hang. Not very scientific,
    but with a few extra WARN_ON()s in the code the bug was obvious.

    When we review the errors from copy_page_to_iter() it seems we are hitting
    a page fault from copy_page_to_iter_iovec() where the code checks
    fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
    also seems typical server applications don't hit this case.

    The other way to try and reproduce this is run the sockmap selftest tool
    test_sockmap with data verification enabled, but it doesn't reproduce the
    fault. Perhaps we can trigger this case artificially somehow from the
    test tools. I haven't sorted out a way to do that yet though.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370

    John Fastabend
     
  • During loss recovery, retransmitted packets are forced to use TCP
    timestamps to calculate the RTT samples, which have a millisecond
    granularity. BBR is designed using a microsecond granularity. As a
    result, multiple RTT samples could be truncated to the same RTT value
    during loss recovery. This is problematic, as BBR will not enter
    PROBE_RTT if the RTT sample is <= the current min_rtt sample, rather
    than only when it is strictly lower.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Yuchung Cheng
    Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
    Signed-off-by: Jakub Kicinski

    Ryan Sharpelletti
     

14 Nov, 2020

1 commit

  • Commit 58956317c8de ("neighbor: Improve garbage collection")
    guarantees neighbour table entries a five-second lifetime. Processes
    which make heavy use of multicast can fill the neighbour table with
    multicast addresses in five seconds. At that point, neighbour entries
    can't be GC-ed because they aren't five seconds old yet, the kernel
    log starts to fill up with "neighbor table overflow!" messages, and
    sends start to fail.

    This patch allows multicast addresses to be thrown out before they've
    lived out their five seconds. This makes room for non-multicast
    addresses and makes messages to all addresses more reliable in these
    circumstances.

    Fixes: 58956317c8de ("neighbor: Improve garbage collection")
    Signed-off-by: Jeff Dike
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
    Signed-off-by: Jakub Kicinski

    Jeff Dike
     

13 Nov, 2020

2 commits

  • udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
    packet. While it's probably OK for non-frag0 paths, these helpers
    will also point to junk on Fast/frag0 GRO when all headers are
    located in frags. As a result, sk/skb lookup may fail or give wrong
    results. To support both GRO modes, skb_gro_network_header() might
    be used. To not modify original functions, add private versions of
    udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

    Present since the introduction of "application-level" UDP GRO
    in 4.7-rc1.

    Misc: replace totally unneeded ternaries with plain ifs.

    Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket")
    Suggested-by: Willem de Bruijn
    Cc: Eric Dumazet
    Signed-off-by: Alexander Lobakin
    Acked-by: Willem de Bruijn
    Signed-off-by: Jakub Kicinski

    Alexander Lobakin
     
  • UDP GRO uses udp_hdr(skb) in its .gro_receive() callback. While it's
    probably OK for non-frag0 paths (when all headers or even the entire
    frame are already in skb head), this inline points to junk when
    using Fast GRO (napi_gro_frags() or napi_gro_receive() with only
    Ethernet header in skb head and all the rest in the frags) and breaks
    GRO packet compilation and the packet flow itself.
    To support both modes, skb_gro_header_fast() + skb_gro_header_slow()
    are typically used. UDP even has an inline helper that makes use of
    them, udp_gro_udphdr(). Use that instead of troublemaking udp_hdr()
    to get rid of the out-of-order deliveries.

    Present since the introduction of plain UDP GRO in 5.0-rc1.

    Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
    Cc: Eric Dumazet
    Signed-off-by: Alexander Lobakin
    Acked-by: Willem de Bruijn
    Signed-off-by: Jakub Kicinski

    Alexander Lobakin
     

11 Nov, 2020

1 commit

  • When net.ipv4.tcp_syncookies=1 and a SYN flood happens,
    cookie_v4_check() or cookie_v6_check() tries to redo what
    tcp_v4_send_synack() or tcp_v6_send_synack() did. If SOCK_RCVBUF
    is set, rsk_window_clamp will be changed, which makes rcv_wscale
    different. The client still operates with the initial window scale
    and can overshoot the granted window: the client uses the initial
    scale while the local server uses the new scale to advertise the
    window value, so the session works abnormally.

    Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
    Signed-off-by: Mao Wenan
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
    Signed-off-by: Jakub Kicinski

    Mao Wenan
     

10 Nov, 2020

1 commit

  • Jianlin reports that a bridged IPv6 VXLAN endpoint, carrying IPv6
    packets over a link with a PMTU estimation of exactly 1350 bytes,
    won't trigger ICMPv6 Packet Too Big replies when the encapsulated
    datagrams exceed said PMTU value. VXLAN over IPv6 adds 70 bytes of
    overhead, so an ICMPv6 reply indicating 1280 bytes as inner MTU
    would be legitimate and expected.

    This comes from an off-by-one error I introduced in checks added
    as part of commit 4cb47a8644cc ("tunnels: PMTU discovery support
    for directly bridged IP packets"), whose purpose was to prevent
    sending ICMPv6 Packet Too Big messages with an MTU lower than the
    smallest permissible IPv6 link MTU, i.e. 1280 bytes.

    In iptunnel_pmtud_check_icmpv6(), avoid triggering a reply only if
    the advertised MTU would be less than, and not equal to, 1280 bytes.

    Also fix the analogous comparison for IPv4, that is, skip the ICMP
    reply only if the resulting MTU is strictly less than 576 bytes.

    This becomes apparent while running the net/pmtu.sh bridged VXLAN
    or GENEVE selftests with adjusted lower-link MTU values. Using
    e.g. GENEVE, setting ll_mtu to the values reported below, in the
    test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() test
    function, we can see failures on the following tests:

    test | ll_mtu
    -------------------------------|--------
    pmtu_ipv4_br_geneve4_exception | 626
    pmtu_ipv6_br_geneve4_exception | 1330
    pmtu_ipv6_br_geneve6_exception | 1350

    owing to the different tunneling overheads implied by the
    corresponding configurations.

    Reported-by: Jianlin Shi
    Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
    Signed-off-by: Stefano Brivio
    Link: https://lore.kernel.org/r/4f5fc2f33bfdf8409549fafd4f952b008bf04d63.1604681709.git.sbrivio@redhat.com
    Signed-off-by: Jakub Kicinski

    Stefano Brivio
     

05 Nov, 2020

1 commit


01 Nov, 2020

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Incorrect netlink report logic in flowtable and genID.

    2) Add a selftest to check that wireguard passes the right sk
    to ip_route_me_harder, from Jason A. Donenfeld.

    3) Pass the actual sk to ip_route_me_harder(), also from Jason.

    4) Missing expression validation of updates via nft --check.

    5) Update byte and packet counters regardless of whether they
    match, from Stefano Brivio.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Tunnel devices such as vxlan, bareudp and geneve in lwt mode set
    the outer DF bit based only on TUNNEL_DONT_FRAGMENT. This was also
    the behavior of the gre device before it switched to
    ip_md_tunnel_xmit in commit 962924fa2b7a ("ip_gre: Refactor collect
    metatdata mode tunnel xmit to ip_md_tunnel_xmit").

    When ip_gre in lwt mode transmits via ip_md_tunnel_xmit, the rule
    changed, creating a discrepancy in DF handling between the
    different tunnels. ip_md_tunnel_xmit should therefore follow the
    same rule as the other tunnels.

    Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
    Signed-off-by: wenxu
    Link: https://lore.kernel.org/r/1604028728-31100-1-git-send-email-wenxu@ucloud.cn
    Signed-off-by: Jakub Kicinski

    wenxu
     

30 Oct, 2020

1 commit

  • If netfilter changes the packet mark when mangling, the packet is
    rerouted using the route_me_harder set of functions. Prior to this
    commit, there's one big difference between route_me_harder and the
    ordinary initial routing functions, described in the comment above
    __ip_queue_xmit():

    /* Note: skb->sk can be different from sk, in case of tunnels */
    int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,

    That function goes on to correctly make use of sk->sk_bound_dev_if,
    rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
    tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
    It will make some transformations to that packet, and then it will send
    the encapsulated packet out of a *new* socket. That new socket will
    basically always have a different sk_bound_dev_if (otherwise there'd be
    a routing loop). So for the purposes of routing the encapsulated packet,
    the routing information as it pertains to the socket should come from
    that socket's sk, rather than the packet's original skb->sk. For that
    reason __ip_queue_xmit() and related functions all do the right thing.

    One might argue that all tunnels should just call skb_orphan(skb) before
    transmitting the encapsulated packet into the new socket. But tunnels do
    *not* do this -- and this is wisely avoided in skb_scrub_packet() too --
    because features like TSQ rely on skb->destructor() being called when
    that buffer space is truly available again. Calling skb_orphan(skb) too
    early would result in buffers filling up unnecessarily and accounting
    info being all wrong. Instead, additional routing must take into account
    the new sk, just as __ip_queue_xmit() notes.

    So, this commit addresses the problem by fishing the correct sk out of
    state->sk -- it's already set properly in the call to nf_hook() in
    __ip_local_out(), which receives the sk as part of its normal
    functionality. So we make sure to plumb state->sk through the various
    route_me_harder functions, and then make correct use of it following the
    example of __ip_queue_xmit().

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Jason A. Donenfeld
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jason A. Donenfeld
     

24 Oct, 2020

1 commit

  • With SO_RCVLOWAT, under memory pressure,
    it is possible to enter a state where:

    1. We have not received enough bytes to satisfy SO_RCVLOWAT.
    2. We have not entered buffer pressure (see tcp_rmem_pressure()).
    3. But, we do not have enough buffer space to accept more packets.

    In this case, we advertise 0 rwnd (due to #3) but the application does
    not drain the receive queue (no wakeup because of #1 and #2) so the
    flow stalls.

    Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
    a small rwnd, we wake the application to drain the receive queue
    and avoid the stall.
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
    Signed-off-by: Jakub Kicinski

    Arjun Roy
     

23 Oct, 2020

1 commit

  • In the header prediction fast path for a bulk data receiver, if no
    data is newly acknowledged then we do not call tcp_ack() and do not
    call tcp_ack_update_window(). This means that a bulk receiver that
    receives large amounts of data can have the incoming sequence numbers
    wrap, so that the check in tcp_may_update_window fails:
    after(ack_seq, tp->snd_wl1)

    If the incoming receive windows are zero in this state, and then the
    connection that was a bulk data receiver later wants to send data,
    that connection can find itself persistently rejecting the window
    updates in incoming ACKs. This means the connection can persistently
    fail to discover that the receive window has opened, which in turn
    means that the connection is unable to send anything, and the
    connection's sending process can get permanently "stuck".

    The fix is to update snd_wl1 in the header prediction fast path for a
    bulk data receiver, so that it keeps up and does not see wrapping
    problems.

    This fix is based on a very nice and thorough analysis and diagnosis
    by Apollon Oikonomopoulos (see link below).

    This is a stable candidate but there is no Fixes tag here since the
    bug predates current git history. Just for fun: looks like the bug
    dates back to when header prediction was added in Linux v2.1.8 in Nov
    1996. In that version tcp_rcv_established() was added, and the code
    only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
    receiver" code path it does not call tcp_ack(). This fix seems to
    apply cleanly at least as far back as v3.2.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Neal Cardwell
    Reported-by: Apollon Oikonomopoulos
    Tested-by: Apollon Oikonomopoulos
    Link: https://www.spinics.net/lists/netdev/msg692430.html
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Neal Cardwell
     

20 Oct, 2020

1 commit

  • While insertion of 16k nexthops all using the same netdev ('dummy10')
    takes less than a second, deletion takes about 130 seconds:

    # time -p ip -b nexthop.batch
    real 0.29
    user 0.01
    sys 0.15

    # time -p ip link set dev dummy10 down
    real 131.03
    user 0.06
    sys 0.52

    This is because of repeated calls to synchronize_rcu() whenever a
    nexthop is removed from a nexthop group:

    # /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
    ...
    b'finish_task_switch'
    b'schedule'
    b'schedule_timeout'
    b'wait_for_completion'
    b'__wait_rcu_gp'
    b'synchronize_rcu.part.0'
    b'synchronize_rcu'
    b'__remove_nexthop'
    b'remove_nexthop'
    b'nexthop_flush_dev'
    b'nh_netdev_event'
    b'raw_notifier_call_chain'
    b'call_netdevice_notifiers_info'
    b'__dev_notify_flags'
    b'dev_change_flags'
    b'do_setlink'
    b'__rtnl_newlink'
    b'rtnl_newlink'
    b'rtnetlink_rcv_msg'
    b'netlink_rcv_skb'
    b'rtnetlink_rcv'
    b'netlink_unicast'
    b'netlink_sendmsg'
    b'____sys_sendmsg'
    b'___sys_sendmsg'
    b'__sys_sendmsg'
    b'__x64_sys_sendmsg'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    - ip (277)
    126554955

    Since nexthops are always deleted under RTNL, synchronize_net() can be
    used instead. It will call synchronize_rcu_expedited() which only blocks
    for several microseconds as opposed to multiple milliseconds like
    synchronize_rcu().

    With this patch deletion of 16k nexthops takes less than a second:

    # time -p ip link set dev dummy10 down
    real 0.12
    user 0.00
    sys 0.04

    Tested with fib_nexthops.sh which includes torture tests that prompted
    the initial change:

    # ./fib_nexthops.sh
    ...
    Tests passed: 134
    Tests failed: 0

    Fixes: 90f33bffa382 ("nexthops: don't modify published nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jesse Brandeburg
    Reviewed-by: David Ahern
    Acked-by: Nikolay Aleksandrov
    Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
    Signed-off-by: Jakub Kicinski

    Ido Schimmel
     

17 Oct, 2020

1 commit

  • Keyu Man reported that the ICMP rate limiter could be used
    by attackers to get useful signal. Details will be provided
    in an upcoming academic publication.

    Our solution is to add some noise, so that the attackers
    no longer can get help from the predictable token bucket limiter.

    Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
    Signed-off-by: Eric Dumazet
    Reported-by: Keyu Man
    Signed-off-by: Jakub Kicinski

    Eric Dumazet