06 Sep, 2019

9 commits

  • [ Upstream commit e858faf556d4e14c750ba1e8852783c6f9520a0e ]

    If an app is playing tricks to reuse a socket via tcp_disconnect(),
    bytes_acked/received needs to be reset to 0. Otherwise tcp_info will
    report the sum of the current and the old connection.

    Cc: Eric Dumazet
    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Fixes: bdd1f9edacb5 ("tcp: add tcpi_bytes_received to tcp_info")
    Signed-off-by: Christoph Paasch
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 1b200acde418f4d6d87279d3f6f976ebf188f272)

    Christoph Paasch
     
  • [ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]

    Neal reported incorrect use of ns_capable() from bpf hook.

    bpf_setsockopt(...TCP_CONGESTION...)
    -> tcp_set_congestion_control()
    -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
    -> current_cred()
    -> rcu_dereference_protected(current->cred, 1)

    Accessing 'current' in bpf context makes no sense, since packets
    are processed from softirq context.

    As Neal stated : The capability check in tcp_set_congestion_control()
    was written assuming a system call context, and then was reused from
    a BPF call site.

    The fix is to add a new parameter to tcp_set_congestion_control(),
    so that the ns_capable() call is only performed under the right
    context.

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Eric Dumazet
    Cc: Lawrence Brakmo
    Reported-by: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit c60f57dfe995172c2f01e59266e3ffa3419c6cd9)

    Eric Dumazet
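A minimal user-space sketch of the fix described above, assuming hypothetical names throughout: `caller_has_net_admin()` stands in for the `ns_capable()` check, and `set_congestion_control()` stands in for `tcp_set_congestion_control()` with its new parameter. It only illustrates moving the capability decision to the caller; it is not the kernel code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for ns_capable(); it is only meaningful when a
 * real task is making a system call. Here the task is unprivileged. */
bool caller_has_net_admin(void)
{
    return false;
}

/* Illustrative setter: 'cap_net_admin' is supplied by the caller instead
 * of being derived from the current task inside the function. */
int set_congestion_control(bool restricted_algo, bool cap_net_admin)
{
    if (restricted_algo && !cap_net_admin)
        return -EPERM;
    return 0;
}

/* System call path: the capability check runs in process context. */
int setsockopt_path(bool restricted_algo)
{
    return set_congestion_control(restricted_algo, caller_has_net_admin());
}

/* BPF path (assumed behaviour): there is no task context to inspect,
 * so the caller passes the decision explicitly. */
int bpf_path(bool restricted_algo)
{
    return set_congestion_control(restricted_algo, true);
}
```

The design point is that the callee no longer touches task credentials, so it is safe to call from a context where 'current' is meaningless.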
     
  • [ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]

    Some applications set tiny SO_SNDBUF values and expect
    TCP to just work. Recent patches to address CVE-2019-11478
    broke them in case of losses, since retransmits might
    be prevented.

    We should allow these flows to make progress.

    This patch allows the first and last skb in retransmit queue
    to be split even if memory limits are hit.

    It also adds some room due to the fact that tcp_sendmsg()
    and tcp_sendpage() might overshoot sk_wmem_queued by about one full
    TSO skb (64KB size). Note this allowance was already present
    in stable backports for kernels < 4.15.

    Note for < 4.15 backports :
    tcp_rtx_queue_tail() will probably look like :

    static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
    {
            struct sk_buff *skb = tcp_send_head(sk);

            return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
    }

    Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrew Prout
    Tested-by: Andrew Prout
    Tested-by: Jonathan Lemon
    Tested-by: Michal Kubecek
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Christoph Paasch
    Cc: Jonathan Looney
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 6323c238bb4374d1477348cfbd5854f2bebe9a21)

    Eric Dumazet
     
  • commit b6653b3629e5b88202be3c9abc44713973f5c4b4 upstream.

    tcp_fragment() might be called for skbs in the write queue.

    Memory limits might have been exceeded because tcp_sendmsg() only
    checks limits at full skb (64KB) boundaries.

    Therefore, we need to make sure tcp_fragment() won't punish applications
    that might have set up very low SO_SNDBUF values.

    Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
    Signed-off-by: Eric Dumazet
    Reported-by: Christoph Paasch
    Tested-by: Christoph Paasch
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    (cherry picked from commit dad3a9314ac95dedc007bc7dacacb396ea10e376)

    Eric Dumazet
     
  • commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream.

    If mtu probing is enabled tcp_mtu_probing() could very well end up
    with a too small MSS.

    Use the new sysctl tcp_min_snd_mss to make sure MSS search
    is performed in an acceptable range.

    CVE-2019-11479 -- tcp mss hardcoded to 48

    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Lemon
    Cc: Jonathan Looney
    Acked-by: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Tyler Hicks
    Cc: Bruce Curtis
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 59222807fcc99951dc769cd50e132e319d73d699)

    Eric Dumazet
     
  • commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream.

    Some TCP peers announce a very small MSS option in their SYN and/or
    SYN/ACK messages.

    This forces the stack to send packets with a very high network/cpu
    overhead.

    Linux has enforced a minimal value of 48. Since this value includes
    the size of TCP options, and that the options can consume up to 40
    bytes, this means that each segment can include only 8 bytes of payload.

    In some cases, it can be useful to increase the minimal value
    to a saner value.

    We still leave the default at 48 (TCP_MIN_SND_MSS), for compatibility
    reasons.

    Note that TCP_MAXSEG socket option enforces a minimal value
    of (TCP_MIN_MSS). David Miller increased this minimal value
    in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.")
    from 64 to 88.

    We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

    CVE-2019-11479 -- tcp mss hardcoded to 48

    Signed-off-by: Eric Dumazet
    Suggested-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Tyler Hicks
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 7f9f8a37e563c67b24ccd57da1d541a95538e8d9)

    Eric Dumazet
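The arithmetic in the message above (an MSS of 48 that includes up to 40 bytes of options) can be checked with a short sketch. The function and macro names are illustrative assumptions, not kernel identifiers.

```c
#include <stdint.h>

#define MIN_SND_MSS     48u  /* minimal MSS stated in the commit message */
#define MAX_TCP_OPT_LEN 40u  /* maximal TCP option space stated in the message */

/* Payload bytes left per segment once options are subtracted from the MSS. */
uint32_t payload_per_segment(uint32_t mss, uint32_t opt_len)
{
    return mss > opt_len ? mss - opt_len : 0;
}

/* Segments needed to carry 'len' bytes, rounding up.
 * Returns 0 when no payload fits in a segment. */
uint32_t segments_needed(uint32_t len, uint32_t payload)
{
    if (payload == 0)
        return 0;
    return (len + payload - 1) / payload;
}
```

With these values, `payload_per_segment(MIN_SND_MSS, MAX_TCP_OPT_LEN)` is 8, so sending 64KB takes 8192 segments, which is the overhead the new sysctl lets an administrator avoid.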
     
  • commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream.

    Jonathan Looney reported that a malicious peer can force a sender
    to fragment its retransmit queue into tiny skbs, inflating memory
    usage and/or overflowing 32bit counters.

    TCP allows an application to queue up to sk_sndbuf bytes,
    so we need to give some allowance for non malicious splitting
    of retransmit queue.

    A new SNMP counter is added to monitor how many times TCP
    refused to split an skb because the allowance was exceeded.

    Note that this counter might increase in the case applications
    use SO_SNDBUF socket option to lower sk_sndbuf.

    CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
    socket is already using more than half the allowed space

    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Reviewed-by: Tyler Hicks
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit ec83921899a571ad70d582934ee9e3e07f478848)

    Eric Dumazet
     
  • commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.

    Jonathan Looney reported that TCP can trigger the following crash
    in tcp_shifted_skb() :

    BUG_ON(tcp_skb_pcount(skb) < pcount);

    This can happen if the remote peer has advertised the smallest
    MSS that linux TCP accepts : 48

    An skb can hold 17 fragments, and each fragment can hold 32KB
    on x86, or 64KB on PowerPC.

    This means that the 16bit width of TCP_SKB_CB(skb)->tcp_gso_segs
    can overflow.

    Note that tcp_sendmsg() builds skbs with less than 64KB
    of payload, so this problem needs SACK to be enabled.
    SACK blocks allow TCP to coalesce multiple skbs in the retransmit
    queue, thus filling the 17 fragments to maximal capacity.

    CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

    Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Reviewed-by: Tyler Hicks
    Cc: Yuchung Cheng
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit c09be31461ed140976c60a87364415454a2c3d42)

    Eric Dumazet
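A worked version of the overflow, as a sketch. The fragment count and fragment size come from the message above; the 8-byte payload per segment is an assumption taken from the related CVE-2019-11479 commit in this list (MSS 48 minus 40 bytes of options). All names are illustrative.

```c
#include <stdint.h>

#define MAX_FRAGS   17u            /* fragments per skb, per the commit message */
#define FRAG_SIZE   (32u * 1024u)  /* 32KB per fragment on x86 */
#define MIN_PAYLOAD 8u             /* assumed: MSS 48 minus 40 bytes of options */

/* Largest segment count a fully coalesced skb could represent. */
uint32_t max_segments(void)
{
    return MAX_FRAGS * FRAG_SIZE / MIN_PAYLOAD;
}

/* What a 16-bit counter would hold after truncation. */
uint16_t as_u16(uint32_t v)
{
    return (uint16_t)v;
}
```

Under these assumptions `max_segments()` is 69632, which exceeds the 65535 a 16-bit field can hold, so the stored count wraps to 4096.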
     
  • [ Upstream commit 50ce163a72d817a99e8974222dcf2886d5deb1ae ]

    For some reason, tcp_grow_window() correctly tests if enough room
    is present before attempting to increase tp->rcv_ssthresh,
    but does not prevent it from growing past tcp_space().

    This is causing hard to debug issues, like failing
    the (__tcp_select_window(sk) >= tp->rcv_wnd) test
    in __tcp_ack_snd_check(), causing ACK delays and possibly
    slow flows.

    Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
    we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
    after about 60 round trips, when the active side no longer sends
    immediate acks.

    This bug predates git history.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 6728c6174a47b8a04ceec89aca9e1195dee7ff6b)

    Eric Dumazet
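The clamping idea can be sketched in user space as follows. This is a simplified model with assumed parameter names (`window_clamp` and `space` stand in for the socket's window clamp and the value of tcp_space()), not the kernel function.

```c
/* Smaller of two ints. */
int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Grow rcv_ssthresh by at most 'incr', but never past the available room,
 * where room is bounded by both the window clamp and the free space. */
int grow_ssthresh(int rcv_ssthresh, int incr, int window_clamp, int space)
{
    int room = min_int(window_clamp, space) - rcv_ssthresh;

    if (room > 0 && incr > 0)
        rcv_ssthresh += min_int(room, incr);
    return rcv_ssthresh;
}
```

For example, with a threshold of 1000, an increment of 500 and only 1200 bytes of space, the result stops at 1200 rather than 1500.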
     

02 May, 2019

2 commits

  • Add security setting check for socket interface since
    stack will check the return value.

    Signed-off-by: Fugang Duan
    Signed-off-by: Marcel Holtmann
    Signed-off-by: Arulpandiyan Vadivel
    Signed-off-by: Shrikant Bobade
    (cherry picked from commit a23098e20ce214d134cc3cef3ad51a6bba13b99a)

    Fugang Duan
     
  • The commit ae97fd867aa3 ("MLK-19091 cfg80211: make phy index match
    after wiphy dev is released") keeps wiphy_counter matched between
    creating and freeing a wiphy device, so that for a single wifi instance the index
    of the attached phy does not change during load/unload tests. But it ignores
    the load/unload test case with multiple wifi cards, where it introduces a phy index
    conflict. So this patch reverts the commit.

    Reviewed-by: Richard Zhu
    Signed-off-by: Fugang Duan
    Signed-off-by: Arulpandiyan Vadivel
    Signed-off-by: Shrikant Bobade
    (cherry picked from commit 4bfe854b650f1e8bc46624d5f3b559f0112f327a)

    Andy Duan
     

18 Apr, 2019

2 commits

  • During insmod/rmmod tests, the phy index keeps increasing, which causes trouble
    for the test case. To keep the global variable wiphy_counter matched between
    creating and freeing a wiphy device, the atomic counter needs to be decreased
    when the wiphy device is freed.

    Reviewed-by: Richard Zhu
    Signed-off-by: Fugang Duan
    Signed-off-by: Vipul Kumar

    Andy Duan
     
  • [Patch] Pulling the following commits and some general changes
    from custom v3.10 kernel for supporting qcacld2.0 on kernel v4.9.11.
    1. cfg80211: Using new wiphy flag WIPHY_FLAG_DFS_OFFLOAD
    When flag WIPHY_FLAG_DFS_OFFLOAD is defined, the driver would handle
    all the DFS related operations. Therefore the kernel needs to ignore
    the DFS state that it uses to block the userspace calls to the driver
    through cfg80211 APIs. Also it should treat the userspace calls to
    start radar detection as a no-op.

    Please note that the changes in util.c are not picked up explicitly.
    Kernel v4.9.11 uses wrapper cfg80211_get_chans_dfs_required which takes
    care of this change.

    Change-Id: I9dd2076945581ca67e54dfc96dd3dbc526c6f0a2
    IRs-Fixed: 202686

    2. New db.txt from git/sforshee/wireless-regdb.git
    CONFIG_CFG80211_INTERNAL_REGDB is enabled in build. This causes
    kernel warn messages as db.txt is empty. A new db.txt is added
    from:
    git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/wireless-regdb.git

    IRs-Fixed: 202686

    3. Picked up the declaration and definition of the function
    cfg80211_is_gratuitous_arp_unsolicited_na

    Change-Id: I1e4083a2327c121073226aa6b75bb6b5b97cec00
    CRs-fixed: 1079453

    Signed-off-by: Nakul Kachhwaha
    Signed-off-by: Fugang Duan
    (Vipul: Fixed merge conflicts)
    (TODO: checkpatch warnings)
    Signed-off-by: Vipul Kumar

    Nakul Kachhwaha
     

17 Apr, 2019

20 commits

  • commit 89259088c1b7fecb43e8e245dc931909132a4e03 upstream

    syzbot was able to trigger the WARN in cttimeout_default_get() by
    passing UDPLITE as l4protocol. Alias UDPLITE to UDP, both use
    same timeout values.

    Furthermore, also fetch GRE timeouts. GRE is a bit more complicated,
    as it still can be a module and its netns_proto_gre struct layout isn't
    visible outside of the gre module. Can't move timeouts around, it
    appears conntrack sysctl unregister assumes net_generic() returns
    nf_proto_net, so we get crash. Expose layout of netns_proto_gre instead.

    A followup nf-next patch could make the gre tracker built-in as well
    if needed, it's not that large.

    Last, make the WARN() mention the missing protocol value in case
    anything else is missing.

    Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com
    Fixes: 8866df9264a3 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr")
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Zubin Mithra
    Signed-off-by: Sasha Levin (Microsoft)

    Florian Westphal
     
  • commit 8866df9264a34e675b4ee8a151db819b87cce2d3 upstream

    Otherwise, we hit a NULL pointer dereference since handlers always assume
    default timeout policy is passed.

    netlink: 24 bytes leftover after parsing attributes in process `syz-executor2'.
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9575 Comm: syz-executor1 Not tainted 4.19.0+ #312
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:icmp_timeout_obj_to_nlattr+0x77/0x170 net/netfilter/nf_conntrack_proto_icmp.c:297

    Fixes: c779e849608a ("netfilter: conntrack: remove get_timeout() indirection")
    Reported-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Zubin Mithra
    Signed-off-by: Sasha Levin (Microsoft)

    Pablo Neira Ayuso
     
  • [ Upstream commit 9a5a90d167b0e5fe3d47af16b68fd09ce64085cd ]

    __netif_receive_skb_list_ptype() leaves skb->next poisoned before passing
    it to the pt_prev->func handler, which may produce (in certain cases, e.g. DSA
    setup) crashes like:

    [ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 0000000e, epc == 80687078, ra == 8052cc7c
    [ 88.618666] Oops[#1]:
    [ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473
    [ 88.630885] $ 0 : 00000000 10000400 00000002 864d7850
    [ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 00000000
    [ 88.642526] $ 8 : 00000000 49600000 00000001 00000001
    [ 88.648342] $12 : 00000000 c288617b dadbee27 25d17c41
    [ 88.654159] $16 : 87c0ddf0 85cff080 80790000 fffffffd
    [ 88.659975] $20 : 80797b20 ffffffff 00000001 864d7800
    [ 88.665793] $24 : 00000000 8011e658
    [ 88.671609] $28 : 80790000 87c0dbc0 87cabf00 8052cc7c
    [ 88.677427] Hi : 00000003
    [ 88.680622] Lo : 7b5b4220
    [ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0
    [ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188
    [ 88.696734] Status: 10000404 IEp
    [ 88.700422] Cause : 50000008 (ExcCode 02)
    [ 88.704874] BadVA : 0000000e
    [ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi))
    [ 88.713005] Modules linked in:
    [ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000)
    [ 88.725219] Stack : 85f61c28 00000000 0000000e 80780000 87c0ddf0 85cff080 80790000 8052cc7c
    [ 88.734529] 87cabf00 00000000 00000001 85f5fb40 807b0000 864d7850 87cabf00 807d0000
    [ 88.743839] 864d7800 8655f600 00000000 85cff080 87c1c000 0000006a 00000000 8052d96c
    [ 88.753149] 807a0000 8057adb8 87c0dcc8 87c0dc50 85cfff08 00000558 87cabf00 85f58c50
    [ 88.762460] 00000002 85f58c00 864d7800 80543308 fffffff4 00000001 85f58c00 864d7800
    [ 88.771770] ...
    [ 88.774483] Call Trace:
    [ 88.777199] vlan_dev_hard_start_xmit+0x1c/0x1a0
    [ 88.783504] dev_hard_start_xmit+0xac/0x188
    [ 88.789326] __dev_queue_xmit+0x6e8/0x7d4
    [ 88.794955] ip_finish_output2+0x238/0x4d0
    [ 88.800677] ip_output+0xc8/0x140
    [ 88.805526] ip_forward+0x364/0x560
    [ 88.810567] ip_rcv+0x48/0xe4
    [ 88.815030] __netif_receive_skb_one_core+0x44/0x58
    [ 88.821635] dsa_switch_rcv+0x108/0x1ac
    [ 88.827067] __netif_receive_skb_list_core+0x228/0x26c
    [ 88.833951] netif_receive_skb_list+0x1d4/0x394
    [ 88.840160] lunar_rx_poll+0x38c/0x828
    [ 88.845496] net_rx_action+0x14c/0x3cc
    [ 88.850835] __do_softirq+0x178/0x338
    [ 88.856077] irq_exit+0xbc/0x100
    [ 88.860846] plat_irq_dispatch+0xc0/0x144
    [ 88.866477] handle_int+0x14c/0x158
    [ 88.871516] r4k_wait+0x30/0x40
    [ 88.876462] Code: afb10014 8c8200a0 00803025 94a20468 00000000 10620042 00a08025 9605046a
    [ 88.887332]
    [ 88.888982] ---[ end trace eb863d007da11cf1 ]---
    [ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt
    [ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

    Fix this by pulling skb off the sublist and zeroing skb->next pointer
    before calling ptype callback.

    Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup")
    Reviewed-by: Edward Cree
    Signed-off-by: Alexander Lobakin
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Alexander Lobakin
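The fix above, detaching a buffer from its sublist and clearing its next pointer before handing it to a callback, can be sketched with a plain singly linked list. `struct pkt` and `pkt_list_pop()` are hypothetical stand-ins for the skb and the kernel's list helpers.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb on a singly linked sublist. */
struct pkt {
    struct pkt *next;
    int id;
};

/* Detach the first packet from the list and clear its next pointer, so a
 * handler receiving it never follows a stale or poisoned link. */
struct pkt *pkt_list_pop(struct pkt **head)
{
    struct pkt *p = *head;

    if (!p)
        return NULL;
    *head = p->next;
    p->next = NULL;
    return p;
}
```

A handler that later treats the packet's next pointer as a real list link, as the transmit path in the trace above appears to do, then sees NULL instead of a poison value.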
     
  • [ Upstream commit 2a3cabae4536edbcb21d344e7aa8be7a584d2afb ]

    erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
    the erspan header. This can cause a use-after-free when accessing the
    pkt_md pointer in ip6erspan_rcv, since the packet will be 'uncloned'
    by running pskb_expand_head if it is a cloned gso skb (e.g. if the packet has
    been sent through a veth device). Fix it by resetting the pkt_md pointer after
    __iptunnel_pull_header.

    Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Lorenzo Bianconi
     
  • [ Upstream commit 492b67e28ee5f2a2522fb72e3d3bcb990e461514 ]

    erspan tunnels run __iptunnel_pull_header on received skbs to remove
    the gre and erspan headers. This can cause a use-after-free when
    accessing the pkt_md pointer in erspan_rcv, since the packet will be 'uncloned'
    by running pskb_expand_head if it is a cloned gso skb (e.g. if the packet has
    been sent through a veth device). Fix it by resetting the pkt_md pointer after
    __iptunnel_pull_header.

    Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Lorenzo Bianconi
     
  • [ Upstream commit 8c83f2df9c6578ea4c5b940d8238ad8a41b87e9e ]

    Configuration check to accept source route IP options should be made on
    the incoming netdevice when the skb->dev is an l3mdev master. The route
    lookup for the source route next hop also needs the incoming netdev.

    v2->v3:
    - Simplify by passing the original netdevice down the stack (per David
    Ahern).

    Signed-off-by: Stephen Suryaputra
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Stephen Suryaputra
     
  • [ Upstream commit b506bc975f60f06e13e74adb35e708a23dc4e87c ]

    When tcp_sk_init() fails in inet_ctl_sock_create(),
    'net->ipv4.tcp_congestion_control' is left
    uninitialized, but tcp_sk_exit() doesn't check for
    that.

    This patch adds a check on 'net->ipv4.tcp_congestion_control'
    in tcp_sk_exit() to prevent a NULL-ptr dereference.

    Fixes: 6670e1524477 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
    Signed-off-by: Dust Li
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Dust Li
     
  • [ Upstream commit aecfde23108b8e637d9f5c5e523b24fb97035dc3 ]

    RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to
    loss episodes in the same way as conventional TCP".

    Currently, Linux DCTCP performs no cwnd reduction when losses
    are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets
    alpha to its maximal value if a RTO happens. This behavior
    is sub-optimal for at least two reasons: i) it ignores losses
    triggering fast retransmissions; and ii) it causes unnecessary large
    cwnd reduction in the future if the loss was isolated as it resets
    the historical term of DCTCP's alpha EWMA to its maximal value (i.e.,
    denoting a total congestion). The second reason has an especially
    noticeable effect when using DCTCP in high BDP environments, where
    alpha normally stays at low values.

    This patch replaces the clamping of alpha by setting ssthresh to
    half of cwnd for both fast retransmissions and RTOs, at most once
    per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter
    has been removed.

    The table below shows experimental results where we measured the
    drop probability of a PIE AQM (not applying ECN marks) at a
    bottleneck in the presence of a single TCP flow with either the
    alpha-clamping option enabled or the cwnd halving proposed by this
    patch. Results using reno or cubic are given for comparison.

                      |  Link   |   RTT    |    Drop
           TCP CC     |  speed  | base+AQM | probability
    ==================|=========|==========|============
                CUBIC |  40Mbps |  7+20ms  |    0.21%
                 RENO |         |          |    0.19%
    DCTCP-CLAMP-ALPHA |         |          |   25.80%
     DCTCP-HALVE-CWND |         |          |    0.22%
    ------------------|---------|----------|------------
                CUBIC | 100Mbps |  7+20ms  |    0.03%
                 RENO |         |          |    0.02%
    DCTCP-CLAMP-ALPHA |         |          |   23.30%
     DCTCP-HALVE-CWND |         |          |    0.04%
    ------------------|---------|----------|------------
                CUBIC | 800Mbps |   1+1ms  |    0.04%
                 RENO |         |          |    0.05%
    DCTCP-CLAMP-ALPHA |         |          |   18.70%
     DCTCP-HALVE-CWND |         |          |    0.06%

    We see that, without halving its cwnd for all sources of loss,
    DCTCP drives the AQM to large drop probabilities in order to keep
    the queue length under control (i.e., it repeatedly faces RTOs).
    Instead, if DCTCP reacts to all sources of loss, it can then be
    controlled by the AQM using drop levels similar to those of cubic or reno.

    Signed-off-by: Koen De Schepper
    Signed-off-by: Olivier Tilmans
    Cc: Bob Briscoe
    Cc: Lawrence Brakmo
    Cc: Florian Westphal
    Cc: Daniel Borkmann
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Andrew Shewmaker
    Cc: Glenn Judd
    Acked-by: Florian Westphal
    Acked-by: Neal Cardwell
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Koen De Schepper
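The loss reaction described above (ssthresh set to half of cwnd) can be sketched as below. The function name is illustrative, and the floor of 2 segments is an assumption borrowed from conventional TCP behaviour; the message itself only states the halving. The "at most once per RTT" condition is not modelled.

```c
#include <stdint.h>

/* Slow-start threshold after a loss: half the congestion window,
 * with an assumed minimum of 2 segments. */
uint32_t ssthresh_on_loss(uint32_t cwnd)
{
    uint32_t half = cwnd >> 1;

    return half > 2 ? half : 2;
}
```

A window of 100 segments yields a threshold of 50, while very small windows are held at the assumed floor.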
     
  • [ Upstream commit 09279e615c81ce55e04835970601ae286e3facbe ]

    Syzbot reports a kernel-infoleak:

    BUG: KMSAN: kernel-infoleak in _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
    Call Trace:
    _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
    copy_to_user include/linux/uaccess.h:174 [inline]
    sctp_getsockopt_peer_addrs net/sctp/socket.c:5911 [inline]
    sctp_getsockopt+0x1668e/0x17f70 net/sctp/socket.c:7562
    ...
    Uninit was stored to memory at:
    sctp_transport_init net/sctp/transport.c:61 [inline]
    sctp_transport_new+0x16d/0x9a0 net/sctp/transport.c:115
    sctp_assoc_add_peer+0x532/0x1f70 net/sctp/associola.c:637
    sctp_process_param net/sctp/sm_make_chunk.c:2548 [inline]
    sctp_process_init+0x1a1b/0x3ed0 net/sctp/sm_make_chunk.c:2361
    ...
    Bytes 8-15 of 16 are uninitialized

    It was caused by the _pad field (bytes 8-15) of a v4 addr (saved in
    struct sockaddr_in) not being initialized before being copied directly to user memory
    in sctp_getsockopt_peer_addrs().

    So fix it by calling memset(addr->v4.sin_zero, 0, 8) to initialize _pad of
    sockaddr_in before copying it to user memory in sctp_v4_addr_to_user(), as
    sctp_v6_addr_to_user() does.

    Reported-by: syzbot+86b5c7c236a22616a72f@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Tested-by: Alexander Potapenko
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Xin Long
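A user-space sketch of the same idea, using the standard `struct sockaddr_in` from `<netinet/in.h>`: clear the padding bytes before the structure is copied anywhere. `v4_addr_prepare()` and `pad_is_zero()` are hypothetical helper names, not SCTP functions.

```c
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>

/* Zero the padding of an IPv4 socket address so no stale bytes leak
 * when the whole structure is later copied out. */
void v4_addr_prepare(struct sockaddr_in *sin)
{
    memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
}

/* Returns 1 when every padding byte is zero, 0 otherwise. */
int pad_is_zero(const struct sockaddr_in *sin)
{
    size_t i;

    for (i = 0; i < sizeof(sin->sin_zero); i++)
        if (sin->sin_zero[i] != 0)
            return 0;
    return 1;
}
```

Using `sizeof(sin->sin_zero)` instead of a literal 8 keeps the call correct even if the padding size differs on another platform.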
     
  • [ Upstream commit f28cd2af22a0c134e4aa1c64a70f70d815d473fb ]

    The flow action buffer can be resized if it's not big enough to contain
    all the requested flow actions. However, this resize doesn't take into
    account the new requested size, the buffer is only increased by a factor
    of 2x. This might not be enough to contain the new data, causing a
    buffer overflow, for example:

    [ 42.044472] =============================================================================
    [ 42.045608] BUG kmalloc-96 (Not tainted): Redzone overwritten
    [ 42.046415] -----------------------------------------------------------------------------

    [ 42.047715] Disabling lock debugging due to kernel taint
    [ 42.047716] INFO: 0x8bf2c4a5-0x720c0928. First byte 0x0 instead of 0xcc
    [ 42.048677] INFO: Slab 0xbc6d2040 objects=29 used=18 fp=0xdc07dec4 flags=0x2808101
    [ 42.049743] INFO: Object 0xd53a3464 @offset=2528 fp=0xccdcdebb

    [ 42.050747] Redzone 76f1b237: cc cc cc cc cc cc cc cc ........
    [ 42.051839] Object d53a3464: 6b 6b 6b 6b 6b 6b 6b 6b 0c 00 00 00 6c 00 00 00 kkkkkkkk....l...
    [ 42.053015] Object f49a30cc: 6c 00 0c 00 00 00 00 00 00 00 00 03 78 a3 15 f6 l...........x...
    [ 42.054203] Object acfe4220: 20 00 02 00 ff ff ff ff 00 00 00 00 00 00 00 00 ...............
    [ 42.055370] Object 21024e91: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 42.056541] Object 070e04c3: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 42.057797] Object 948a777a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 42.059061] Redzone 8bf2c4a5: 00 00 00 00 ....
    [ 42.060189] Padding a681b46e: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ

    Fix by making sure the new buffer is properly resized to contain all the
    requested data.

    BugLink: https://bugs.launchpad.net/bugs/1813244
    Signed-off-by: Andrea Righi
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Andrea Righi
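The sizing rule implied by the fix can be sketched as follows: grow to whichever is larger, double the current capacity or the space actually needed. The function and parameter names are assumptions for illustration and do not come from the openvswitch code.

```c
#include <stddef.h>

/* New capacity for a growable buffer: at least double the current size,
 * and always enough for the bytes in use plus the requested bytes. */
size_t new_buf_size(size_t cur_size, size_t used, size_t req)
{
    size_t doubled = cur_size * 2;
    size_t needed = used + req;

    return needed > doubled ? needed : doubled;
}
```

With a 64-byte buffer holding 60 bytes, a 200-byte request yields 260 rather than 128, which is the case where plain doubling overflowed.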
     
  • [ Upstream commit 0db6f8befc32c68bb13d7ffbb2e563c79e913e13 ]

    It always returned NULL, thus it was never possible to get the filter.

    Example:
    $ ip link add foo type dummy
    $ ip link add bar type dummy
    $ tc qdisc add dev foo clsact
    $ tc filter add dev foo protocol all pref 1 ingress handle 1234 \
    matchall action mirred ingress mirror dev bar

    Before the patch:
    $ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
    Error: Specified filter handle not found.
    We have an error talking to the kernel

    After:
    $ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
    filter ingress protocol all pref 1 matchall chain 0 handle 0x4d2
    not_in_hw
    action order 1: mirred (Ingress Mirror to device bar) pipe
    index 1 ref 1 bind 1

    CC: Yotam Gigi
    CC: Jiri Pirko
    Fixes: fd62d9f5c575 ("net/sched: matchall: Fix configuration race")
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Nicolas Dichtel
     
  • [ Upstream commit fae2708174ae95d98d19f194e03d6e8f688ae195 ]

    The control path of the 'sample' action does not validate the value of 'rate'
    provided by the user, but then it uses it as divisor in the traffic path.
    Validate it in tcf_sample_init(), and return -EINVAL with a proper extack
    message in case that value is zero, to fix a splat with the script below:

    # tc f a dev test0 egress matchall action sample rate 0 group 1 index 2
    # tc -s a s action sample
    total acts 1

    action order 0: sample rate 1/0 group 1 pipe
    index 2 ref 1 bind 1 installed 19 sec used 19 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    # ping 192.0.2.1 -I test0 -c1 -q

    divide error: 0000 [#1] SMP PTI
    CPU: 1 PID: 6192 Comm: ping Not tainted 5.1.0-rc2.diag2+ #591
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:tcf_sample_act+0x9e/0x1e0 [act_sample]
    Code: 6a f1 85 c0 74 0d 80 3d 83 1a 00 00 00 0f 84 9c 00 00 00 4d 85 e4 0f 84 85 00 00 00 e8 9b d7 9c f1 44 8b 8b e0 00 00 00 31 d2 f7 f1 85 d2 75 70 f6 85 83 00 00 00 10 48 8b 45 10 8b 88 08 01
    RSP: 0018:ffffae320190ba30 EFLAGS: 00010246
    RAX: 00000000b0677d21 RBX: ffff8af1ed9ec000 RCX: 0000000059a9fe49
    RDX: 0000000000000000 RSI: 000000000c7e33b7 RDI: ffff8af23daa0af0
    RBP: ffff8af1ee11b200 R08: 0000000074fcaf7e R09: 0000000000000000
    R10: 0000000000000050 R11: ffffffffb3088680 R12: ffff8af232307f80
    R13: 0000000000000003 R14: ffff8af1ed9ec000 R15: 0000000000000000
    FS: 00007fe9c6d2f740(0000) GS:ffff8af23da80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fff6772f000 CR3: 00000000746a2004 CR4: 00000000001606e0
    Call Trace:
    tcf_action_exec+0x7c/0x1c0
    tcf_classify+0x57/0x160
    __dev_queue_xmit+0x3dc/0xd10
    ip_finish_output2+0x257/0x6d0
    ip_output+0x75/0x280
    ip_send_skb+0x15/0x40
    raw_sendmsg+0xae3/0x1410
    sock_sendmsg+0x36/0x40
    __sys_sendto+0x10e/0x140
    __x64_sys_sendto+0x24/0x30
    do_syscall_64+0x60/0x210
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [...]
    Kernel panic - not syncing: Fatal exception in interrupt

    Add a TDC selftest to document that 'rate' is now being validated.

    Reported-by: Matteo Croce
    Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
    Signed-off-by: Davide Caratti
    Acked-by: Yotam Gigi
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Davide Caratti
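A minimal sketch of validating the divisor in the control path so the traffic path can divide safely. `sample_validate_rate()` and `should_sample()` are hypothetical names, and the modulo test is only a simplified stand-in for the action's sampling decision.

```c
#include <errno.h>
#include <stdint.h>

/* Control path: reject a zero rate before it is ever stored. */
int sample_validate_rate(uint32_t rate)
{
    if (rate == 0)
        return -EINVAL;
    return 0;
}

/* Traffic path: sample roughly one in 'rate' packets.
 * Only safe to call with a rate accepted by sample_validate_rate(). */
int should_sample(uint32_t random_value, uint32_t rate)
{
    return random_value % rate == 0;
}
```

Rejecting the value at configuration time keeps the per-packet path free of extra checks.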
     
  • [ Upstream commit cb66ddd156203daefb8d71158036b27b0e2caf63 ]

    When a net namespace is being cleaned up, rds_tcp_exit_net() calls
    rds_tcp_kill_sock(). If t_sock is NULL, it does not call
    rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
    the connection, and the worker cp_conn_w is not stopped. Afterwards the net is freed in
    net_drop_ns(), while the cp_conn_w worker rds_connect_worker() will call
    rds_tcp_conn_path_connect() and reference 'net', which has already been freed.

    In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() sets t_sock = sock before
    sock->ops->connect, but if connect() fails, it calls
    rds_tcp_restore_callbacks() and sets t_sock = NULL. If connect() always
    fails, rds_connect_worker() keeps trying to reconnect, so
    rds_tcp_kill_sock() never cancels the worker cp_conn_w or frees the
    connections.

    Therefore, the condition !tc->t_sock is not needed on the
    cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock path, because tc->t_sock is always
    NULL there, and there is no other path to cancel cp_conn_w and free the
    connection. This patch fixes that.

    rds_tcp_kill_sock():
    ...
    if (net != c_net || !tc->t_sock)
    ...
    Acked-by: Santosh Shilimkar

    ==================================================================
    BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
    net/ipv4/af_inet.c:340
    Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721

    CPU: 3 PID: 3721 Comm: kworker/u8:4 Not tainted 5.1.0 #11
    Hardware name: linux,dummy-virt (DT)
    Workqueue: krdsd rds_connect_worker
    Call trace:
    dump_backtrace+0x0/0x3c0 arch/arm64/kernel/time.c:53
    show_stack+0x28/0x38 arch/arm64/kernel/traps.c:152
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x120/0x188 lib/dump_stack.c:113
    print_address_description+0x68/0x278 mm/kasan/report.c:253
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x21c/0x348 mm/kasan/report.c:409
    __asan_report_load4_noabort+0x30/0x40 mm/kasan/report.c:429
    inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340
    __sock_create+0x4f8/0x770 net/socket.c:1276
    sock_create_kern+0x50/0x68 net/socket.c:1322
    rds_tcp_conn_path_connect+0x2b4/0x690 net/rds/tcp_connect.c:114
    rds_connect_worker+0x108/0x1d0 net/rds/threads.c:175
    process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
    worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
    kthread+0x2f0/0x378 kernel/kthread.c:255
    ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

    Allocated by task 687:
    save_stack mm/kasan/kasan.c:448 [inline]
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xd4/0x180 mm/kasan/kasan.c:553
    kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:490
    slab_post_alloc_hook mm/slab.h:444 [inline]
    slab_alloc_node mm/slub.c:2705 [inline]
    slab_alloc mm/slub.c:2713 [inline]
    kmem_cache_alloc+0x14c/0x388 mm/slub.c:2718
    kmem_cache_zalloc include/linux/slab.h:697 [inline]
    net_alloc net/core/net_namespace.c:384 [inline]
    copy_net_ns+0xc4/0x2d0 net/core/net_namespace.c:424
    create_new_namespaces+0x300/0x658 kernel/nsproxy.c:107
    unshare_nsproxy_namespaces+0xa0/0x198 kernel/nsproxy.c:206
    ksys_unshare+0x340/0x628 kernel/fork.c:2577
    __do_sys_unshare kernel/fork.c:2645 [inline]
    __se_sys_unshare kernel/fork.c:2643 [inline]
    __arm64_sys_unshare+0x38/0x58 kernel/fork.c:2643
    __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
    el0_svc_common+0x168/0x390 arch/arm64/kernel/syscall.c:83
    el0_svc_handler+0x60/0xd0 arch/arm64/kernel/syscall.c:129
    el0_svc+0x8/0xc arch/arm64/kernel/entry.S:960

    Freed by task 264:
    save_stack mm/kasan/kasan.c:448 [inline]
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:521
    kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:528
    slab_free_hook mm/slub.c:1370 [inline]
    slab_free_freelist_hook mm/slub.c:1397 [inline]
    slab_free mm/slub.c:2952 [inline]
    kmem_cache_free+0xb8/0x3a8 mm/slub.c:2968
    net_free net/core/net_namespace.c:400 [inline]
    net_drop_ns.part.6+0x78/0x90 net/core/net_namespace.c:407
    net_drop_ns net/core/net_namespace.c:406 [inline]
    cleanup_net+0x53c/0x6d8 net/core/net_namespace.c:569
    process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
    worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
    kthread+0x2f0/0x378 kernel/kthread.c:255
    ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

    The buggy address belongs to the object at ffff8003496a3f80
    which belongs to the cache net_namespace of size 7872
    The buggy address is located 1796 bytes inside of
    7872-byte region [ffff8003496a3f80, ffff8003496a5e40)
    The buggy address belongs to the page:
    page:ffff7e000d25a800 count:1 mapcount:0 mapping:ffff80036ce4b000
    index:0x0 compound_mapcount: 0
    flags: 0xffffe0000008100(slab|head)
    raw: 0ffffe0000008100 dead000000000100 dead000000000200 ffff80036ce4b000
    raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8003496a4580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8003496a4600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8003496a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8003496a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8003496a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
    Reported-by: Hulk Robot
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Mao Wenan
     
  • [ Upstream commit 355b98553789b646ed97ad801a619ff898471b92 ]

    net_hash_mix() currently uses the kernel address of a struct net,
    and is used in many places that could be used to reveal this
    address to a patient attacker, thus defeating KASLR, for
    the typical case (initial net namespace, &init_net is
    not dynamically allocated).

    I believe the original implementation tried to avoid spending
    too many cycles in this function, but security comes first.

    Also provide entropy regardless of CONFIG_NET_NS.
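
    A minimal userspace sketch of the idea, assuming a per-namespace salt
    stored at setup time. The struct layout, net_setup(), the 'entropy'
    parameter and the shift in the old-style helper are illustrative
    assumptions, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy namespace object; only the field needed for the sketch. */
struct net { uint32_t hash_mix; };

/* Address-derived salt: leaks information about where 'net' lives. */
static uint32_t addr_hash_mix(const struct net *net)
{
    return (uint32_t)((uintptr_t)net >> 6); /* shift chosen arbitrarily here */
}

/* Store a random salt once, when the namespace is set up. */
static void net_setup(struct net *net, uint32_t entropy)
{
    assert(net != NULL);
    net->hash_mix = entropy;
}

/* Salt that depends only on the stored random value, not on the address. */
static uint32_t net_hash_mix(const struct net *net)
{
    return net->hash_mix;
}
```

    In this sketch the caller supplies the random value; a real
    implementation would draw it from a suitable entropy source.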

    Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes")
    Signed-off-by: Eric Dumazet
    Reported-by: Amit Klein
    Reported-by: Benny Pinkas
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit 0ab03f353d3613ea49d1f924faf98559003670a8 ]

    Currently we may incorrectly merge a received GSO packet
    or a packet with frag_list into a packet sitting in the
    gro_hash list. skb_segment() may crash because
    the assumptions on the skb layout are not met.
    The correct behaviour would be to flush the packet in the
    gro_hash list and send the received GSO packet directly
    afterwards. Commit d61d072e87c8e ("net-gro: avoid reorders")
    sets NAPI_GRO_CB(skb)->flush in this case, but this is not
    checked before merging. This patch makes sure to check this
    flag and to not merge in that case.

    Fixes: d61d072e87c8e ("net-gro: avoid reorders")
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Steffen Klassert
     
  • [ Upstream commit 3d8830266ffc28c16032b859e38a0252e014b631 ]

    NULL or ZERO_SIZE_PTR will be returned for a zero-sized memory
    request, and dereferencing them will lead to a segfault.

    So it is unnecessary to call vzalloc for a zero-sized memory
    request, and functions which may dereference the NULL
    allocated memory should not be called.

    This also fixes a possible memory leak: if phy_ethtool_get_stats
    returns an error, the memory should be freed before exit.
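
    The two points above (skip the allocation for a zero count, free the
    buffer on the error path) can be sketched in userspace as follows.
    get_stats(), fill_stats_fn and the two fill helpers are hypothetical
    names, and calloc() stands in for vzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stats source: fills n counters, returns 0 or a negative errno. */
typedef int (*fill_stats_fn)(uint64_t *data, size_t n);

/*
 * Collect n counters. On success *out holds the buffer (NULL when n == 0)
 * and the caller frees it. On failure nothing is leaked and *out is NULL.
 */
static int get_stats(size_t n, fill_stats_fn fill, uint64_t **out)
{
    uint64_t *data;
    int ret;

    assert(out != NULL);
    *out = NULL;

    if (n == 0)
        return 0;           /* nothing to allocate, nothing to dereference */

    data = calloc(n, sizeof(*data));
    if (!data)
        return -ENOMEM;

    ret = fill(data, n);
    if (ret < 0) {
        free(data);         /* the error path must not leak the buffer */
        return ret;
    }

    *out = data;
    return 0;
}

/* Example sources used to exercise both paths. */
static int fill_ok(uint64_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] = i + 1;
    return 0;
}

static int fill_fail(uint64_t *data, size_t n)
{
    (void)data;
    (void)n;
    return -EINVAL;
}
```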

    Signed-off-by: Li RongQing
    Reviewed-by: Wang Li
    Reviewed-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Li RongQing
     
  • [ Upstream commit 3c446e6f96997f2a95bf0037ef463802162d2323 ]

    When kcm is loaded while many processes try to create a KCM socket, a
    crash occurs:
    BUG: unable to handle kernel NULL pointer dereference at 000000000000000e
    IP: mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
    PGD 8000000016ef2067 P4D 8000000016ef2067 PUD 3d6e9067 PMD 0
    Oops: 0002 [#1] SMP KASAN PTI
    CPU: 0 PID: 7005 Comm: syz-executor.5 Not tainted 4.12.14-396-default #1 SLE15-SP1 (unreleased)
    RIP: 0010:mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
    RSP: 0018:ffff88000d487a00 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 000000000000000e RCX: 1ffff100082b0719
    ...
    CR2: 000000000000000e CR3: 000000004b1bc003 CR4: 0000000000060ef0
    Call Trace:
    kcm_create+0x600/0xbf0 [kcm]
    __sock_create+0x324/0x750 net/socket.c:1272
    ...

    This is due to a race between sock_create and an unfinished
    register_pernet_device. kcm_create tries to do "net_generic(net,
    kcm_net_id)", but kcm_net_id is not initialized yet.

    So switch the order of the two to close the race.

    This can be reproduced with multiple processes doing socket(PF_KCM, ...)
    and one process doing module removal.

    Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
    Reviewed-by: Michal Kubecek
    Signed-off-by: Jiri Slaby
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Jiri Slaby
     
  • [ Upstream commit bb9bd814ebf04f579be466ba61fc922625508807 ]

    ipip6 tunnels run iptunnel_pull_header on received skbs. This can
    cause the following use-after-free when accessing the iph pointer, since
    the packet will be 'uncloned' by pskb_expand_head if it is a
    cloned gso skb (e.g. if the packet has been sent through a veth device).

    [ 706.369655] BUG: KASAN: use-after-free in ipip6_rcv+0x1678/0x16e0 [sit]
    [ 706.449056] Read of size 1 at addr ffffe01b6bd855f5 by task ksoftirqd/1/=
    [ 706.669494] Hardware name: HPE ProLiant m400 Server/ProLiant m400 Server, BIOS U02 08/19/2016
    [ 706.771839] Call trace:
    [ 706.801159] dump_backtrace+0x0/0x2f8
    [ 706.845079] show_stack+0x24/0x30
    [ 706.884833] dump_stack+0xe0/0x11c
    [ 706.925629] print_address_description+0x68/0x260
    [ 706.982070] kasan_report+0x178/0x340
    [ 707.025995] __asan_report_load1_noabort+0x30/0x40
    [ 707.083481] ipip6_rcv+0x1678/0x16e0 [sit]
    [ 707.132623] tunnel64_rcv+0xd4/0x200 [tunnel4]
    [ 707.185940] ip_local_deliver_finish+0x3b8/0x988
    [ 707.241338] ip_local_deliver+0x144/0x470
    [ 707.289436] ip_rcv_finish+0x43c/0x14b0
    [ 707.335447] ip_rcv+0x628/0x1138
    [ 707.374151] __netif_receive_skb_core+0x1670/0x2600
    [ 707.432680] __netif_receive_skb+0x28/0x190
    [ 707.482859] process_backlog+0x1d0/0x610
    [ 707.529913] net_rx_action+0x37c/0xf68
    [ 707.574882] __do_softirq+0x288/0x1018
    [ 707.619852] run_ksoftirqd+0x70/0xa8
    [ 707.662734] smpboot_thread_fn+0x3a4/0x9e8
    [ 707.711875] kthread+0x2c8/0x350
    [ 707.750583] ret_from_fork+0x10/0x18

    [ 707.811302] Allocated by task 16982:
    [ 707.854182] kasan_kmalloc.part.1+0x40/0x108
    [ 707.905405] kasan_kmalloc+0xb4/0xc8
    [ 707.948291] kasan_slab_alloc+0x14/0x20
    [ 707.994309] __kmalloc_node_track_caller+0x158/0x5e0
    [ 708.053902] __kmalloc_reserve.isra.8+0x54/0xe0
    [ 708.108280] __alloc_skb+0xd8/0x400
    [ 708.150139] sk_stream_alloc_skb+0xa4/0x638
    [ 708.200346] tcp_sendmsg_locked+0x818/0x2b90
    [ 708.251581] tcp_sendmsg+0x40/0x60
    [ 708.292376] inet_sendmsg+0xf0/0x520
    [ 708.335259] sock_sendmsg+0xac/0xf8
    [ 708.377096] sock_write_iter+0x1c0/0x2c0
    [ 708.424154] new_sync_write+0x358/0x4a8
    [ 708.470162] __vfs_write+0xc4/0xf8
    [ 708.510950] vfs_write+0x12c/0x3d0
    [ 708.551739] ksys_write+0xcc/0x178
    [ 708.592533] __arm64_sys_write+0x70/0xa0
    [ 708.639593] el0_svc_handler+0x13c/0x298
    [ 708.686646] el0_svc+0x8/0xc

    [ 708.739019] Freed by task 17:
    [ 708.774597] __kasan_slab_free+0x114/0x228
    [ 708.823736] kasan_slab_free+0x10/0x18
    [ 708.868703] kfree+0x100/0x3d8
    [ 708.905320] skb_free_head+0x7c/0x98
    [ 708.948204] skb_release_data+0x320/0x490
    [ 708.996301] pskb_expand_head+0x60c/0x970
    [ 709.044399] __iptunnel_pull_header+0x3b8/0x5d0
    [ 709.098770] ipip6_rcv+0x41c/0x16e0 [sit]
    [ 709.146873] tunnel64_rcv+0xd4/0x200 [tunnel4]
    [ 709.200195] ip_local_deliver_finish+0x3b8/0x988
    [ 709.255596] ip_local_deliver+0x144/0x470
    [ 709.303692] ip_rcv_finish+0x43c/0x14b0
    [ 709.349705] ip_rcv+0x628/0x1138
    [ 709.388413] __netif_receive_skb_core+0x1670/0x2600
    [ 709.446943] __netif_receive_skb+0x28/0x190
    [ 709.497120] process_backlog+0x1d0/0x610
    [ 709.544169] net_rx_action+0x37c/0xf68
    [ 709.589131] __do_softirq+0x288/0x1018

    [ 709.651938] The buggy address belongs to the object at ffffe01b6bd85580
    which belongs to the cache kmalloc-1024 of size 1024
    [ 709.804356] The buggy address is located 117 bytes inside of
    1024-byte region [ffffe01b6bd85580, ffffe01b6bd85980)
    [ 709.946340] The buggy address belongs to the page:
    [ 710.003824] page:ffff7ff806daf600 count:1 mapcount:0 mapping:ffffe01c4001f600 index:0x0
    [ 710.099914] flags: 0xfffff8000000100(slab)
    [ 710.149059] raw: 0fffff8000000100 dead000000000100 dead000000000200 ffffe01c4001f600
    [ 710.242011] raw: 0000000000000000 0000000000380038 00000001ffffffff 0000000000000000
    [ 710.334966] page dumped because: kasan: bad access detected

    Fix it by resetting the iph pointer after iptunnel_pull_header.

    Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap")
    Tested-by: Jianlin Shi
    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Lorenzo Bianconi
     
  • [ Upstream commit ef0efcd3bd3fd0589732b67fb586ffd3c8705806 ]

    At the beginning of ip6_fragment(), the prevhdr pointer is
    obtained from ip6_find_1stfragopt().
    However, all pointers into the skb header may change
    when skb_checksum_help() is called with
    skb->ip_summed == CHECKSUM_PARTIAL.
    The prevhdr pointer will be dangling if it is not reloaded after
    skb_checksum_help() calls __skb_linearize().

    Here, I add a variable, nexthdr_offset, to hold the offset,
    which does not change even after __skb_linearize() is called.
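
    The offset-instead-of-pointer pattern can be shown with a toy buffer in
    userspace. struct pkt, pkt_relocate() and patch_nexthdr() are hypothetical
    stand-ins made up for this sketch; pkt_relocate() merely imitates a step
    that moves the header data to new storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy packet buffer; a stand-in for an skb, not the kernel structure. */
struct pkt {
    unsigned char *head;    /* heap-allocated data */
    size_t len;
};

/* Imitates a linearize step: the data ends up in freshly allocated storage. */
static int pkt_relocate(struct pkt *p)
{
    unsigned char *copy = malloc(p->len);

    if (!copy)
        return -1;
    memcpy(copy, p->head, p->len);
    free(p->head);          /* every old pointer into the buffer is now stale */
    p->head = copy;
    return 0;
}

/*
 * Write 'value' at the position prevhdr pointed to, even though the buffer
 * is relocated in between. The position is remembered as an offset, and the
 * pointer is rebuilt from the new base afterwards.
 */
static int patch_nexthdr(struct pkt *p, unsigned char *prevhdr,
                         unsigned char value)
{
    size_t nexthdr_offset;

    assert(prevhdr >= p->head && prevhdr < p->head + p->len);
    nexthdr_offset = (size_t)(prevhdr - p->head);

    if (pkt_relocate(p) < 0)
        return -1;

    prevhdr = p->head + nexthdr_offset;     /* reload; the old value dangles */
    *prevhdr = value;
    return 0;
}
```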

    Fixes: 405c92f7a541 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
    Signed-off-by: Junwei Hu
    Reported-by: Wenhao Zhang
    Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
    Reviewed-by: Zhiqiang Liu
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Junwei Hu
     
  • [ Upstream commit b2e54b09a3d29c4db883b920274ca8dca4d9f04d ]

    The device type for ip6 tunnels is set to
    ARPHRD_TUNNEL6. However, the ip4ip6_err function
    is expecting the device type of the tunnel to be
    ARPHRD_TUNNEL. Since the device types do not
    match, the function exits and the ICMP error
    packet is not sent to the originating host. Note
    that the device type for IPv4 tunnels is set to
    ARPHRD_TUNNEL.

    Fix is to expect a tunnel device type of
    ARPHRD_TUNNEL6 instead. Now the tunnel device
    type matches and the ICMP error packet is sent
    to the originating host.

    Signed-off-by: Sheena Mira-ato
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Sheena Mira-ato
     

06 Apr, 2019

4 commits

  • [ Upstream commit 8e2f311a68494a6677c1724bdcb10bada21af37c ]

    Following command:
    iptables -D FORWARD -m physdev ...
    causes connectivity loss in some setups.

    Reason is that iptables userspace will probe kernel for the module revision
    of the physdev patch, and physdev has an artificial dependency on
    br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
    is loaded).

    This causes the "physdev" module to be loaded, which in turn enables the
    "call-iptables" infrastructure.

    Bridged packets might then get dropped by the iptables ruleset.

    The better fix would be to change the "call-iptables" defaults to 0 and
    enforce explicit setting to 1, but that breaks backwards compatibility.

    This does the next best thing: add a request_module call to checkentry.
    This way a stray '-D ... -m physdev' won't activate br_netfilter
    anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Florian Westphal
     
  • [ Upstream commit 13f5251fd17088170c18844534682d9cab5ff5aa ]

    Bridge (br_flood) or broadcast/multicast packets could clone an
    skb with an unconfirmed conntrack, which breaks the rule that an unconfirmed
    skb->_nfct is never shared. With nfqueue running on my system, the race
    can be easily reproduced with the following warning calltrace:

    [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P W 4.4.60 #7744
    [13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
    [13257.714700] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [13257.720253] [] (show_stack) from [] (dump_stack+0x94/0xa8)
    [13257.728240] [] (dump_stack) from [] (warn_slowpath_common+0x94/0xb0)
    [13257.735268] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x1c/0x24)
    [13257.743519] [] (warn_slowpath_null) from [] (__nf_conntrack_confirm+0xa8/0x618)
    [13257.752284] [] (__nf_conntrack_confirm) from [] (ipv4_confirm+0xb8/0xfc)
    [13257.761049] [] (ipv4_confirm) from [] (nf_iterate+0x48/0xa8)
    [13257.769725] [] (nf_iterate) from [] (nf_hook_slow+0x30/0xb0)
    [13257.777108] [] (nf_hook_slow) from [] (br_nf_post_routing+0x274/0x31c)
    [13257.784486] [] (br_nf_post_routing) from [] (nf_iterate+0x48/0xa8)
    [13257.792556] [] (nf_iterate) from [] (nf_hook_slow+0x30/0xb0)
    [13257.800458] [] (nf_hook_slow) from [] (br_forward_finish+0x94/0xa4)
    [13257.808010] [] (br_forward_finish) from [] (br_nf_forward_finish+0x150/0x1ac)
    [13257.815736] [] (br_nf_forward_finish) from [] (nf_reinject+0x108/0x170)
    [13257.824762] [] (nf_reinject) from [] (nfqnl_recv_verdict+0x3d8/0x420)
    [13257.832924] [] (nfqnl_recv_verdict) from [] (nfnetlink_rcv_msg+0x158/0x248)
    [13257.841256] [] (nfnetlink_rcv_msg) from [] (netlink_rcv_skb+0x54/0xb0)
    [13257.849762] [] (netlink_rcv_skb) from [] (netlink_unicast+0x148/0x23c)
    [13257.858093] [] (netlink_unicast) from [] (netlink_sendmsg+0x2ec/0x368)
    [13257.866348] [] (netlink_sendmsg) from [] (sock_sendmsg+0x34/0x44)
    [13257.874590] [] (sock_sendmsg) from [] (___sys_sendmsg+0x1ec/0x200)
    [13257.882489] [] (___sys_sendmsg) from [] (__sys_sendmsg+0x3c/0x64)
    [13257.890300] [] (__sys_sendmsg) from [] (ret_fast_syscall+0x0/0x34)

    The original code just triggered the warning but did nothing. It will
    cause the shared conntrack to move to the dying list and the packet to be
    dropped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

    - Reproduce steps:

    +----------------------------+
    |          br0(bridge)       |
    |                            |
    +-+---------+---------+------+
      | eth0|   | eth1|   | eth2|
      |     |   |     |   |     |
      +--+--+   +--+--+   +---+-+
         |         |          |
         |         |          |
      +--+-+     +-+--+    +--+-+
      | PC1|     | PC2|    | PC3|
      +----+     +----+    +----+

    iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

    ps: Our nfq userspace program will set mark on packets whose connection
    has already been processed.

    PC1 sends broadcast packets simulated by hping3:

    hping3 --rand-source --udp 192.168.1.255 -i u100

    - Broadcast racing flow chart is as follows:

    br_handle_frame
      BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
      // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
      br_handle_frame_finish
        // check if this packet is broadcast
        br_flood_forward
          br_flood
            list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
              maybe_deliver
                deliver_clone
                  skb = skb_clone(skb)
                  __br_forward
                    BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                    // queue in our nfq and received by our userspace program
                    // goto __nf_conntrack_confirm with process context on CPU 1
        br_pass_frame_up
          BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
          // goto __nf_conntrack_confirm with softirq context on CPU 0

    Because conntrack confirm can happen at both the INPUT and POSTROUTING
    stages, with NFQUEUE running, skb->_nfct with the same unconfirmed
    conntrack could race on different cores.

    This patch fixes a repeating kernel splat, now it is only displayed
    once.

    Signed-off-by: Chieh-Min Wang
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Chieh-Min Wang
     
  • [ Upstream commit be0502a3f2e94211a8809a09ecbc3a017189b8fb ]

    TCP resets cause instant transition from established to closed state
    provided the reset is in-window. Endpoints that implement RFC 5961
    require resets to match the next expected sequence number.
    RST segments that are in-window (but that do not match RCV.NXT) are
    ignored, and a "challenge ACK" is sent back.

    Main problem for conntrack is that it's a middlebox, i.e. whereas an end
    host might have ACK'd SEQ (and would thus accept an RST with this
    sequence number), conntrack might not have seen this ACK (yet).

    Therefore we can't simply flag RSTs with non-exact match as invalid.

    This updates RST processing as follows:

    1. If the connection is in a state other than ESTABLISHED, nothing is
    changed, RST is subject to normal in-window check.

    2. If the RST's sequence number exactly matches RCV.NXT, the
    connection state moves to CLOSE.

    3. The same applies if the RST sequence number aligns with a previous
    packet in the same direction.

    In all other cases, the connection remains in ESTABLISHED state.
    If the normal-in-window check passes, the timeout will be lowered
    to that of CLOSE.

    If the peer sends a challenge ack, connection timeout will be reset.

    If the challenge ACK triggers another RST (RST was valid after all),
    this 2nd RST will match expected sequence and conntrack state changes to
    CLOSE.

    If no challenge ACK is received, the connection will time out after
    CLOSE seconds (10 seconds by default), just like without this patch.
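
    The rules above can be condensed into a small decision function. This is
    a simplified model written for illustration, not the conntrack code: the
    type and function names are invented here, the window check is taken as a
    precomputed input, and every non-ESTABLISHED state is lumped into one
    value whose in-window RST is modelled as a move to CLOSE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ct_state { CT_OTHER, CT_ESTABLISHED, CT_CLOSE };

struct rst_input {
    enum ct_state state;    /* tracked state when the RST arrives */
    uint32_t seq;           /* sequence number carried by the RST */
    uint32_t rcv_nxt;       /* next sequence number the receiver expects */
    bool matches_prev_pkt;  /* RST aligns with a previous packet, same direction */
    bool in_window;         /* result of the normal in-window check */
};

struct rst_verdict {
    enum ct_state next_state;
    bool use_close_timeout; /* lower the timeout to that of CLOSE */
};

static struct rst_verdict classify_rst(const struct rst_input *in)
{
    struct rst_verdict v = { in->state, false };

    assert(in != NULL);

    if (!in->in_window)
        return v;                           /* invalid RST: nothing changes */

    if (in->state != CT_ESTABLISHED ||      /* rule 1: behaviour unchanged */
        in->seq == in->rcv_nxt ||           /* rule 2: exact match */
        in->matches_prev_pkt) {             /* rule 3: aligns with previous */
        v.next_state = CT_CLOSE;
        v.use_close_timeout = true;
        return v;
    }

    /* In-window but inexact: stay ESTABLISHED, only shorten the timeout. */
    v.use_close_timeout = true;
    return v;
}
```

    Fed with the sequence numbers used in the test case that follows, an RST
    with sequence 1010 against an expected 1001 keeps the entry ESTABLISHED
    with the short timeout, while a later RST with sequence 1001 closes it.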

    Packetdrill test case:

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1 win 64240
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4

    // Receive a segment.
    0.210 < P. 1:1001(1000) ack 1 win 46
    0.210 > . 1:1(0) ack 1001

    // Application writes 1000 bytes.
    0.250 write(4, ..., 1000) = 1000
    0.250 > P. 1:1001(1000) ack 1001

    // First reset, old sequence. Conntrack (correctly) considers this
    // invalid due to failed window validation (regardless of this patch).
    0.260 < R 2:2(0) ack 1001 win 260

    // 2nd reset, but too far ahead sequence. Same: correctly handled
    // as invalid.
    0.270 < R 99990001:99990001(0) ack 1001 win 260

    // in-window, but not exact sequence.
    // Current Linux kernels might reply with a challenge ack, and do not
    // remove connection.
    // Without this patch, conntrack state moves to CLOSE.
    // With patch, timeout is lowered like CLOSE, but connection stays
    // in ESTABLISHED state.
    0.280 < R 1010:1010(0) ack 1001 win 260

    // Expect challenge ACK
    0.281 > . 1001:1001(0) ack 1001 win 501

    // With or without this patch, RST will cause connection
    // to move to CLOSE (sequence number matches)
    // 0.282 < R 1001:1001(0) ack 1001 win 260

    // ACK
    0.300 < . 1001:1001(0) ack 1001 win 257

    // more data could be exchanged here, connection
    // is still established

    // Client closes the connection.
    0.610 < F. 1001:1001(0) ack 1001 win 260
    0.650 > . 1001:1001(0) ack 1002

    // Close the connection without reading outstanding data
    0.700 close(4) = 0

    // so one more reset. Will be deemed acceptable with patch as well:
    // connection is already closing.
    0.701 > R. 1001:1001(0) ack 1002 win 501
    // End packetdrill test case.

    With patch, this generates following conntrack events:
    [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
    [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
    [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]

    Without patch, first RST moves connection to close, whereas socket state
    does not change until FIN is received.
    [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
    [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
    [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
    [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]

    Cc: Jozsef Kadlecsik
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Florian Westphal
     
  • [ Upstream commit a9f5e78c403d2d62ade4f4c85040efc85f4049b8 ]

    Check the result of dereferencing base_chain->stats, instead of result
    of this_cpu_ptr with NULL.

    base_chain->stats may be changed to NULL when a chain is updated and a
    new NULL counter can be attached.

    And we do not need to check the return value of this_cpu_ptr: since
    base_chain->stats is from the percpu allocator if it is non-NULL,
    this_cpu_ptr returns a valid value.

    And fix two sparse errors by replacing rcu_access_pointer and
    rcu_dereference with READ_ONCE under rcu_read_lock.

    Thanks for Eric's help to finish this patch.

    Fixes: 009240940e84c1 ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Zhang Yu
    Signed-off-by: Li RongQing
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Li RongQing
     

03 Apr, 2019

3 commits

  • [ Upstream commit 064c5d6881e897077639e04973de26440ee205e6 ]

    A new mirred action is created by the tcf_mirred_init function. This
    contains a list head struct which is inserted into a global list on
    successful creation of a new action. However, after a creation, it is
    still possible to error out and call the tcf_idr_release function. This,
    in turn, calls the act_mirr cleanup function via __tcf_idr_release and
    __tcf_action_put. This cleanup function tries to delete the list entry
    which is as yet uninitialised, leading to a NULL pointer exception.

    Fix this by initialising the list entry on creation of a new action.
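
    Why initialising the list entry at creation time makes the early error
    path safe can be seen with a minimal doubly linked list in userspace.
    The helpers below imitate the shape of the kernel list API but are
    written from scratch for this sketch; struct action and its functions
    are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* An initialised entry points at itself, so unlinking it is harmless. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlinking dereferences both neighbours; NULL here would crash. */
static void list_del(struct list_head *entry)
{
    assert(entry->next != NULL && entry->prev != NULL);
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    init_list_head(entry);
}

struct action {
    int index;
    struct list_head list;
};

/* Creation initialises the entry before anything can fail. */
static void action_create(struct action *a, int index)
{
    a->index = index;
    init_list_head(&a->list);
}

/* Cleanup may run whether or not the action ever reached the global list. */
static void action_cleanup(struct action *a)
{
    list_del(&a->list);
}
```

    Without the init_list_head() call in action_create(), a zeroed entry
    would make list_del() write through a NULL neighbour pointer.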

    Bug report:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 8000000840c73067 P4D 8000000840c73067 PUD 858dcc067 PMD 0
    Oops: 0002 [#1] SMP PTI
    CPU: 32 PID: 5636 Comm: handler194 Tainted: G OE 5.0.0+ #186
    Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.3.6 06/03/2015
    RIP: 0010:tcf_mirred_release+0x42/0xa7 [act_mirred]
    Code: f0 90 39 c0 e8 52 04 57 c8 48 c7 c7 b8 80 39 c0 e8 94 fa d4 c7 48 8b 93 d0 00 00 00 48 8b 83 d8 00 00 00 48 c7 c7 f0 90 39 c0 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 d0 00
    RSP: 0018:ffffac4aa059f688 EFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff9dcd1b214d00 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff9dcd1fa165f8 RDI: ffffffffc03990f0
    RBP: ffff9dccf9c7af80 R08: 0000000000000a3b R09: 0000000000000000
    R10: ffff9dccfa11f420 R11: 0000000000000000 R12: 0000000000000001
    R13: ffff9dcd16b433c0 R14: ffff9dcd1b214d80 R15: 0000000000000000
    FS: 00007f441bfff700(0000) GS:ffff9dcd1fa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 0000000839e64004 CR4: 00000000001606e0
    Call Trace:
    tcf_action_cleanup+0x59/0xca
    __tcf_action_put+0x54/0x6b
    __tcf_idr_release.cold.33+0x9/0x12
    tcf_mirred_init.cold.20+0x22e/0x3b0 [act_mirred]
    tcf_action_init_1+0x3d0/0x4c0
    tcf_action_init+0x9c/0x130
    tcf_exts_validate+0xab/0xc0
    fl_change+0x1ca/0x982 [cls_flower]
    tc_new_tfilter+0x647/0x8d0
    ? load_balance+0x14b/0x9e0
    rtnetlink_rcv_msg+0xe3/0x370
    ? __switch_to_asm+0x40/0x70
    ? __switch_to_asm+0x34/0x70
    ? _cond_resched+0x15/0x30
    ? __kmalloc_node_track_caller+0x1d4/0x2b0
    ? rtnl_calcit.isra.31+0xf0/0xf0
    netlink_rcv_skb+0x49/0x110
    netlink_unicast+0x16f/0x210
    netlink_sendmsg+0x1df/0x390
    sock_sendmsg+0x36/0x40
    ___sys_sendmsg+0x27b/0x2c0
    ? futex_wake+0x80/0x140
    ? do_futex+0x2b9/0xac0
    ? ep_scan_ready_list.constprop.22+0x1f2/0x210
    ? ep_poll+0x7a/0x430
    __sys_sendmsg+0x47/0x80
    do_syscall_64+0x55/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 4e232818bd32 ("net: sched: act_mirred: remove dependency on rtnl lock")
    Signed-off-by: John Hurley
    Reviewed-by: Jakub Kicinski
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    John Hurley
     
  • [ Upstream commit b5f9bd15b88563b55a99ed588416881367a0ce5f ]

    ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
    stack but it never frees them. This corrupts the walker list of
    the hash table.

    This patch fixes it.

    Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
    Fixes: b6e71bdebb12 ("ila: Flush netlink command to clear xlat...")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 33872d79f5d1cbedaaab79669cc38f16097a9450 ]

    When cancelling a subscription, we have to clear the cancel bit in the
    request before iterating over any established subscriptions with memcmp.
    Otherwise no subscription will ever be found, and it will not be
    possible to explicitly unsubscribe individual subscriptions.
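
    A sketch of why the bit has to be cleared before the comparison. The
    struct layout, the SUB_CANCEL value and find_for_cancel() are assumptions
    made for illustration, not the TIPC definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SUB_CANCEL 0x04u    /* illustrative flag value */

/* Four 32-bit fields, so the struct has no padding and memcmp is exact. */
struct sub_req {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
    uint32_t filter;
};

/*
 * Return the index of the stored subscription matching a cancel request,
 * or -1. Stored subscriptions never carry the cancel bit, so it is cleared
 * in the local copy of the request before the byte-wise comparison.
 */
static int find_for_cancel(const struct sub_req *subs, size_t n,
                           struct sub_req req)
{
    assert(subs != NULL || n == 0);

    req.filter &= ~SUB_CANCEL;
    for (size_t i = 0; i < n; i++)
        if (memcmp(&subs[i], &req, sizeof(req)) == 0)
            return (int)i;
    return -1;
}
```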

    Fixes: 8985ecc7c1e0 ("tipc: simplify endianness handling in topology subscriber")
    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Erik Hugne