12 Feb, 2019

1 commit

  • [Patch] Pulling the following commits and some general changes
    from custom v3.10 kernel for supporting qcacld2.0 on kernel v4.9.11.
    1. cfg80211: Using new wiphy flag WIPHY_FLAG_DFS_OFFLOAD
    When flag WIPHY_FLAG_DFS_OFFLOAD is defined, the driver would handle
    all the DFS related operations. Therefore the kernel needs to ignore
    the DFS state that it uses to block the userspace calls to the driver
    through cfg80211 APIs. Also it should treat the userspace calls to
    start radar detection as a no-op.

    Please note that changes in util.c are not picked up explicitly.
    Kernel v4.9.11 uses wrapper cfg80211_get_chans_dfs_required which takes
    care of this change.

    Change-Id: I9dd2076945581ca67e54dfc96dd3dbc526c6f0a2
    IRs-Fixed: 202686

    2. New db.txt from git/sforshee/wireless-regdb.git
    CONFIG_CFG80211_INTERNAL_REGDB is enabled in build. This causes
    kernel warn messages as db.txt is empty. A new db.txt is added
    from:
    git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/wireless-regdb.git

    IRs-Fixed: 202686

    3. Picked up the declaration and definition of the function
    cfg80211_is_gratuitous_arp_unsolicited_na

    Change-Id: I1e4083a2327c121073226aa6b75bb6b5b97cec00
    CRs-fixed: 1079453

    Signed-off-by: Nakul Kachhwaha
    Signed-off-by: Fugang Duan

    Nakul Kachhwaha
     

07 Feb, 2019

1 commit

  • [ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]

    While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
    I ran into the issue that while l3 mode is working fine, l3s mode
    does not have any connectivity to kube-apiserver and hence all pods
    end up in Error state as well. The ipvlan master device sits on
    top of a bond device and hostns traffic to kube-apiserver (also running
    in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
    where the latter is the address of the bond0. While in l3 mode, a
    curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
    works fine from hostns, neither of them do in case of l3s. In the
    latter only a curl to https://127.0.0.1:37573 appeared to work where
    for local addresses of bond0 I saw kernel suddenly starting to emit
    ARP requests to query HW address of bond0 which remained unanswered
    and neighbor entries in INCOMPLETE state. These ARP requests only
    happen while in l3s.

    Debugging this further, I found the issue is that l3s mode is piggy-
    backing on l3 master device, and in this case local routes are using
    l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
    f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
    if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
    a loopback"). I found that reverting them back into using the
    net->loopback_dev fixed ipvlan l3s connectivity and got everything
    working for the CNI.

    Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
    l3mdev paper in [0] the sole reason why ipvlan l3s is relying
    on l3 master device is to get the l3mdev_ip_rcv() receive hook for
    setting the dst entry of the input route without adding its own
    ipvlan specific hacks into the receive path, however, any l3 domain
    semantics beyond just that are breaking l3s operation. Note that
    ipvlan also has the ability to dynamically switch its internal
    operation from l3 to l3s for all ports via ipvlan_set_port_mode()
    at runtime. In any case, l3 vs l3s solely distinguishes itself by
    'de-confusing' netfilter through switching skb->dev to ipvlan slave
    device late in NF_INET_LOCAL_IN before handing the skb to L4.

    Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
    if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
    without any additional l3mdev semantics on top. This should also have
    minimal impact since dev->priv_flags is already hot in cache. With
    this set, l3s mode is working fine and I also get things like
    masquerading pod traffic on the ipvlan master properly working.
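
    Illustration (not part of the original commit): a minimal userspace
    sketch of the idea, assuming two hypothetical flag bits. A device with
    only the RX-handler flag gets the receive hook, but its local routes are
    not switched to l3mdev semantics. All names below are invented for the
    sketch and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel keeps its real flags in dev->priv_flags. */
#define MODEL_IS_L3_MASTER   (1u << 0)
#define MODEL_L3_RX_HANDLER  (1u << 1)

struct netdev_model {
    unsigned int priv_flags;
};

/* The receive hook runs for full L3 masters and for devices that only
 * asked for the RX handler. */
static bool runs_l3_rx_hook(const struct netdev_model *dev)
{
    assert(dev != NULL);
    return (dev->priv_flags & (MODEL_IS_L3_MASTER | MODEL_L3_RX_HANDLER)) != 0;
}

/* Only full L3 masters get the remaining l3mdev semantics, such as local
 * routes pointing at the master instead of the loopback device. */
static bool local_routes_use_l3mdev(const struct netdev_model *dev)
{
    assert(dev != NULL);
    return (dev->priv_flags & MODEL_IS_L3_MASTER) != 0;
}
```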

    [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

    Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
    Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
    Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
    Signed-off-by: Daniel Borkmann
    Cc: Mahesh Bandewar
    Cc: David Ahern
    Cc: Florian Westphal
    Cc: Martynas Pumputis
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

31 Jan, 2019

1 commit

  • [ Upstream commit f97f4dd8b3bb9d0993d2491e0f22024c68109184 ]

    IPv4 routing tables are flushed in two cases:

    1. In response to events in the netdev and inetaddr notification chains
    2. When a network namespace is being dismantled

    In both cases only routes associated with a dead nexthop group are
    flushed. However, a nexthop group will only be marked as dead in case it
    is populated with actual nexthops using a nexthop device. This is not
    the case when the route in question is an error route (e.g.,
    'blackhole', 'unreachable').

    Therefore, when a network namespace is being dismantled such routes are
    not flushed and leaked [1].

    To reproduce:
    # ip netns add blue
    # ip -n blue route add unreachable 192.0.2.0/24
    # ip netns del blue

    Fix this by not skipping error routes that are not marked with
    RTNH_F_DEAD when flushing the routing tables.

    To prevent the flushing of such routes in case #1, add a parameter to
    fib_table_flush() that indicates if the table is flushed as part of
    namespace dismantle or not.
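
    Illustration (not part of the original commit): a simplified userspace
    model of the flushing rule described above. The struct and function
    names are hypothetical; only the decision logic (flush dead routes
    always, flush error routes only on namespace dismantle) follows the
    text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route entry; not the kernel's data layout. */
struct route_model {
    bool dead;            /* models RTNH_F_DEAD on the nexthop group */
    bool has_nexthop_dev; /* false for error routes (blackhole, unreachable) */
    bool flushed;
};

/* Returns the number of routes flushed by this call. */
static int flush_table_model(struct route_model *routes, size_t n, bool flush_all)
{
    int found = 0;

    assert(routes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct route_model *r = &routes[i];
        bool is_error_route = !r->has_nexthop_dev;

        if (r->flushed)
            continue;
        /* Error routes can never be marked dead, so they are only
         * flushed when the whole namespace is going away. */
        if (!r->dead && !(flush_all && is_error_route))
            continue;
        r->flushed = true;
        found++;
    }
    return found;
}
```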

    Note that this problem does not exist in IPv6 since error routes are
    associated with the loopback device.

    [1]
    unreferenced object 0xffff888066650338 (size 56):
    comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba....
    e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............
    backtrace:
    [] inet_rtm_newroute+0x129/0x220
    [] rtnetlink_rcv_msg+0x397/0xa20
    [] netlink_rcv_skb+0x132/0x380
    [] netlink_unicast+0x4c0/0x690
    [] netlink_sendmsg+0x929/0xe10
    [] sock_sendmsg+0xc8/0x110
    [] ___sys_sendmsg+0x77a/0x8f0
    [] __sys_sendmsg+0xf7/0x250
    [] do_syscall_64+0x14d/0x610
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff
    unreferenced object 0xffff888061621c88 (size 48):
    comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
    hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_....
    backtrace:
    [] fib_table_insert+0x978/0x1500
    [] inet_rtm_newroute+0x129/0x220
    [] rtnetlink_rcv_msg+0x397/0xa20
    [] netlink_rcv_skb+0x132/0x380
    [] netlink_unicast+0x4c0/0x690
    [] netlink_sendmsg+0x929/0xe10
    [] sock_sendmsg+0xc8/0x110
    [] ___sys_sendmsg+0x77a/0x8f0
    [] __sys_sendmsg+0xf7/0x250
    [] do_syscall_64+0x14d/0x610
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    Fixes: 8cced9eff1d4 ("[NETNS]: Enable routing configuration in non-initial namespace.")
    Signed-off-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     

10 Jan, 2019

3 commits

  • commit 21ba8847f857028dc83a0f341e16ecc616e34740 upstream.

    Currently, we use check_hlist() for garbage collection. However, we
    use the ‘zone’ from the counted entry to query the existence of
    existing entries in the hlist. This could be wrong when they are in
    different zones, and this patch fixes this issue.
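
    Illustration (not part of the original commit): a small userspace
    sketch of zone-aware garbage collection, assuming each stored entry
    remembers its own zone and liveness is checked against that zone. The
    types and the flat "live" table are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: a connection is identified by tuple AND zone. */
struct conn_key_model {
    uint32_t tuple_id;
    uint16_t zone;
};

static bool is_live(const struct conn_key_model *live, size_t nlive,
                    struct conn_key_model k)
{
    for (size_t i = 0; i < nlive; i++)
        if (live[i].tuple_id == k.tuple_id && live[i].zone == k.zone)
            return true;
    return false;
}

/* Compacts entries in place, keeping those still live in their own zone.
 * Returns the number of entries kept. */
static size_t gc_entries_model(struct conn_key_model *entries, size_t n,
                               const struct conn_key_model *live, size_t nlive)
{
    size_t kept = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (is_live(live, nlive, entries[i]))
            entries[kept++] = entries[i];
    return kept;
}
```

    Looking up every entry with a single caller-supplied zone instead would
    drop the second entry in a list such as {(1, zone 5), (2, zone 7)} even
    though it is still alive in zone 7.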

    Fixes: e59ea3df3fc2 ("netfilter: xt_connlimit: honor conntrack zone if available")
    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Pablo Neira Ayuso

    [mfo: backport: refresh context lines and use older symbol/file names, note hunk 5:
    - nf_conncount.c -> xt_connlimit.c
    - nf_conncount_rb -> xt_connlimit_rb
    - nf_conncount_tuple -> xt_connlimit_conn
    - hunk 5: remove check for non-NULL 'tuple', that isn't required as it's introduced
    by upstream commit 35d8deb80 ("netfilter: conncount: Support count only use case")
    which addresses nf_conncount_count() that does not exist yet -- it's introduced by
    upstream commit 625c556118f3 ("netfilter: connlimit: split xt_connlimit into front
    and backend"), a refactor change.
    - nft_connlimit.c -> removed, not used/doesn't exist yet.]
    Signed-off-by: Mauricio Faria de Oliveira

    Signed-off-by: Sasha Levin

    Yi-Hung Wei
     
  • commit 5e5cbc7b23eaf13e18652c03efbad5be6995de6a upstream.

    This patch provides an interface to maintain the list of connections and
    the lookup function to obtain the number of connections in the list.

    Signed-off-by: Pablo Neira Ayuso

    [mfo: backport: refresh context lines and use older symbol/file names:
    - nf_conntrack_count.h: new file, add include guards.
    - nf_conncount.c -> xt_connlimit.c.
    - nf_conncount_rb -> xt_connlimit_rb
    - nf_conncount_tuple -> xt_connlimit_conn
    - conncount_rb_cachep -> connlimit_rb_cachep
    - conncount_conn_cachep -> connlimit_conn_cachep]
    Signed-off-by: Mauricio Faria de Oliveira

    Signed-off-by: Sasha Levin

    Pablo Neira Ayuso
     
  • [ Upstream commit 3a0ed3e9619738067214871e9cb826fa23b2ddb9 ]

    Al Viro mentioned that there is probably a race condition
    lurking in accesses of sk_stamp on 32-bit machines.

    sock->sk_stamp is of type ktime_t which is always an s64.
    On a 32-bit architecture, we might run into situations of
    unsafe access as the access to the field becomes non-atomic.

    Use seqlocks for synchronization.
    This allows us to avoid using spinlocks for readers as
    readers do not need mutual exclusion.

    Another approach to solve this is to require sk_lock for all
    modifications of the timestamps. The current approach allows
    for timestamps to have their own lock: sk_stamp_lock.
    This allows for the patch to not compete with already
    existing critical sections, and side effects are limited
    to the paths in the patch.
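
    Illustration (not part of the original commit): a userspace sketch of
    the sequence-counter idea using C11 atomics. It assumes a single writer
    and models the 64-bit timestamp as two 32-bit halves, as a 32-bit CPU
    would store it. The kernel's seqlock API is not used here and all names
    are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct stamp_model {
    atomic_uint seq;     /* even: stable, odd: write in progress */
    _Atomic uint32_t lo; /* low half of the 64-bit value */
    _Atomic uint32_t hi; /* high half of the 64-bit value */
};

/* Single-writer update: bump the counter to odd, store, bump to even. */
static void stamp_write(struct stamp_model *s, int64_t v)
{
    assert(s != NULL);
    atomic_fetch_add(&s->seq, 1u);
    atomic_store(&s->lo, (uint32_t)(uint64_t)v);
    atomic_store(&s->hi, (uint32_t)((uint64_t)v >> 32));
    atomic_fetch_add(&s->seq, 1u);
}

/* Lock-free read: retry if a write was in progress or completed meanwhile. */
static int64_t stamp_read(struct stamp_model *s)
{
    unsigned int before, after;
    uint32_t lo, hi;

    assert(s != NULL);
    do {
        before = atomic_load(&s->seq);
        lo = atomic_load(&s->lo);
        hi = atomic_load(&s->hi);
        after = atomic_load(&s->seq);
    } while ((before & 1u) || before != after);

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

    Readers never block the writer; they only retry, which is why no
    mutual exclusion is needed on the read side.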

    The addition of the new field maintains the data locality
    optimizations from
    commit 9115e8cd2a0c ("net: reorganize struct sock for better data
    locality")

    Note that all the instances of the sk_stamp accesses
    are either through the ioctl or the syscall recvmsg.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
     

17 Dec, 2018

2 commits

  • [ Upstream commit fb6df5a6234c38a9c551559506a49a677ac6f07a ]

    In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
    a transport's asoc under rcu_read_lock while asoc is freed without
    waiting for a grace period, which leads to a use-after-free panic.

    This patch fixes it by calling kfree_rcu to make asoc be freed after
    a grace period.

    Note that the patch only delays freeing the asoc's memory; it
    won't cause sk to linger longer.

    Thanks to Neil and Marcelo for making this clear.

    Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
    Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport")
    Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
    Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
    Suggested-by: Neil Horman
    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit e6ac64d4c4d095085d7dd71cbd05704ac99829b2 ]

    While skb_push() makes the kernel panic if the skb headroom is less than
    the unaligned hardware header size, it will proceed normally in case we
    copy more than that because of alignment, and we'll silently corrupt
    adjacent slabs.

    In the case fixed by the previous patch,
    "ipv6: Check available headroom in ip6_xmit() even without options", we
    end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
    header and write 16 bytes, starting 2 bytes before the allocated buffer.

    Always check we're not writing before skb->head and, if the headroom is
    not enough, warn and drop the packet.

    v2:
    - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
    (Eric Dumazet)
    - if we avoid the panic, though, we need to explicitly check the headroom
    before the memcpy(), otherwise we'll have corrupted slabs on a running
    kernel, after we warn
    - use __skb_push() instead of skb_push(), as the headroom check is
    already implemented here explicitly (Eric Dumazet)
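
    Illustration (not part of the original commit): the arithmetic behind
    the 14-versus-16-byte case, as a userspace sketch. The 16-byte
    alignment unit is taken from the numbers quoted above; the macro and
    function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HH_MOD_MODEL 16u /* assumed alignment unit for the header copy */

static_assert((HH_MOD_MODEL & (HH_MOD_MODEL - 1u)) == 0,
              "alignment unit must be a power of two");

/* Round the hardware header length up to the alignment unit. */
static size_t aligned_hh_len(size_t hh_len)
{
    return (hh_len + HH_MOD_MODEL - 1u) & ~(size_t)(HH_MOD_MODEL - 1u);
}

/* The copy is safe only if the headroom covers the ALIGNED length,
 * not just the nominal header length. */
static bool headroom_ok(size_t headroom, size_t hh_len)
{
    return headroom >= aligned_hh_len(hh_len);
}
```

    With 14 bytes of headroom and a 14-byte header, the aligned copy is 16
    bytes, so the check fails and the packet would be dropped rather than
    written 2 bytes before the buffer.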

    Signed-off-by: Stefano Brivio
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Brivio
     

06 Dec, 2018

2 commits

  • commit ff45d820a2df163957ad8ab459b6eb6976144c18 upstream.

    Previously the TLS ulp context would leak if we attached a TLS ulp
    to a socket but did not use the TLS_TX setsockopt,
    or did use it but it failed.
    This patch solves the issue by overriding prot[TLS_BASE_TX].close
    and fixing tls_sk_proto_close to work properly
    when it's called with ctx->tx_conf == TLS_BASE_TX.
    This patch also removes ctx->free_resources as we can use ctx->tx_conf
    to obtain the relevant information.

    Fixes: 3c4d7559159b ('tls: kernel TLS support')
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller
    [bwh: Backported to 4.14: Keep using tls_ctx_free() as introduced by
    the earlier backport of "tls: zero the crypto information from
    tls_context before freeing"]
    Signed-off-by: Ben Hutchings
    Signed-off-by: Sasha Levin

    Ilya Lesokhin
     
  • commit 6d88207fcfddc002afe3e2e4a455e5201089d5d9 upstream.

    The tx configuration is now stored in ctx->tx_conf.
    And sk->sk_prot is updated through a function.
    This will simplify things when we add rx
    and support for different possible
    tx and rx cross configurations.

    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller
    Signed-off-by: Ben Hutchings
    Signed-off-by: Sasha Levin

    Ilya Lesokhin
     

01 Dec, 2018

1 commit

  • commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
    in tcp_close()

    While a socket is being closed, it is very possible other
    threads find it in rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount
    of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
    enough to trigger the warning.

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

18 Oct, 2018

4 commits

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic
    without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ]

    In normal SYN processing, packets are handled without listener
    lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us and SYN
    packets might be processed in process context, after being
    queued into socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
    accessing ireq_opt") I made a very stupid fix, that happened
    to work mostly because of the regular path being RCU protected.

    Really the thing protecting ireq->ireq_opt is RCU read lock,
    and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
    block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
    pair in the paths that might be taken when processing SYN from
    socket backlog (thus possibly in process context)

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit af7d6cce53694a88d6a1bb60c9a239a6a5144459 ]

    Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
    exceptions"), exceptions get deprecated separately from cached
    routes. In particular, administrative changes don't clear PMTU anymore.

    As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
    on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
    the local MTU change can become stale:
    - if the local MTU is now lower than the PMTU, that PMTU is now
    incorrect
    - if the local MTU was the lowest value in the path, and is increased,
    we might discover a higher PMTU

    Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
    cases.

    If the exception was locked, the discovered PMTU was smaller than the
    minimal accepted PMTU. In that case, if the new local MTU is smaller
    than the current PMTU, let PMTU discovery figure out if locking of the
    exception is still needed.
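
    Illustration (not part of the original commit): a userspace sketch of
    the update rule described above. The struct and function names are
    hypothetical, and the comparisons follow the wording of this message
    rather than the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cached PMTU exception. */
struct pmtu_exception_model {
    uint32_t pmtu;
    bool locked;
};

static void update_pmtu_on_mtu_change(struct pmtu_exception_model *e,
                                      uint32_t new_mtu, uint32_t old_mtu)
{
    assert(e != NULL);
    if (e->locked) {
        /* Let PMTU discovery decide again whether locking is needed. */
        if (new_mtu < e->pmtu) {
            e->pmtu = new_mtu;
            e->locked = false;
        }
    } else if (new_mtu < e->pmtu || old_mtu == e->pmtu) {
        /* Either the cached PMTU is now too high, or the local link was
         * the bottleneck and a higher PMTU may be discoverable. */
        e->pmtu = new_mtu;
    }
}
```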

    To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
    notifier. By the time the notifier is called, dev->mtu has been
    changed. This patch adds the old MTU as additional information in the
    notifier structure, and a new call_netdevice_notifiers_u32() function.

    Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Stefano Brivio
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit d4859d749aa7090ffb743d15648adb962a1baeae ]

    Syzkaller reported this on a slightly older kernel but it's still
    applicable to the current kernel -

    ======================================================
    WARNING: possible circular locking dependency detected
    4.18.0-next-20180823+ #46 Not tainted
    ------------------------------------------------------
    syz-executor4/26841 is trying to acquire lock:
    00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652

    but task is already holding lock:
    00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline]
    00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (rtnl_mutex){+.+.}:
    __mutex_lock_common kernel/locking/mutex.c:925 [inline]
    __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
    mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
    rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
    bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline]
    bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320
    process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153
    worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
    kthread+0x35a/0x420 kernel/kthread.c:246
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

    -> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}:
    process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129
    worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
    kthread+0x35a/0x420 kernel/kthread.c:246
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

    -> #0 ((wq_completion)bond_dev->name){+.+.}:
    lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
    flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
    drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
    destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
    __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
    bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
    register_netdevice+0x337/0x1100 net/core/dev.c:8410
    bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
    rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
    rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
    netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
    netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
    netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
    netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg+0xd5/0x120 net/socket.c:632
    ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
    __sys_sendmsg+0x11d/0x290 net/socket.c:2153
    __do_sys_sendmsg net/socket.c:2162 [inline]
    __se_sys_sendmsg net/socket.c:2160 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex

    Possible unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
    lock(rtnl_mutex);
                            lock((work_completion)(&(&nnw->work)->work));
                            lock(rtnl_mutex);
    lock((wq_completion)bond_dev->name);

    *** DEADLOCK ***

    1 lock held by syz-executor4/26841:

    stack backtrace:
    CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ #46
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
    print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222
    check_prev_add kernel/locking/lockdep.c:1862 [inline]
    check_prevs_add kernel/locking/lockdep.c:1975 [inline]
    validate_chain kernel/locking/lockdep.c:2416 [inline]
    __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412
    lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
    flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
    drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
    destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
    __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
    bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
    register_netdevice+0x337/0x1100 net/core/dev.c:8410
    bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
    rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
    rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
    netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
    netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
    netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
    netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg+0xd5/0x120 net/socket.c:632
    ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
    __sys_sendmsg+0x11d/0x290 net/socket.c:2153
    __do_sys_sendmsg net/socket.c:2162 [inline]
    __se_sys_sendmsg net/socket.c:2160 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x457089
    Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089
    RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
    RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001

    Signed-off-by: Mahesh Bandewar
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Bandewar
     

29 Sep, 2018

2 commits

  • commit e285d5bfb7e9785d289663baef252dd315e171f8 upstream.

    According to ETSI TS 102 622 specification chapter 4.4 pipe identifier
    is 7 bits long which allows for 128 unique pipe IDs. Because
    NFC_HCI_MAX_PIPES is used as the number of pipes supported and not
    as the max pipe ID, its value should be 128 instead of 127.

    nfc_hci_recv_from_llc extracts pipe ID from packet header using
    NFC_HCI_FRAGMENT(0x7F) mask which allows for pipe ID value of 127.
    Same happens when NCI_HCP_MSG_GET_PIPE() is being used. With
    the pipes array having only 127 elements and a pipe ID of 127, an OOB
    memory access will result.
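
    Illustration (not part of the original commit): the off-by-one in
    plain C. The 0x7F mask comes from the text above; the other names are
    hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PIPE_ID_MASK_MODEL 0x7Fu /* 7-bit pipe identifier */
#define MAX_PIPES_OLD_MODEL 127u /* array length before the fix */
#define MAX_PIPES_NEW_MODEL 128u /* array length after the fix */

/* A 7-bit ID yields values 0..127, so the array needs mask + 1 slots. */
static_assert(PIPE_ID_MASK_MODEL + 1u == MAX_PIPES_NEW_MODEL,
              "array must hold every ID the mask can produce");

static unsigned int pipe_id_from_header(uint8_t header)
{
    return header & PIPE_ID_MASK_MODEL;
}

static bool index_in_bounds(unsigned int pipe_id, unsigned int array_len)
{
    return pipe_id < array_len;
}
```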

    Cc: Samuel Ortiz
    Cc: Allen Pais
    Cc: "David S. Miller"
    Suggested-by: Dan Carpenter
    Signed-off-by: Suren Baghdasaryan
    Reviewed-by: Kees Cook
    Cc: stable
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     
  • [ Upstream commit 86029d10af18381814881d6cce2dd6872163b59f ]

    This contains key material in crypto_send_aes_gcm_128 and
    crypto_recv_aes_gcm_128.

    Introduce union tls_crypto_context, and replace the two identical
    unions directly embedded in struct tls_context with it. We can then
    use this union to clean up the memory in the new tls_ctx_free()
    function.
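
    Illustration (not part of the original commit): a userspace sketch of
    wiping key material before the memory is released. The volatile write
    loop is an assumption chosen here so the compiler does not drop the
    stores; the struct layout and names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a context holding key material. */
struct crypto_ctx_model {
    unsigned char key[16];
    unsigned char iv[8];
};

/* Zero a buffer through a volatile pointer so the stores are kept. */
static void zero_explicit_model(void *p, size_t n)
{
    volatile unsigned char *b = p;

    while (n--)
        *b++ = 0;
}

/* Wipe the whole context; a real free routine would release it afterwards. */
static void ctx_wipe_model(struct crypto_ctx_model *ctx)
{
    assert(ctx != NULL);
    zero_explicit_model(ctx, sizeof(*ctx));
}

static bool is_all_zero(const void *p, size_t n)
{
    const unsigned char *b = p;

    for (size_t i = 0; i < n; i++)
        if (b[i] != 0)
            return false;
    return true;
}
```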

    Fixes: 3c4d7559159b ("tls: kernel TLS support")
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     

20 Sep, 2018

11 commits

  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

    * If a fragment arrives after the current tail fragment (with a gap),
    it starts a new "tail" run, and is inserted into the rb-tree
    at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).
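
    Illustration (not part of the original commit): the three cases above
    as a userspace classification function. Overlap handling is left out
    and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum frag_action_model {
    FRAG_APPEND_TO_TAIL_RUN,  /* in order, adjacent to the tail: no rb-tree work */
    FRAG_START_NEW_TAIL_RUN,  /* after the tail with a gap: new node at the end */
    FRAG_INSERT_SINGLE        /* before the tail run: inserted on its own */
};

/* Hypothetical summary of the current "active" run at the tail. */
struct tail_run_model {
    unsigned int start; /* offset of the first byte in the tail run */
    unsigned int end;   /* offset just past the last byte in the tail run */
};

static enum frag_action_model
classify_fragment(const struct tail_run_model *tail, unsigned int offset)
{
    assert(tail != NULL);
    if (offset == tail->end)
        return FRAG_APPEND_TO_TAIL_RUN;
    if (offset > tail->end)
        return FRAG_START_NEW_TAIL_RUN;
    return FRAG_INSERT_SINGLE;
}
```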

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

    Since we plan to use an RB tree for the TCP retransmit queue to speed up SACK
    processing with large BDP, this patch exchanges skb->dev and
    skb->tstamp.

    This saves some overhead in both TCP and netem.

    v2: removes the swtstamp field from struct tcp_skb_cb

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Put the read-mostly fields in a separate cache line
    at the beginning of struct netns_frags, to reduce
    false sharing noticed in inet_frag_kill()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.

    Note that this patch avoids an overflow error, if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true :

    if (... || frag_mem_limit(nf) > nf->high_thresh)
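
    Illustration (not part of the original commit): why a 32-bit counter
    can never exceed a threshold near 2GB, as a userspace sketch with
    hypothetical function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* With 32-bit signed values the counter tops out at INT32_MAX, so it can
 * never be strictly greater than a threshold set to that maximum. */
static bool over_limit_32(int32_t mem, int32_t high_thresh)
{
    assert(high_thresh >= 0);
    return mem > high_thresh;
}

/* With 64-bit values the same comparison works past 2GB. */
static bool over_limit_64(int64_t mem, int64_t high_thresh)
{
    assert(high_thresh >= 0);
    return mem > high_thresh;
}
```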

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • This function is obsolete, after rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • This refactors ip_expire() since one indentation level is removed.

    Note: in the future, we should try hard to avoid the skb_clone()
    since this is a serious performance cost.
    Under DDOS, the ICMP message won't be sent because of rate limits.

    The fact that ip6_expire_frag_queue() does not use skb_clone() is
    disturbing too. Presumably IPv6 should have the same
    issue as the one we fixed in commit ec4fbd64751d
    ("inet: frag: release spinlock before calling icmp_send()")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

    Also since we use rhashtable we can bring back the number of fragments
    in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
    removed in commit 434d305405ab ("inet: frag: don't account number
    of fragment queues")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Some applications still rely on IP fragmentation, and to be fair linux
    reassembly unit is not working under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

    A work queue is supposed to garbage collect items when host is under memory
    pressure, and doing a hash rebuild, changing seed used in hash computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and allocate one of them per netns
    to speed up netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Cc: linux-wpan@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: netfilter-devel@vger.kernel.org
    Cc: coreteam@netfilter.org
    Signed-off-by: Kees Cook
    Acked-by: Stefan Schmidt # for ieee802154
    Signed-off-by: David S. Miller
    (cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
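The callback style described above, where the handler receives the timer pointer and recovers its enclosing object, can be sketched in userspace C. This is only an analogy: the `_sketch` types and helpers are hypothetical stand-ins, not the kernel's `struct timer_list`, `timer_setup()` (which also takes a flags argument) or `from_timer()`.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for a timer that passes itself to its callback. */
struct timer_sketch {
    void (*function)(struct timer_sketch *t);
};

/* Hypothetical object embedding the timer, e.g. a fragment queue. */
struct queue_sketch {
    int id;
    struct timer_sketch timer;
    int expired;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback gets the timer pointer, not an opaque unsigned long cookie. */
static void queue_expire(struct timer_sketch *t)
{
    struct queue_sketch *q = container_of_sketch(t, struct queue_sketch, timer);

    q->expired = 1;
}

/* Register the callback once, at setup time. */
static void timer_setup_sketch(struct timer_sketch *t,
                               void (*fn)(struct timer_sketch *))
{
    t->function = fn;
}
```

Firing the timer then amounts to calling `q.timer.function(&q.timer)`, and the handler finds `q` without any cast from an integer argument.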
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter:

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
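One way to picture why the extra parameter can go away: if the per-namespace state holds a pointer to the shared descriptor, a function that has the queue can reach the descriptor on its own. The sketch below uses hypothetical, simplified types and field names; it is not the kernel's actual layout.

```c
/* Hypothetical stand-in for the shared struct inet_frags descriptor. */
struct frags_sketch {
    int destroyed;                 /* counts destroyed queues, for illustration */
};

/* Hypothetical per-namespace state, now carrying a pointer to the descriptor. */
struct netns_frags_sketch {
    struct frags_sketch *f;
};

/* Hypothetical queue that already knows its namespace state. */
struct frag_queue_sketch {
    struct netns_frags_sketch *net;
};

/* Only the queue is passed; the descriptor is found through q->net->f. */
static void frag_destroy_sketch(struct frag_queue_sketch *q)
{
    struct frags_sketch *f = q->net->f;

    f->destroyed++;
}
```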
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

15 Sep, 2018

1 commit

  • [ Upstream commit 037b0b86ecf5646f8eae777d8b52ff8b401692ec ]

    Let's not turn the TCP ULP lookup into an arbitrary module loader as
    we only intend to load ULP modules through this mechanism, not other
    unrelated kernel modules:

    [root@bar]# cat foo.c
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>
    #include <linux/in.h>

    int main(void)
    {
        int sock = socket(PF_INET, SOCK_STREAM, 0);
        setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
        return 0;
    }

    [root@bar]# gcc foo.c -O2 -Wall
    [root@bar]# lsmod | grep sctp
    [root@bar]# ./a.out
    [root@bar]# lsmod | grep sctp
    sctp 1077248 4
    libcrc32c 16384 3 nf_conntrack,nf_nat,sctp
    [root@bar]#

    Fix it by adding module alias to TCP ULP modules, so probing module
    via request_module() will be limited to tcp-ulp-[name]. The existing
    modules like kTLS will load fine given tcp-ulp-tls alias, but others
    will fail to load:

    [root@bar]# lsmod | grep sctp
    [root@bar]# ./a.out
    [root@bar]# lsmod | grep sctp
    [root@bar]#

    Sockmap is not affected by this since it's either built-in or not.

    Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
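The effect of the alias prefix can be illustrated with a small userspace sketch: the name supplied by userspace is never probed directly, only the prefixed form is. The helper name and the length bound below are assumptions for illustration; the kernel side would hand the resulting string to request_module().

```c
#include <stdio.h>
#include <string.h>

#define ULP_NAME_MAX_SKETCH 16   /* assumed upper bound on a ULP name */

/*
 * Build the module alias to probe for a requested ULP name.
 * Returns 0 on success, -1 if the name is empty, too long, or does not fit.
 */
static int ulp_module_alias(const char *name, char *buf, size_t len)
{
    size_t n = strlen(name);
    int written;

    if (n == 0 || n >= ULP_NAME_MAX_SKETCH)
        return -1;

    written = snprintf(buf, len, "tcp-ulp-%s", name);
    if (written < 0 || (size_t)written >= len)
        return -1;
    return 0;
}
```

With this scheme a request for "sctp" turns into a probe for "tcp-ulp-sctp", which no module provides, while "tls" maps to "tcp-ulp-tls", the alias the commit message says kTLS carries.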
     

24 Aug, 2018

4 commits

  • [ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]

    After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
    related callbacks are no longer needed.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit 38230a3e0e0933bbcf5df6fa469ba0667f667568 ]

    The control action in the common member of struct tcf_tunnel_key must be a
    valid value, as it can contain the chain index when 'goto chain' is used.
    Ensure that the control action can be read as x->tcfa_action, when x is a
    pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
    prevent the following command:

    # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
    > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1

    from causing a NULL dereference when a matching packet is received:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 3491 Comm: mausezahn Tainted: G E 4.18.0-rc2.auguri+ #421
    Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
    RIP: 0010:tcf_action_exec+0xb8/0x100
    Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
    RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
    RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
    RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
    R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
    R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
    FS: 00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
    Call Trace:

    fl_classify+0x1ad/0x1c0 [cls_flower]
    ? __update_load_avg_se.isra.47+0x1ca/0x1d0
    ? __update_load_avg_se.isra.47+0x1ca/0x1d0
    ? update_load_avg+0x665/0x690
    ? update_load_avg+0x665/0x690
    ? kmem_cache_alloc+0x38/0x1c0
    tcf_classify+0x89/0x140
    __netif_receive_skb_core+0x5ea/0xb70
    ? enqueue_entity+0xd0/0x270
    ? process_backlog+0x97/0x150
    process_backlog+0x97/0x150
    net_rx_action+0x14b/0x3e0
    __do_softirq+0xde/0x2b4
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.18+0x49/0x50
    __local_bh_enable_ip+0x49/0x50
    __dev_queue_xmit+0x4ab/0x8a0
    ? wait_woken+0x80/0x80
    ? packet_sendmsg+0x38f/0x810
    ? __dev_queue_xmit+0x8a0/0x8a0
    packet_sendmsg+0x38f/0x810
    sock_sendmsg+0x36/0x40
    __sys_sendto+0x10e/0x140
    ? do_vfs_ioctl+0xa4/0x630
    ? syscall_trace_enter+0x1df/0x2e0
    ? __audit_syscall_exit+0x22a/0x290
    __x64_sys_sendto+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fd67e18dc93
    Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
    RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
    RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
    RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
    R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
    Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
    mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
    CR2: 0000000000000000
    ---[ end trace 1ab8b5b5d4639dfc ]---
    RIP: 0010:tcf_action_exec+0xb8/0x100
    Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
    RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
    RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
    RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
    R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
    R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
    FS: 00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
    Kernel panic - not syncing: Fatal exception in interrupt
    Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

    Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Davide Caratti
     
  • [ Upstream commit a9ba23d48dbc6ffd08426bb10f05720e0b9f5c14 ]

    At present the ipv6_renew_options_kern() function ends up calling into
    access_ok() which is problematic if done from inside an interrupt as
    access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures
    (x86-64 is affected). Example warning/backtrace is shown below:

    WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90
    ...
    Call Trace:

    ipv6_renew_option+0xb2/0xf0
    ipv6_renew_options+0x26a/0x340
    ipv6_renew_options_kern+0x2c/0x40
    calipso_req_setattr+0x72/0xe0
    netlbl_req_setattr+0x126/0x1b0
    selinux_netlbl_inet_conn_request+0x80/0x100
    selinux_inet_conn_request+0x6d/0xb0
    security_inet_conn_request+0x32/0x50
    tcp_conn_request+0x35f/0xe00
    ? __lock_acquire+0x250/0x16c0
    ? selinux_socket_sock_rcv_skb+0x1ae/0x210
    ? tcp_rcv_state_process+0x289/0x106b
    tcp_rcv_state_process+0x289/0x106b
    ? tcp_v6_do_rcv+0x1a7/0x3c0
    tcp_v6_do_rcv+0x1a7/0x3c0
    tcp_v6_rcv+0xc82/0xcf0
    ip6_input_finish+0x10d/0x690
    ip6_input+0x45/0x1e0
    ? ip6_rcv_finish+0x1d0/0x1d0
    ipv6_rcv+0x32b/0x880
    ? ip6_make_skb+0x1e0/0x1e0
    __netif_receive_skb_core+0x6f2/0xdf0
    ? process_backlog+0x85/0x250
    ? process_backlog+0x85/0x250
    ? process_backlog+0xec/0x250
    process_backlog+0xec/0x250
    net_rx_action+0x153/0x480
    __do_softirq+0xd9/0x4f7
    do_softirq_own_stack+0x2a/0x40

    ...

    While not present in the backtrace, ipv6_renew_option() ends up calling
    access_ok() via the following chain:

    access_ok()
    _copy_from_user()
    copy_from_user()
    ipv6_renew_option()

    The fix presented in this patch is to perform the userspace copy
    earlier in the call chain such that it is only called when the option
    data is actually coming from userspace; that place is
    do_ipv6_setsockopt(). Not only does this solve the problem seen in
    the backtrace above, it also allows us to simplify the code quite a
    bit by removing ipv6_renew_options_kern() completely. We also take
    this opportunity to clean up ipv6_renew_options()/ipv6_renew_option()
    a small amount as well.

    This patch is heavily based on a rough patch by Al Viro. I've taken
    his original patch, converted a kmemdup() call in do_ipv6_setsockopt()
    to a memdup_user() call, made better use of the e_inval jump target in
    the same function, and cleaned up the use of ipv6_renew_option() by
    ipv6_renew_options().

    CC: Al Viro
    Signed-off-by: Paul Moore
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • [ Upstream commit 9ce7bc036ae4cfe3393232c86e9e1fea2153c237 ]

    It is a waste of memory to use a full "struct netns_sysctl_ipv6"
    while only one pointer is really used, considering netns_sysctl_ipv6
    keeps growing.

    Also, since "struct netns_frags" has cache line alignment,
    it is better to move the frags_hdr pointer outside, otherwise
    we spend a full cache line for this pointer.

    This saves 192 bytes of memory per netns.

    Fixes: c038a767cd69 ("ipv6: add a new namespace for nf_conntrack_reasm")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

22 Aug, 2018

2 commits

  • [ Upstream commit 455f05ecd2b219e9a216050796d30c830d9bc393 ]

    syzbot reported that we reinitialize an active delayed
    work in vsock_stream_connect():

    ODEBUG: init active (active state 0) object type: timer_list hint:
    delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
    WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
    debug_print_object+0x16a/0x210 lib/debugobjects.c:326

    The pattern is apparently wrong: we should only initialize
    the delayed work once and can then schedule it repeatedly. So we
    have to move the initializations out to the allocation side.
    And to avoid confusion, we can split the shared dwork
    into two, instead of re-using the same one.

    Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
    Reported-by:
    Cc: Andy king
    Cc: Stefan Hajnoczi
    Cc: Jorgen Hansen
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit 0dcb82254d65f72333aa50ad626d1e9665ad093b ]

    llc_sap_put() decreases the refcnt before deleting sap
    from the global list. Therefore, there is a chance
    llc_sap_find() could find a sap with zero refcnt
    in this global list.

    Close this race condition by checking if refcnt is zero
    or not in llc_sap_find(), if it is zero then it is being
    removed so we can just treat it as gone.

    Reported-by:
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
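The lookup-side check described above (treat an object whose refcount has already dropped to zero as gone) can be sketched with C11 atomics. The struct and function names are hypothetical stand-ins for struct llc_sap and llc_sap_find(); the helper is similar in spirit to the kernel's refcount_inc_not_zero(), but this is a userspace illustration without the RCU and list locking of the real code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry in the global SAP list. */
struct sap_sketch {
    atomic_int refcnt;
    unsigned char value;
    struct sap_sketch *next;
};

/* Take a reference only if the count is still non-zero. */
static bool sap_hold_not_zero(struct sap_sketch *s)
{
    int old = atomic_load(&s->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current count and we retry. */
        if (atomic_compare_exchange_weak(&s->refcnt, &old, old + 1))
            return true;
    }
    return false;   /* count hit zero: the entry is being removed */
}

/* Find an entry by value; an entry with zero refcount is treated as absent. */
static struct sap_sketch *sap_find_sketch(struct sap_sketch *head,
                                          unsigned char value)
{
    for (struct sap_sketch *s = head; s != NULL; s = s->next) {
        if (s->value == value)
            return sap_hold_not_zero(s) ? s : NULL;
    }
    return NULL;
}
```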
     

03 Aug, 2018

1 commit


28 Jul, 2018

3 commits

  • [ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]

    Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
    has to be sent immediately so the sender can respond quickly:

    """ When receiving packets, the CE codepoint MUST be processed as follows:

    1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
    true and send an immediate ACK.

    2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
    to false and send an immediate ACK.
    """

    Previously the DCTCP implementation could continue to delay the ACK. This
    patch fixes that to implement the RFC by forcing an immediate ACK.

    Tested with this packetdrill script provided by Larry Brakmo:

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < [ect0] SEW 0:0(0) win 32792
    0.100 > SE. 0:0(0) ack 1
    0.110 < [ect0] . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

    0.200 < [ect0] . 1:1001(1000) ack 1 win 257
    0.200 > [ect01] . 1:1(0) ack 1001

    0.200 write(4, ..., 1) = 1
    0.200 > [ect01] P. 1:2(1) ack 1001

    0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
    +0.005 < [ce] . 2001:3001(1000) ack 2 win 257

    +0.000 > [ect01] . 2:2(0) ack 2001
    // Previously the ACK below would be delayed by 40ms
    +0.000 > [ect01] E. 2:2(0) ack 3001

    +0.500 < F. 9501:9501(0) ack 4 win 257

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
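The two RFC 8257 rules quoted above reduce to "ACK immediately whenever the CE state flips". A minimal sketch of that receiver rule, using a hypothetical state struct rather than the kernel's DCTCP data structures:

```c
#include <stdbool.h>

/* Hypothetical receiver state; ce mirrors DCTCP.CE from RFC 8257. */
struct dctcp_rx_sketch {
    bool ce;
};

/*
 * Process the CE codepoint of an incoming packet.
 * Returns true when an immediate ACK is required, i.e. when the
 * codepoint differs from the recorded DCTCP.CE state.
 */
static bool dctcp_rx_packet(struct dctcp_rx_sketch *rx, bool ce_marked)
{
    if (ce_marked != rx->ce) {
        rx->ce = ce_marked;
        return true;
    }
    return false;   /* no state change: normal delayed-ACK rules apply */
}
```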
     
  • [ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]

    Currently when a DCTCP receiver delays an ACK and receives a
    data packet with a different CE mark from the previous one's, it
    sends two immediate ACKs acking the previous and latest sequences
    respectively (for ECN accounting).

    Previously sending the first ACK may cancel the delayed ACK timer
    (tcp_event_ack_sent). This may subsequently prevent sending the
    second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
    The culprit is that tcp_send_ack() assumes it always acknowledges
    the latest sequence, which is not true for the first special ACK.

    The fix is to not make the assumption in tcp_send_ack and check the
    actual ack sequence before cancelling the delayed ACK. Further it's
    safer to pass the ack sequence number as a local variable into
    tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
    future bugs like this.

    Reported-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
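The fix described above can be sketched as "pass the ACK sequence explicitly and only cancel the delayed ACK when it covers the latest data". The struct and function below are hypothetical simplifications, not the kernel's tcp_send_ack() or its socket state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified connection state. */
struct conn_sketch {
    uint32_t rcv_nxt;        /* next sequence number expected */
    bool delack_pending;     /* delayed-ACK timer is armed */
    uint32_t last_ack_sent;  /* sequence carried by the last ACK */
};

/*
 * Send an ACK for ack_seq. The delayed ACK is cancelled only when this
 * ACK actually acknowledges everything received so far.
 */
static void send_ack_sketch(struct conn_sketch *c, uint32_t ack_seq)
{
    c->last_ack_sent = ack_seq;
    if (ack_seq == c->rcv_nxt)
        c->delack_pending = false;
}
```

In the two-ACK case, the first ACK carries the older sequence and leaves the pending delayed ACK alone, so the second ACK for the latest sequence is still sent.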
     
  • [ Upstream commit 24b711edfc34bc45777a3f068812b7d1ed004a5d ]

    Example setup:
    host: ip -6 addr add dev eth1 2001:db8:104::4
    where eth1 is enslaved to a VRF

    switch: ip -6 ro add 2001:db8:104::4/128 dev br1
    where br1 only has an LLA

    ping6 2001:db8:104::4
    ssh 2001:db8:104::4

    (NOTE: UDP works fine if the PKTINFO has the address set to the global
    address and ifindex is set to the index of eth1, with an LLA as the
    destination).

    For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
    L3 master. If it is then return the ifindex from rt6i_idev similar
    to what is done for loopback.

    For TCP, restore the original tcp_v6_iif definition which is needed in
    most places and add a new tcp_v6_iif_l3_slave that considers the
    l3_slave variability. This latter check is only needed for socket
    lookups.

    Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

25 Jul, 2018

1 commit

  • [ Upstream commit 169dc027fb02492ea37a0575db6a658cf922b854 ]

    The rol32 call is currently rotating hash but the rol'd value is
    being discarded. I believe the current code is incorrect and hash
    should be assigned the rotated value returned from rol32.

    Thanks to David Lebrun for spotting this.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Colin Ian King
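The bug class described above is easy to reproduce in userspace: a rotate helper returns its result and does not modify its argument, so the value must be assigned back. The rol32() below is a userspace stand-in for the kernel helper, and the rotation amount of 16 is only illustrative.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's rol32(): rotate a 32-bit word left. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
    return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Buggy shape: the rotated value is computed and then discarded. */
static uint32_t mix_discarding(uint32_t hash)
{
    (void)rol32(hash, 16);
    return hash;
}

/* Fixed shape: the rotated value is assigned back to hash. */
static uint32_t mix_assigning(uint32_t hash)
{
    hash = rol32(hash, 16);
    return hash;
}
```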