08 Mar, 2020

1 commit

  • Merge Linux stable release v5.4.24 into imx_5.4.y

    * tag 'v5.4.24': (3306 commits)
    Linux 5.4.24
    blktrace: Protect q->blk_trace with RCU
    kvm: nVMX: VMWRITE checks unsupported field before read-only field
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6sll-evk.dts
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
    drivers/clk/imx/clk-composite-8m.c
    drivers/gpio/gpio-mxc.c
    drivers/irqchip/Kconfig
    drivers/mmc/host/sdhci-of-esdhc.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/can/flexcan.c
    drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
    drivers/net/ethernet/mscc/ocelot.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/realtek.c
    drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/tee/optee/shm_pool.c
    drivers/usb/cdns3/gadget.c
    kernel/sched/cpufreq.c
    net/core/xdp.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c
    sound/soc/sof/core.c
    sound/soc/sof/imx/Kconfig
    sound/soc/sof/loader.c

    Jason Liu
     

05 Mar, 2020

3 commits

  • [ Upstream commit 7151affeef8d527f50b4b68a871fd28bd660023f ]

    netdev_next_lower_dev_rcu() will be used to implement functions
    that walk all lower interfaces. There are already walkers for
    lower interfaces (netdev_walk_all_lower_dev_rcu() and
    netdev_walk_all_lower_dev()), but there are cases the existing
    netdev_walk_all_lower_dev{_rcu}() functions cannot cover, so some
    modules need to implement their own function for walking all
    lower interfaces.

    In the next patch, netdev_next_lower_dev_rcu() will be used.
    In addition, this patch removes two unused prototypes in netdevice.h.

    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
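A toy Python model (not kernel code; all names here are illustrative) of the iterator pattern this commit exports: a "next lower device" primitive returning the next lower interface plus a cursor, on top of which callers can build their own walkers.

```python
# Toy model of a "next lower dev" iterator: each device keeps a list of
# adjacency entries, and the iterator hands back the next lower device
# together with an opaque cursor, much like the list_head iterator used
# by netdev_next_lower_dev_rcu().
class Dev:
    def __init__(self, name, lowers=()):
        self.name = name
        self.adj_list = list(lowers)   # stands in for dev->adj_list.lower

def next_lower_dev(dev, cursor):
    """Return (lower_dev, new_cursor), or (None, cursor) at end of list."""
    if cursor >= len(dev.adj_list):
        return None, cursor
    return dev.adj_list[cursor], cursor + 1

def walk_all_lower(dev):
    # a caller-implemented walker built on the iterator, as the commit
    # message describes
    out, cursor = [], 0
    while True:
        lower, cursor = next_lower_dev(dev, cursor)
        if lower is None:
            return out
        out.append(lower.name)

bond = Dev("bond0", [Dev("eth0"), Dev("eth1")])
```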
     
  • [ Upstream commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323 ]

    This reverts commit ba27b4cdaaa66561aaedb2101876e563738d36fe

    Ahmed reported out-of-order issues bisected to commit ba27b4cdaaa6
    ("net: dev: introduce support for sch BYPASS for lockless qdisc").
    I can't find any working solution other than a plain revert.

    This will introduce some minor performance regressions for
    pfifo_fast qdisc. I plan to address them in net-next with more
    indirect call wrapper boilerplate for qdiscs.

    Reported-by: Ahmad Fatoum
    Fixes: ba27b4cdaaa6 ("net: dev: introduce support for sch BYPASS for lockless qdisc")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit 540e585a79e9d643ede077b73bcc7aa2d7b4d919 ]

    In commit 709772e6e06564ed94ba740de70185ac3d792773, RT_TABLE_COMPAT was
    added to allow legacy software to deal with routing table numbers >= 256,
    but the same change was overlooked for FIB rule queries.

    Signed-off-by: Jethro Beekman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jethro Beekman
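The reporting rule the fix extends to FIB rule queries can be sketched as follows (a sketch, not the kernel source; the constant values come from the rtnetlink uapi header):

```python
# When a routing table id does not fit in the 8-bit netlink field,
# RT_TABLE_COMPAT is reported instead (value from
# include/uapi/linux/rtnetlink.h).
RT_TABLE_COMPAT = 252

def table_id_for_netlink(tb_id):
    """What the 8-bit table field should carry for legacy software."""
    return RT_TABLE_COMPAT if tb_id > 255 else tb_id
```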
     

24 Feb, 2020

3 commits

  • [ Upstream commit 0a29275b6300f39f78a87f2038bbfe5bdbaeca47 ]

    A negative value should be returned if map->map_type is invalid.
    Although that is impossible now, if we ever run into such a
    situation in the future, the xdp_buff could be leaked.

    Daniel Borkmann suggested:

    -EBADRQC should be returned to stay consistent with generic XDP
    for the tracepoint output and not to be confused with -EOPNOTSUPP
    from other locations like dev_map_enqueue() when ndo_xdp_xmit is
    missing and such.

    Suggested-by: Daniel Borkmann
    Signed-off-by: Li RongQing
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1578618277-18085-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Sasha Levin

    Li RongQing
     
  • [ Upstream commit 0b2dc83906cf1e694e48003eae5df8fa63f76fd9 ]

    We need to have a synchronize_rcu before free'ing the sockhash because any
    outstanding psock references will have a pointer to the map and when they
    use it, this could trigger a use after free.

    This is a sister fix for sockhash, following commit 2bb90e5cc90e ("bpf:
    sockmap, synchronize_rcu before free'ing map") which addressed sockmap.
    Both fixes come from a manual audit.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200206111652.694507-3-jakub@cloudflare.com
    Signed-off-by: Sasha Levin

    Jakub Sitnicki
     
  • [ Upstream commit ad1e03b2b3d4430baaa109b77bc308dc73050de3 ]

    The current generic XDP handler skips execution of XDP programs entirely if
    an SKB is marked as cloned. This leads to some surprising behaviour, as
    packets can end up being cloned in various ways, which will make an XDP
    program not see all the traffic on an interface.

    This was discovered by a simple test case where an XDP program that always
    returns XDP_DROP is installed on a veth device. When combining this with
    the Scapy packet sniffer (which uses an AF_PACKET socket) on the sending
    side, SKBs reliably end up in the cloned state, causing them to be passed
    through to the receiving interface instead of being dropped. A minimal
    reproducer script for this is included below.

    This patch fixes the issue by simply triggering the existing linearisation
    code for cloned SKBs instead of skipping the XDP program execution. This
    behaviour is in line with the behaviour of the native XDP implementation
    for the veth driver, which will reallocate and copy the SKB data if the SKB
    is marked as shared.

    Reproducer Python script (requires BCC and Scapy):

    from scapy.all import TCP, IP, Ether, sendp, sniff, AsyncSniffer, Raw, UDP
    from bcc import BPF
    import time, sys, subprocess, shlex

    SKB_MODE = (1 << 1)
    DRV_MODE = (1 << 2)
    PYTHON=sys.executable

    def client():
        time.sleep(2)
        # Sniffing on the sender causes skb_cloned() to be set
        s = AsyncSniffer()
        s.start()

        for p in range(10):
            sendp(Ether(dst="aa:aa:aa:aa:aa:aa", src="cc:cc:cc:cc:cc:cc")/IP()/UDP()/Raw("Test"),
                  verbose=False)
            time.sleep(0.1)

        s.stop()
        return 0

    def server(mode):
        prog = BPF(text="int dummy_drop(struct xdp_md *ctx) {return XDP_DROP;}")
        func = prog.load_func("dummy_drop", BPF.XDP)
        prog.attach_xdp("a_to_b", func, mode)

        time.sleep(1)

        s = sniff(iface="a_to_b", count=10, timeout=15)
        if len(s):
            print(f"Got {len(s)} packets - should have gotten 0")
            return 1
        else:
            print("Got no packets - as expected")
            return 0

    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <skb|drv>")
        sys.exit(1)

    if sys.argv[1] == "client":
        sys.exit(client())
    elif sys.argv[1] == "server":
        mode = SKB_MODE if sys.argv[2] == 'skb' else DRV_MODE
        sys.exit(server(mode))
    else:
        try:
            mode = sys.argv[1]
            if mode not in ('skb', 'drv'):
                print(f"Usage: {sys.argv[0]} <skb|drv>")
                sys.exit(1)
            print(f"Running in {mode} mode")

            for cmd in [
                    'ip netns add netns_a',
                    'ip netns add netns_b',
                    'ip -n netns_a link add a_to_b type veth peer name b_to_a netns netns_b',
                    # Disable ipv6 to make sure there's no address autoconf traffic
                    'ip netns exec netns_a sysctl -qw net.ipv6.conf.a_to_b.disable_ipv6=1',
                    'ip netns exec netns_b sysctl -qw net.ipv6.conf.b_to_a.disable_ipv6=1',
                    'ip -n netns_a link set dev a_to_b address aa:aa:aa:aa:aa:aa',
                    'ip -n netns_b link set dev b_to_a address cc:cc:cc:cc:cc:cc',
                    'ip -n netns_a link set dev a_to_b up',
                    'ip -n netns_b link set dev b_to_a up']:
                subprocess.check_call(shlex.split(cmd))

            server = subprocess.Popen(shlex.split(f"ip netns exec netns_a {PYTHON} {sys.argv[0]} server {mode}"))
            client = subprocess.Popen(shlex.split(f"ip netns exec netns_b {PYTHON} {sys.argv[0]} client"))

            client.wait()
            server.wait()
            sys.exit(server.returncode)

        finally:
            subprocess.run(shlex.split("ip netns delete netns_a"))
            subprocess.run(shlex.split("ip netns delete netns_b"))

    Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
    Reported-by: Stepan Horacek
    Suggested-by: Paolo Abeni
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Toke Høiland-Jørgensen
     

15 Feb, 2020

4 commits

  • commit 85b8ac01a421791d66c3a458a7f83cfd173fe3fa upstream.

    It's currently possible to insert sockets in unexpected states into
    a sockmap, due to a TOCTTOU when updating the map from a syscall.
    sock_map_update_elem checks that sk->sk_state == TCP_ESTABLISHED,
    locks the socket and then calls sock_map_update_common. At this
    point, the socket may have transitioned into another state, and
    the earlier assumptions don't hold anymore. Crucially, it's
    conceivable (though very unlikely) that a socket has become unhashed.
    This breaks the sockmap's assumption that it will get a callback
    via sk->sk_prot->unhash.

    Fix this by checking the (fixed) sk_type and sk_protocol without the
    lock, followed by a locked check of sk_state.

    Unfortunately it's not possible to push the check down into
    sock_(map|hash)_update_common, since BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
    runs before the socket has transitioned from TCP_SYN_RECV into
    TCP_ESTABLISHED.

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200207103713.28175-1-lmb@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Lorenz Bauer
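The check-then-recheck pattern the fix applies can be sketched as a toy Python model (not the kernel code; state values and return codes are illustrative): immutable fields are checked without the lock, then the mutable state is re-checked under the lock so it cannot change between check and use.

```python
import threading

# Toy model of the TOCTTOU fix: check the fixed fields unlocked, then
# take the lock and re-check the mutable state before committing.
TCP_ESTABLISHED, TCP_CLOSE = 1, 7     # illustrative state values

class Sock:
    def __init__(self, sk_type, sk_state):
        self.sk_type = sk_type        # fixed after creation
        self.sk_state = sk_state      # may change concurrently
        self.lock = threading.Lock()

def map_update(sock):
    # 1) unlocked check of fields that can never change
    if sock.sk_type != "SOCK_STREAM":
        return -1                     # unsupported socket type
    # 2) locked re-check of the state, so it cannot move under us
    with sock.lock:
        if sock.sk_state != TCP_ESTABLISHED:
            return -2                 # reject sockets in unexpected states
        return 0                      # safe to insert into the map
```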
     
  • commit 88d6f130e5632bbf419a2e184ec7adcbe241260b upstream.

    It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
    exponential effects on the number of states in the sparse checker.

    This patch breaks them up by calculating the "nbuckets" first so that the
    "bucket_log" only needs to take ilog2().

    In addition, Linus mentioned:

    Patch looks good, but I'd like to point out that it's not just sparse.

    You can see it with a simple

    make net/core/bpf_sk_storage.i
    grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc

    and see the end result:

    1 365071 2686974

    That's one line (the assignment line) that is 2,686,974 characters in
    length.

    Now, sparse does happen to react particularly badly to that (I didn't
    look to why, but I suspect it's just that evaluating all the types
    that don't actually ever end up getting used ends up being much more
    expensive than it should be), but I bet it's not good for gcc either.

    Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
    Reported-by: Randy Dunlap
    Reported-by: Luc Van Oostenryck
    Suggested-by: Linus Torvalds
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Luc Van Oostenryck
    Link: https://lore.kernel.org/bpf/20200207081810.3918919-1-kafai@fb.com
    Signed-off-by: Greg Kroah-Hartman

    Martin KaFai Lau
     
  • commit 0b2dc83906cf1e694e48003eae5df8fa63f76fd9 upstream.

    We need to have a synchronize_rcu before free'ing the sockhash because any
    outstanding psock references will have a pointer to the map and when they
    use it, this could trigger a use after free.

    This is a sister fix for sockhash, following commit 2bb90e5cc90e ("bpf:
    sockmap, synchronize_rcu before free'ing map") which addressed sockmap.
    Both fixes come from a manual audit.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200206111652.694507-3-jakub@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Jakub Sitnicki
     
  • commit db6a5018b6e008c1d69c6628cdaa9541b8e70940 upstream.

    rcu_read_lock is needed to protect access to psock inside sock_map_unref
    when tearing down the map. However, we can't afford to sleep in lock_sock
    while in RCU read-side critical section. Grab the RCU lock only after we
    have locked the socket.

    This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
    sockmap/sockhash that contains at least one socket:

    | =============================
    | WARNING: suspicious RCU usage
    | 5.5.0-04005-g8fc91b972b73 #450 Not tainted
    | -----------------------------
    | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
    |
    | other info that might help us debug this:
    |
    |
    | rcu_scheduler_active = 2, debug_locks = 1
    | 4 locks held by kworker/0:1/62:
    | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
    | #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
    |
    | stack backtrace:
    | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x71/0xa0
    | ___might_sleep+0x105/0x190
    | lock_sock_nested+0x28/0x90
    | sock_map_free+0x95/0x170
    | bpf_map_free_deferred+0x58/0x80
    | process_one_work+0x260/0x5e0
    | worker_thread+0x4d/0x3e0
    | kthread+0x108/0x140
    | ? process_one_work+0x5e0/0x5e0
    | ? kthread_park+0x90/0x90
    | ret_from_fork+0x3a/0x50

    | =============================
    | WARNING: suspicious RCU usage
    | 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted
    | -----------------------------
    | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
    |
    | other info that might help us debug this:
    |
    |
    | rcu_scheduler_active = 2, debug_locks = 1
    | 4 locks held by kworker/0:1/62:
    | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
    | #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
    |
    | stack backtrace:
    | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x71/0xa0
    | ___might_sleep+0x105/0x190
    | lock_sock_nested+0x28/0x90
    | sock_hash_free+0xec/0x1d0
    | bpf_map_free_deferred+0x58/0x80
    | process_one_work+0x260/0x5e0
    | worker_thread+0x4d/0x3e0
    | kthread+0x108/0x140
    | ? process_one_work+0x5e0/0x5e0
    | ? kthread_park+0x90/0x90
    | ret_from_fork+0x3a/0x50

    Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200206111652.694507-2-jakub@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Jakub Sitnicki
     

11 Feb, 2020

2 commits

  • [ Upstream commit dfa7f709596be5ca46c070d4f8acbb344322056a ]

    Drop monitor uses a work item that takes care of constructing and
    sending netlink notifications to user space. If drop monitor never
    started monitoring, the work item is uninitialized and not
    associated with a function.

    Therefore, a stop command from user space results in canceling an
    uninitialized work item which leads to the following warning [1].

    Fix this by not processing a stop command if drop monitor is not
    currently monitoring.

    [1]
    [ 31.735402] ------------[ cut here ]------------
    [ 31.736470] WARNING: CPU: 0 PID: 143 at kernel/workqueue.c:3032 __flush_work+0x89f/0x9f0
    ...
    [ 31.738120] CPU: 0 PID: 143 Comm: dwdump Not tainted 5.5.0-custom-09491-g16d4077796b8 #727
    [ 31.741968] RIP: 0010:__flush_work+0x89f/0x9f0
    ...
    [ 31.760526] Call Trace:
    [ 31.771689] __cancel_work_timer+0x2a6/0x3b0
    [ 31.776809] net_dm_cmd_trace+0x300/0xef0
    [ 31.777549] genl_rcv_msg+0x5c6/0xd50
    [ 31.781005] netlink_rcv_skb+0x13b/0x3a0
    [ 31.784114] genl_rcv+0x29/0x40
    [ 31.784720] netlink_unicast+0x49f/0x6a0
    [ 31.787148] netlink_sendmsg+0x7cf/0xc80
    [ 31.790426] ____sys_sendmsg+0x620/0x770
    [ 31.793458] ___sys_sendmsg+0xfd/0x170
    [ 31.802216] __sys_sendmsg+0xdf/0x1a0
    [ 31.806195] do_syscall_64+0xa0/0x540
    [ 31.806885] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: 8e94c3bc922e ("drop_monitor: Allow user to start monitoring hardware drops")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
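A toy state machine for the fix (illustrative only; the real code and its exact error code live in net/core/drop_monitor.c): a stop request is honoured only while monitoring is active, so the possibly uninitialized work item is never cancelled.

```python
# Toy model: stop is a no-op error unless monitoring actually started,
# so cancel_work_sync() never runs on an uninitialized work item.
class DropMonitor:
    def __init__(self):
        self.monitoring = False
        self.work_initialized = False

    def start(self):
        self.work_initialized = True   # INIT_WORK() happens here
        self.monitoring = True
        return 0

    def stop(self):
        if not self.monitoring:
            return -1                  # reject: nothing to cancel
        assert self.work_initialized   # safe to cancel the work item now
        self.monitoring = False
        return 0
```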
     
  • [ Upstream commit d5b90e99e1d51b7b5d2b74fbc4c2db236a510913 ]

    commit fdd41ec21e15 ("devlink: Return right error code in case of errors
    for region read") modified the region read code to report errors
    properly in unexpected cases.

    In the case where the start_offset and ret_offset match, it unilaterally
    converted this into an error. This causes an issue for the "dump"
    version of the command. In this case, the devlink region dump will
    always report an invalid argument:

    000000000000ffd0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    000000000000ffe0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    devlink answers: Invalid argument
    000000000000fff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

    This occurs because the expected flow for the dump is to return 0 after
    there is no further data.

    The simplest fix would be to stop converting the error code to -EINVAL
    if start_offset == ret_offset. However, avoid unnecessary work by
    checking for when start_offset is larger than the region size and
    returning 0 upfront.

    Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read")
    Signed-off-by: Jacob Keller
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jacob Keller
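The fixed control flow can be sketched like this (a toy model, not the devlink source; -22 stands in for -EINVAL): a read starting at or past the end of the region means the dump is complete and returns 0, while an empty read mid-region remains an error.

```python
EINVAL = 22

def region_read(region_size, start_offset, bytes_read):
    # the fix: reads starting past the end of the region finish the
    # dump cleanly instead of reporting -EINVAL
    if start_offset >= region_size:
        return 0
    # mid-region, a read that produced nothing is still an error
    if bytes_read == 0:
        return -EINVAL
    return bytes_read
```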
     

06 Feb, 2020

2 commits

  • [ Upstream commit 59fb9b62fb6c929a756563152a89f39b07cf8893 ]

    This patch applies new flag (FLOW_DISSECTOR_KEY_PORTS_RANGE) and
    field (tp_range) to BPF flow dissector to generate appropriate flow
    keys when classified by specified port ranges.

    Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload")
    Signed-off-by: Yoshiki Komachi
    Signed-off-by: Daniel Borkmann
    Acked-by: Petar Penkov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200117070533.402240-2-komachi.yoshiki@gmail.com
    Signed-off-by: Sasha Levin

    Yoshiki Komachi
     
  • [ Upstream commit 189c9b1e94539b11c80636bc13e9cf47529e7bba ]

    skb->csum is updated incorrectly when NF_NAT_MANIP_SRC/DST
    manipulation is done on an IPv6 packet.

    Fix:
    There is no need to update skb->csum in inet_proto_csum_replace16(),
    because the updates to the two fields, a) the IPv6 src/dst address and
    b) the L4 header checksum, cancel each other out in the skb->csum
    calculation. By contrast, inet_proto_csum_replace4() does need to
    update skb->csum, because the updates to three fields, a) the IPv4
    src/dst address, b) the IPv4 header checksum and c) the L4 header
    checksum, result in the same diff as the L4 header checksum alone
    for the skb->csum calculation.

    [ pablo@netfilter.org: a few cosmetic documentation edits ]
    Signed-off-by: Praveen Chaudhary
    Signed-off-by: Zhenggen Xu
    Signed-off-by: Andy Stracner
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Praveen Chaudhary
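The cancellation can be demonstrated with 16-bit one's-complement arithmetic (a toy numeric model, not the kernel helpers): when NAT rewrites a field m to m' and compensates the L4 checksum per RFC 1624, the combined sum over both fields, which is what skb->csum covers, is unchanged.

```python
def oc_add(a, b):
    # 16-bit one's-complement addition with end-around carry
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def oc_not(a):
    return a ^ 0xFFFF

def update_check(hc, old, new):
    # RFC 1624 incremental update: HC' = ~(~HC + ~m + m')
    return oc_not(oc_add(oc_add(oc_not(hc), oc_not(old)), new))

old_addr, new_addr, l4_check = 0x1234, 0xABCD, 0x5678
new_check = update_check(l4_check, old_addr, new_addr)
# the sum over (address word, checksum word) is invariant, which is why
# skb->csum needs no update in the IPv6 case
assert oc_add(new_addr, new_check) == oc_add(old_addr, l4_check)
```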
     

29 Jan, 2020

4 commits

  • commit 58c8db929db1c1d785a6f5d8f8692e5dbcc35e84 upstream.

    As John Fastabend reports [0], psock state tear-down can happen on receive
    path *after* unlocking the socket, if the only other psock user, that is
    sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
    does so:

    tcp_bpf_recvmsg()
    psock = sk_psock_get(sk)
    Signed-off-by: Jakub Sitnicki
    Acked-by: John Fastabend
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jakub Sitnicki
     
  • [ Upstream commit c80794323e82ac6ab45052ebba5757ce47b4b588 ]

    Commit 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL
    skbs") introduces batching of GRO_NORMAL packets in napi_frags_finish,
    and commit 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
    napi_gro_receive()") adds the same to napi_skb_finish. However,
    dev_gro_receive (that is called just before napi_{frags,skb}_finish) can
    also pass skbs to the networking stack: e.g., when the GRO session is
    flushed, napi_gro_complete is called, which passes pp directly to
    netif_receive_skb_internal, skipping napi->rx_list. It means that the
    packet stored in pp will be handled by the stack earlier than the
    packets that arrived before, but are still waiting in napi->rx_list. It
    leads to TCP reorderings that can be observed in the TCPOFOQueue counter
    in netstat.

    This commit fixes the reordering issue by making napi_gro_complete also
    use napi->rx_list, so that all packets going through GRO will keep their
    order. In order to keep napi_gro_flush working properly, gro_normal_list
    calls are moved after the flush to clear napi->rx_list.

    iwlwifi calls napi_gro_flush directly and does the same thing that is
    done by gro_normal_list, so the same change is applied there:
    napi_gro_flush is moved to be before the flush of napi->rx_list.

    A few other drivers also use napi_gro_flush (brocade/bna/bnad.c,
    cortina/gemini.c, hisilicon/hns3/hns3_enet.c). The first two also use
    napi_complete_done afterwards, which performs the gro_normal_list flush,
    so they are fine. The latter calls napi_gro_receive right after
    napi_gro_flush, so it can end up with non-empty napi->rx_list anyway.

    Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs")
    Signed-off-by: Maxim Mikityanskiy
    Cc: Alexander Lobakin
    Cc: Edward Cree
    Acked-by: Alexander Lobakin
    Acked-by: Saeed Mahameed
    Acked-by: Edward Cree
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Maxim Mikityanskiy
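The reordering and the fix can be modeled with plain lists (a toy model; the real code paths are napi_gro_complete() and napi->rx_list): if a flushed GRO session bypasses the batch list, it overtakes earlier packets still waiting there, whereas routing everything through the list preserves arrival order.

```python
# Each arrival is (packet, flushes_gro_session).
def deliver_buggy(arrivals):
    rx_list, delivered = [], []
    for pkt, flushes_session in arrivals:
        if flushes_session:
            delivered.append(pkt)   # napi_gro_complete: straight to stack
        else:
            rx_list.append(pkt)     # batched on napi->rx_list
    delivered.extend(rx_list)       # rx_list flushed afterwards
    return delivered

def deliver_fixed(arrivals):
    # the fix: flushed sessions also go through napi->rx_list,
    # so everything keeps its arrival order
    return [pkt for pkt, _ in arrivals]
```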
     
  • [ Upstream commit cb626bf566eb4433318d35681286c494f04fedcc ]

    netdev_register_kobject() calls device_initialize(). In case of error,
    the reference taken by device_initialize() is not given up.

    Drivers are supposed to call free_netdev() in case of error. In the
    non-error case the last reference is given up there and the device
    release sequence is triggered. In the error case this reference is
    kept and the release sequence is never started.

    Fix this by setting reg_state to NETREG_UNREGISTERED if registering
    fails.

    This is the root cause of a couple of memory leaks reported by syzkaller:

    BUG: memory leak unreferenced object 0xffff8880675ca008 (size 256):
    comm "netdev_register", pid 281, jiffies 4294696663 (age 6.808s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmem_cache_alloc_trace+0x167/0x280
    [] device_add+0x882/0x1750
    [] netdev_register_kobject+0x128/0x380
    [] register_netdevice+0xa1b/0xf00
    [] __tun_chr_ioctl+0x20d5/0x3dd0
    [] tun_chr_ioctl+0x2f/0x40
    [] do_vfs_ioctl+0x1c7/0x1510
    [] ksys_ioctl+0x99/0xb0
    [] __x64_sys_ioctl+0x78/0xb0
    [] do_syscall_64+0x16f/0x580
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [] 0xffffffffffffffff

    BUG: memory leak
    unreferenced object 0xffff8880668ba588 (size 8):
    comm "kobject_set_nam", pid 286, jiffies 4294725297 (age 9.871s)
    hex dump (first 8 bytes):
    6e 72 30 00 cc be df 2b nr0....+
    backtrace:
    [] __kmalloc_track_caller+0x16e/0x290
    [] kstrdup+0x3e/0x70
    [] kstrdup_const+0x3e/0x50
    [] kvasprintf_const+0x10e/0x160
    [] kobject_set_name_vargs+0x5b/0x140
    [] dev_set_name+0xc0/0xf0
    [] netdev_register_kobject+0xc8/0x320
    [] register_netdevice+0xa1b/0xf00
    [] __tun_chr_ioctl+0x20d5/0x3dd0
    [] tun_chr_ioctl+0x2f/0x40
    [] do_vfs_ioctl+0x1c7/0x1510
    [] ksys_ioctl+0x99/0xb0
    [] __x64_sys_ioctl+0x78/0xb0
    [] do_syscall_64+0x16f/0x580
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [] 0xffffffffffffffff

    v3 -> v4:
    Set reg_state to NETREG_UNREGISTERED if registering fails

    v2 -> v3:
    * Replaced BUG_ON with WARN_ON in free_netdev and netdev_release

    v1 -> v2:
    * Relying on driver calling free_netdev rather than calling
    put_device directly in error path

    Reported-by: syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com
    Cc: David Miller
    Cc: Greg Kroah-Hartman
    Cc: Lukas Bulwahn
    Signed-off-by: Jouni Hogander
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jouni Hogander
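A toy refcount model of the fix (illustrative only; not the net/core code): device_initialize() takes a reference, and marking the device NETREG_UNREGISTERED on failure lets the driver's free_netdev() call drop that reference and run the release sequence instead of leaking it.

```python
class NetDev:
    def __init__(self):
        self.refcnt = 0
        self.reg_state = "NETREG_UNINITIALIZED"
        self.released = False

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.released = True       # device release sequence runs

def register_netdevice(dev, kobject_add_fails=False):
    dev.refcnt += 1                    # device_initialize() takes a ref
    if kobject_add_fails:
        dev.reg_state = "NETREG_UNREGISTERED"   # the fix
        return -1
    dev.reg_state = "NETREG_REGISTERED"
    return 0

def free_netdev(dev):
    # drivers call this on error; with reg_state set correctly the last
    # reference is given up here instead of being leaked
    if dev.reg_state == "NETREG_UNREGISTERED":
        dev.put()
```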
     
  • [ Upstream commit d836f5c69d87473ff65c06a6123e5b2cf5e56f5b ]

    rtnl_create_link() needs to apply the dev->min_mtu and dev->max_mtu
    checks that we apply in do_setlink().

    Otherwise malicious users can crash the kernel, for example after
    an integer overflow:

    BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline]
    BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
    Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0

    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
    __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
    kasan_report+0x12/0x20 mm/kasan/common.c:639
    check_memory_region_inline mm/kasan/generic.c:185 [inline]
    check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
    memset+0x24/0x40 mm/kasan/common.c:108
    memset include/linux/string.h:365 [inline]
    __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
    alloc_skb include/linux/skbuff.h:1049 [inline]
    alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664
    sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242
    sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259
    mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609
    add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713
    add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844
    mld_send_cr net/ipv6/mcast.c:1970 [inline]
    mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477
    call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
    expire_timers kernel/time/timer.c:1449 [inline]
    __run_timers kernel/time/timer.c:1773 [inline]
    __run_timers kernel/time/timer.c:1740 [inline]
    run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    invoke_softirq kernel/softirq.c:373 [inline]
    irq_exit+0x19b/0x1e0 kernel/softirq.c:413
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829

    RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
    Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79
    RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
    RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000
    RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54
    RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
    R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690
    default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
    do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
    cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
    rest_init+0x23b/0x371 init/main.c:451
    arch_call_rest_init+0xe/0x1b
    start_kernel+0x904/0x943 init/main.c:784
    x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
    x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471
    secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242

    The buggy address belongs to the page:
    page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
    raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000
    raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    >ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ^
    ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

    Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

26 Jan, 2020

2 commits

  • [ Upstream commit e0b60903b434a7ee21ba8d8659f207ed84101e89 ]

    dev_hold() always has to be called in netdev_queue_add_kobject().
    Otherwise the usage count drops below 0 in case of failure in
    kobject_init_and_add().

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Reported-by: Hulk Robot
    Cc: Tetsuo Handa
    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Jouni Hogander
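The ordering the fix enforces can be sketched as a toy refcount function (illustrative, not the net-sysfs code): the reference is taken unconditionally before kobject_init_and_add(), so the error path's put balances it instead of driving the count below zero.

```python
def queue_add_kobject(dev_refcnt, init_and_add_fails):
    """Return (new_refcnt, retval) for a toy netdev_queue_add_kobject()."""
    dev_refcnt += 1                 # dev_hold(), taken unconditionally
    if init_and_add_fails:
        dev_refcnt -= 1             # error path: kobject_put -> dev_put
        return dev_refcnt, -1       # count is back where it started, not < 0
    return dev_refcnt, 0
```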
     
  • [ Upstream commit 9d027e3a83f39b819e908e4e09084277a2e45e95 ]

    A difference of two unsigned long needs long storage.

    Fixes: c7fb64db001f ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
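Why the difference needs signed storage can be shown with 64-bit wraparound modeled in Python (a toy demonstration; Python ints are arbitrary-precision, so the masking emulates C's unsigned long):

```python
M = 1 << 64    # modulus of a 64-bit unsigned long

def unsigned_diff(a, b):
    # what "unsigned long - unsigned long" yields: a small negative
    # delta wraps to a huge positive number
    return (a - b) % M

def signed_diff(a, b):
    # storing the result in a (signed) long recovers the real delta
    d = unsigned_diff(a, b)
    return d - M if d >= (1 << 63) else d
```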
     

23 Jan, 2020

6 commits

  • commit 4c582234ab3948d08a24c82eb1e00436aabacbc6 upstream.

    The commit cited below causes devlink to emit a warning if a type was
    not set on a devlink port for longer than 30 seconds to "prevent
    misbehavior of drivers". This proved to be problematic when
    unregistering the backing netdev. The flow is always:

    devlink_port_type_clear() // schedules the warning
    unregister_netdev() // blocking
    devlink_port_unregister() // cancels the warning

    The call to unregister_netdev() can block for long periods of time for
    various reasons: RTNL lock is contended, large amounts of configuration
    to unroll following dismantle of the netdev, etc. This results in
    devlink emitting a warning despite the driver behaving correctly.

    In emulated environments (of future hardware) which are usually very
    slow, the warning can also be emitted during port creation as more than
    30 seconds can pass between the time the devlink port is registered and
    when its type is set.

    In addition, syzbot has hit this warning [1] 1974 times since 07/11/19
    without being able to produce a reproducer, probably because
    reproduction depends on load or on other bugs (e.g., RTNL not being
    released).

    To prevent bogus warnings, increase the timeout to 1 hour.

    [1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb

    Fixes: 136bf27fc0e9 ("devlink: add warning in case driver does not set port type")
    Signed-off-by: Ido Schimmel
    Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com
    Reported-by: Alex Veber
    Tested-by: Alex Veber
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 ]

    syzbot reported some bogus lockdep warnings, for example bad unlock
    balance in sch_direct_xmit(). They are due to a race condition between
    slow path and fast path, that is qdisc_xmit_lock_key gets re-registered
    in netdev_update_lockdep_key() on slow path, while we could still
    acquire the queue->_xmit_lock on fast path in this small window:

    CPU A                                           CPU B
                                                    __netif_tx_lock();
    lockdep_unregister_key(qdisc_xmit_lock_key);
                                                    __netif_tx_unlock();
    lockdep_register_key(qdisc_xmit_lock_key);

    In fact, unlike the addr_list_lock which has to be reordered when
    the master/slave device relationship changes, queue->_xmit_lock is
    only acquired on fast path and only when NETIF_F_LLTX is not set,
    so there is likely no nested locking for it.

    Therefore, we can just get rid of re-registration of
    qdisc_xmit_lock_key.

    Reported-by: syzbot+4ec99438ed7450da6272@syzkaller.appspotmail.com
    Fixes: ab92d68fc22f ("net: core: add generic lockdep keys")
    Cc: Taehee Yoo
    Signed-off-by: Cong Wang
    Acked-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • commit 2e012c74823629d9db27963c79caa3f5b2010746 upstream.

    It's possible to leak time wait and request sockets via the following
    BPF pseudo code:
     
    sk = bpf_skc_lookup_tcp(...)
    if (sk)
        bpf_sk_release(sk)

    If sk->sk_state is TCP_NEW_SYN_RECV or TCP_TIME_WAIT the refcount taken
    by bpf_skc_lookup_tcp is not undone by bpf_sk_release. This is because
    sk_flags is re-used for other data in both kinds of sockets. The check

    !sock_flag(sk, SOCK_RCU_FREE)

    therefore returns a bogus result. Check that sk_flags is valid by calling
    sk_fullsock. Skip checking SOCK_RCU_FREE if we already know that sk is
    not a full socket.
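    The fixed check can be modeled in userspace as follows (all names are
    stand-ins, not the kernel's types): SOCK_RCU_FREE is consulted only
    after a fullsock test confirms the flags field is meaningful, and the
    reference is dropped unconditionally for tw/req sockets.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_state { SK_FULL, SK_TIME_WAIT, SK_NEW_SYN_RECV };

struct sk_model {
    enum sk_state state;
    bool rcu_free;   /* garbage unless state == SK_FULL */
    int refcnt;
};

static bool fullsock(const struct sk_model *sk)
{
    return sk->state == SK_FULL;
}

/* Model of the fixed release path: the old code read rcu_free for every
 * socket, so a tw/req socket whose reused flags happened to look like
 * SOCK_RCU_FREE leaked its reference. */
static void sk_release_model(struct sk_model *sk)
{
    if (!fullsock(sk) || !sk->rcu_free)
        sk->refcnt--;
}
```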

    Fixes: edbf8c01de5a ("bpf: add skc_lookup_tcp helper")
    Fixes: f7355a6c0497 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200110132336.26099-1-lmb@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Lorenz Bauer
     
  • commit cf21e9ba1eb86c9333ca5b05b2f1cc94021bcaef upstream.

    Leaving an incorrect end mark in place when passing to the crypto
    layer will cause the crypto layer to stop processing data before
    all data is encrypted. To fix this, clear the end mark when pushing
    data instead of expecting users of the helper to clear the
    mark value after the fact.

    This happens when we push data into the middle of an skmsg and
    have room for it, so we don't do the set of copies that would
    already clear the end flag.

    Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     
  • commit 6562e29cf6f0ddd368657d97a8d484ffc30df5ef upstream.

    In the push, pull, and pop helpers that operate on skmsg objects to
    make data writable or to insert/remove data, we use this bounds check
    to ensure the specified data is valid:

    /* Bounds checks: start and pop must be inside message */
    if (start >= offset + l || last >= msg->sg.size)
        return -EINVAL;

    The problem here is that offset already includes the length of the
    current element (the 'l' above), so start can be past the end of
    the scatterlist element in the case where start also points into an
    offset on the last skmsg element.

    To fix this, do the accounting slightly differently: add the length
    of the previous entry to offset at the start of each iteration, and
    ensure it is initialized to zero so that the first iteration adds
    nothing.
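    The corrected accounting can be sketched as a standalone loop
    (`find_elem` and its parameters are invented for this illustration):
    offset is advanced by the previous element's length at the top of the
    loop, so the bound "start < offset + len" is tested against the
    element currently being examined.

```c
#include <assert.h>

/* Walk a list of element lengths and return the index of the element
 * containing byte offset 'start', or -1 if start is past the end.
 * offset accumulates the lengths of all PREVIOUS elements; the first
 * iteration adds zero. */
static int find_elem(const unsigned int *lens, int n, unsigned int start)
{
    unsigned int offset = 0, prev_len = 0;

    for (int i = 0; i < n; i++) {
        offset += prev_len;           /* length of the previous entry */
        prev_len = lens[i];
        if (start < offset + lens[i])
            return i;                 /* start falls inside element i */
    }
    return -1;
}
```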

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
    Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     
  • commit 7e81a35302066c5a00b4c72d83e3ea4cad6eeb5b upstream.

    The sock_map_free() and sock_hash_free() paths, used to delete sockmap
    and sockhash maps, walk the maps and destroy the psock and bpf state
    associated with the socks in the map. When done, the socks no longer
    have BPF programs attached and function normally. This can happen while
    the socks in the map are still "live", meaning data may be
    sent/received during the walk.

    Currently, though, we don't take the sock_lock when the psock and bpf
    state is removed through this path. Specifically, this means we can be
    writing into the ops structure pointers such as sendmsg, sendpage,
    recvmsg, etc. while they are also being called from the networking
    side. This is not safe; we never used proper READ_ONCE/WRITE_ONCE
    semantics here even if we believed it was safe. Further, it's not clear
    to me it's even a good idea to try to do this on "live" sockets while
    the networking side might also be using the socket. Instead of trying
    to reason about using the socks from both sides, let's realize that
    every use case I'm aware of rarely deletes maps; in fact, the
    kubernetes/Cilium case builds the map at init and never tears it down
    except on errors. So let's do the simple fix and grab the sock lock.

    This patch wraps sock deletes from maps in sock lock and adds some
    annotations so we catch any other cases easier.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     

18 Jan, 2020

1 commit

  • commit 8163999db445021f2651a8a47b5632483e8722ea upstream.

    Report from Dan Carpenter:

    net/core/skmsg.c:792 sk_psock_write_space()
    error: we previously assumed 'psock' could be null (see line 790)

    net/core/skmsg.c
      789  psock = sk_psock(sk);
      790  if (likely(psock && sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)))
                      ^^^ Check for NULL
      791          schedule_work(&psock->work);
      792  write_space = psock->saved_write_space;
                         ^^^^^^^^^^^^^^^^^^^^^^^^
      793  rcu_read_unlock();
      794  write_space(sk);

    Ensure psock dereference on line 792 only occurs if psock is not null.
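    The shape of such a fix can be modeled in userspace (the types and
    names below are invented stand-ins, not the kernel code): the
    dereference is moved inside the non-NULL branch, and the saved
    callback is only invoked if it was actually fetched.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*write_space_fn)(int *calls);

struct psock_model {
    write_space_fn saved_write_space;
};

static void count_call(int *calls) { (*calls)++; }

/* Model of the guarded path: psock->saved_write_space is only read
 * when psock is non-NULL, so a NULL psock can never be dereferenced. */
static void write_space_model(struct psock_model *psock, int *calls)
{
    write_space_fn ws = NULL;

    if (psock)
        ws = psock->saved_write_space;

    if (ws)
        ws(calls);
}
```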

    Reported-by: Dan Carpenter
    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     

12 Jan, 2020

1 commit

  • [ Upstream commit 5133498f4ad1123a5ffd4c08df6431dab882cc32 ]

    Redirecting a packet from ingress to egress by using bpf_redirect
    breaks if the egress interface has an fq qdisc installed. This is the
    same problem as was fixed in commit 8203e2d844d3 ("net: clear
    skb->tstamp in forwarding paths").

    Clear skb->tstamp when redirecting into the egress path.

    Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/bpf/20191213180817.2510-1-lmb@cloudflare.com
    Signed-off-by: Sasha Levin

    Lorenz Bauer
     

10 Jan, 2020

1 commit

  • The page pool keeps track of the number of pages in flight, and
    it isn't safe to remove the pool until all pages are returned.

    Disallow removing the pool until all pages are back, so the pool
    is always available for page producers.

    Make the page pool responsible for its own delayed destruction
    instead of relying on XDP, so the page pool can be used without
    the xdp memory model.

    When all pages are returned, free the pool and notify xdp if the
    pool is registered with the xdp memory system. Have the callback
    perform a table walk since some drivers (cpsw) may share the pool
    among multiple xdp_rxq_info.

    Note that the increment of pages_state_release_cnt may result in
    inflight == 0, resulting in the pool being released.
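    The lifetime rule described above can be modeled in a few lines of
    userspace C (all names are stand-ins for the real page_pool API):
    destruction may be requested at any time, but the pool is only freed
    once inflight (allocated minus returned) reaches zero, and the last
    returned page is what triggers the free.

```c
#include <assert.h>
#include <stdbool.h>

struct pool_model {
    int alloc_cnt;          /* pages handed out */
    int release_cnt;        /* pages returned */
    bool destroy_requested;
    bool freed;
};

static int pool_inflight(const struct pool_model *p)
{
    return p->alloc_cnt - p->release_cnt;
}

static void pool_maybe_free(struct pool_model *p)
{
    if (p->destroy_requested && pool_inflight(p) == 0)
        p->freed = true;
}

/* Destruction is deferred: only free now if nothing is in flight. */
static void pool_destroy(struct pool_model *p)
{
    p->destroy_requested = true;
    pool_maybe_free(p);
}

/* Returning a page may drop inflight to zero and release the pool. */
static void page_returned(struct pool_model *p)
{
    p->release_cnt++;
    pool_maybe_free(p);
}
```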

    Fixes: d956a048cd3f ("xdp: force mem allocator removal and periodic warning")
    Signed-off-by: Jonathan Lemon
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Ilias Apalodimas
    Signed-off-by: David S. Miller

    Jonathan Lemon
     

09 Jan, 2020

4 commits

  • [ Upstream commit 7c68fa2bddda6d942bd387c9ba5b4300737fd991 ]

    sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could otherwise remove the conditional check and always
    write over sk->sk_pacing_shift:

    if (sk->sk_pacing_shift != val)
        sk->sk_pacing_shift = val;
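    The annotated form can be sketched with userspace stand-ins for the
    kernel's READ_ONCE/WRITE_ONCE macros (an assumption of this sketch;
    the struct and function names below are invented). The conditional
    still avoids dirtying the cache line when the value is unchanged, and
    the annotations stop the compiler from turning it back into an
    unconditional store.

```c
#include <assert.h>

/* Minimal volatile-cast stand-ins for the kernel macros. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct sock_model {
    int sk_pacing_shift;
};

static void sk_pacing_shift_update_model(struct sock_model *sk, int val)
{
    /* Only store when the value actually changes. */
    if (READ_ONCE(sk->sk_pacing_shift) != val)
        WRITE_ONCE(sk->sk_pacing_shift, val);
}
```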

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit c305c6ae79e2ce20c22660ceda94f0d86d639a82 ]

    KCSAN reported a data-race [1]

    While we can use READ_ONCE() on the read sides,
    we need to make sure hh->hh_len is written last.

    [1]

    BUG: KCSAN: data-race in eth_header_cache / neigh_resolve_output

    write to 0xffff8880b9dedcb8 of 4 bytes by task 29760 on cpu 0:
    eth_header_cache+0xa9/0xd0 net/ethernet/eth.c:247
    neigh_hh_init net/core/neighbour.c:1463 [inline]
    neigh_resolve_output net/core/neighbour.c:1480 [inline]
    neigh_resolve_output+0x415/0x470 net/core/neighbour.c:1470
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
    ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
    rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
    process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
    worker_thread+0xa0/0x800 kernel/workqueue.c:2415
    kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

    read to 0xffff8880b9dedcb8 of 4 bytes by task 29572 on cpu 1:
    neigh_resolve_output net/core/neighbour.c:1479 [inline]
    neigh_resolve_output+0x113/0x470 net/core/neighbour.c:1470
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
    ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
    rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
    process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
    worker_thread+0xa0/0x800 kernel/workqueue.c:2415
    kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 29572 Comm: kworker/1:4 Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Workqueue: events rt6_probe_deferred

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • commit 1148f9adbe71415836a18a36c1b4ece999ab0973 upstream.

    proc_dointvec_minmax_bpf_restricted() was first introduced
    in commit 2e4a30983b0f ("bpf: restrict access to core bpf sysctls")
    under CONFIG_HAVE_EBPF_JIT. This ifdef was then removed in
    ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv
    allocations") because a new sysctl, bpf_jit_limit, made use of it.
    Finally, that parameter became a long instead of an integer with
    fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K"),
    and thus a new proc_dolongvec_minmax_bpf_restricted() was added.

    With this last change, proc_dointvec_minmax_bpf_restricted() is
    again used only under CONFIG_HAVE_EBPF_JIT, but the corresponding
    ifdef was not brought back.

    So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n,
    since v4.20 we have:

    CC net/core/sysctl_net_core.o
    net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
    292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.
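    The pattern is simply to compile the helper under the same
    preprocessor condition as its only users, so -Wunused-function cannot
    fire when the condition is off. A self-contained illustration (the
    macro and function names here are stand-ins, not the kernel code):

```c
#include <assert.h>

/* Stand-in for the Kconfig symbol; in the kernel this comes from the
 * build configuration rather than being defined in the source. */
#define CONFIG_HAVE_EBPF_JIT 1

#ifdef CONFIG_HAVE_EBPF_JIT
/* Only defined when its callers exist, so it is never "unused". */
static int restricted_handler(int write) { return write ? -1 : 0; }
#endif

int use_handler(void)
{
#ifdef CONFIG_HAVE_EBPF_JIT
    return restricted_handler(0);
#else
    return 0;
#endif
}
```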

    Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
    Signed-off-by: Alexander Lobakin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
    Signed-off-by: Greg Kroah-Hartman

    Alexander Lobakin
     
  • commit 90b2be27bb0e56483f335cc10fb59ec66882b949 upstream.

    KCSAN reported the following race [1]

    BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action

    read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1:
    __dev_xmit_skb net/core/dev.c:3389 [inline]
    __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_hh_output include/net/neighbour.h:500 [inline]
    neigh_output include/net/neighbour.h:509 [inline]
    ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0:
    qdisc_run_begin include/net/sch_generic.h:160 [inline]
    qdisc_run include/net/pkt_sched.h:120 [inline]
    net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
    do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
    do_softirq kernel/softirq.c:329 [inline]
    __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
    local_bh_enable include/linux/bottom_half.h:32 [inline]
    rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
    ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Paolo Abeni
    Cc: Davide Caratti
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

31 Dec, 2019

2 commits

  • [ Upstream commit f394722fb0d0f701119368959d7cd0ecbc46363a ]

    neigh_cleanup() has not been used for seven years, and was a flawed
    design.

    Messing with a shared pointer in bond_neigh_init() without proper
    memory barriers would eventually trigger syzbot complaints.

    It is time to remove this stuff.

    Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ddd9b5e3e765d8ed5a35786a6cb00111713fe161 ]

    dev_hold() always has to be called in rx_queue_add_kobject();
    otherwise the usage count drops below 0 when
    kobject_init_and_add() fails.

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Reported-by: syzbot
    Cc: Tetsuo Handa
    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: Jouni Hogander
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jouni Hogander
     

18 Dec, 2019

4 commits

  • [ Upstream commit 86c76c09898332143be365c702cf8d586ed4ed21 ]

    A lockdep splat was observed when trying to remove an xdp memory
    model from the table since the mutex was obtained when trying to
    remove the entry, but not before the table walk started:

    Fix the splat by obtaining the lock before starting the table walk.

    Fixes: c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.")
    Reported-by: Grygorii Strashko
    Signed-off-by: Jonathan Lemon
    Tested-by: Grygorii Strashko
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Ilias Apalodimas
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jonathan Lemon
     
  • [ Upstream commit c3f812cea0d7006469d1cf33a4a9f0a12bb4b3a3 ]

    The page pool keeps track of the number of pages in flight, and
    it isn't safe to remove the pool until all pages are returned.

    Disallow removing the pool until all pages are back, so the pool
    is always available for page producers.

    Make the page pool responsible for its own delayed destruction
    instead of relying on XDP, so the page pool can be used without
    the xdp memory model.

    When all pages are returned, free the pool and notify xdp if the
    pool is registered with the xdp memory system. Have the callback
    perform a table walk since some drivers (cpsw) may share the pool
    among multiple xdp_rxq_info.

    Note that the increment of pages_state_release_cnt may result in
    inflight == 0, resulting in the pool being released.

    Fixes: d956a048cd3f ("xdp: force mem allocator removal and periodic warning")
    Signed-off-by: Jonathan Lemon
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Ilias Apalodimas
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jonathan Lemon
     
  • [ Upstream commit d04ac224b1688f005a84f764cfe29844f8e9da08 ]

    skb_mpls_push() was not updating the ethertype of an Ethernet packet
    if the packet was originally received from a non-ARPHRD_ETHER device.

    In the OVS data path flow below, since the device corresponding to
    port 7 is an L3 device (ARPHRD_NONE), skb_mpls_push() does
    not update the ethertype of the packet even though the previous
    push_eth action had added an Ethernet header to the packet.

    recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no),
    actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
    push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4

    Fixes: 8822e270d697 ("net: core: move push MPLS functionality from OvS to core helper")
    Signed-off-by: Martin Varghese
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Martin Varghese
     
  • [ Upstream commit 040b5cfbcefa263ccf2c118c4938308606bb7ed8 ]

    skb_mpls_pop() was not updating the ethertype of an Ethernet packet
    if the packet was originally received from a non-ARPHRD_ETHER device.

    In the OVS data path flow below, since the device corresponding to
    port 7 is an L3 device (ARPHRD_NONE), skb_mpls_pop() does not update
    the ethertype of the packet even though the previous push_eth action
    had added an Ethernet header to the packet.

    recirc_id(0),in_port(7),eth_type(0x8847),
    mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
    actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
    pop_mpls(eth_type=0x800),4

    Fixes: ed246cee09b9 ("net: core: move pop MPLS functionality from OvS to core helper")
    Signed-off-by: Martin Varghese
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Martin Varghese