08 Jan, 2021

6 commits

  • The function nh_check_attr_group() is called to validate nexthop groups.
    The intention of that code seems to have been to bounce all attributes
    above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
    these attributes except when NHA_FDB attribute is present--then it accepts
    them.

    NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
    bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
    back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

    But that still leaves NHA_GATEWAY as an attribute that would be accepted in
    FDB nexthop groups (with no meaning), so long as it keeps the address
    family as unspecified:

    # ip nexthop add id 1 fdb via 127.0.0.1
    # ip nexthop add id 10 fdb via default group 1

    The nexthop code is still relatively new and likely not used very broadly,
    and the FDB bits are newer still. Even though there is a reproducer out
    there, it relies on improbable gateway arguments such as "via default",
    "via all" or "via any". Given all this, I believe it is OK to reformulate
    the condition to do the right thing and bounce NHA_GATEWAY.
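
    The difference between the intended and the actual condition can be
    sketched in user space roughly as follows. The attribute enum and both
    function names are hypothetical stand-ins, not the kernel's definitions;
    only the shape of the loop is meant to match the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute numbering for illustration only; the real values
 * live in the kernel's nexthop UAPI header and are not reproduced here. */
enum {
	ATTR_UNSPEC,
	ATTR_ID,
	ATTR_GROUP,
	ATTR_GROUP_TYPE,
	ATTR_BLACKHOLE,
	ATTR_OIF,
	ATTR_GATEWAY,
	ATTR_FDB,
	ATTR_MAX_PLUS_ONE
};

/* tb[i] is non-NULL when attribute i was present in the request. */
int check_group_attrs_old(const void *const *tb)
{
	assert(tb != NULL);
	for (int i = ATTR_GROUP_TYPE + 1; i < ATTR_MAX_PLUS_ONE; i++) {
		if (!tb[i])
			continue;
		if (tb[ATTR_FDB])	/* bug: presence of FDB excuses every attribute */
			continue;
		return -EINVAL;
	}
	return 0;
}

int check_group_attrs_new(const void *const *tb)
{
	assert(tb != NULL);
	for (int i = ATTR_GROUP_TYPE + 1; i < ATTR_MAX_PLUS_ONE; i++) {
		if (!tb[i])
			continue;
		if (i == ATTR_FDB)	/* only the FDB attribute itself is tolerated */
			continue;
		return -EINVAL;
	}
	return 0;
}
```

    With both the FDB and the gateway slot populated, the old check accepts
    the request while the reformulated one returns -EINVAL.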

    Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Petr Machata
     
  • In case of error, remove the nexthop group entry from the list to which
    it was previously added.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Petr Machata
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Ido Schimmel
     
  • A reference was not taken for the current nexthop entry, so do not try
    to put it in the error path.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Petr Machata
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Ido Schimmel
     
  • Conntrack reassembly records the largest fragment size seen in IPCB.
    However, when this gets forwarded/transmitted, fragmentation will only
    be forced if one of the fragmented packets had the DF bit set.

    In that case, a flag in IPCB will force fragmentation even if the
    MTU is large enough.

    This should work fine, but it breaks with ip tunnels.
    Consider a client that sends a UDP datagram of size X to another host.

    The client fragments the datagram, so two packets, of size y and z, are
    sent. DF bit is not set on any of these packets.

    Middlebox netfilter reassembles those packets back to single size-X
    packet, before routing decision.

    The packet-size-vs-mtu checks in ip_forward are irrelevant, because the DF
    bit isn't set. At output time, ip refragmentation is skipped as well
    because X is still smaller than the mtu of the output device.

    If the transmit device is an ip tunnel, the packet size increases to
    X+overhead.

    Also, tunnel might be configured to force DF bit on outer header.

    In this case, packet will be dropped (exceeds MTU) and an ICMP error is
    generated back to sender.

    But the sender already respects the announced MTU: all the packets that
    it sent did fit the announced mtu.

    Force refragmentation as per original sizes unconditionally so ip tunnel
    will encapsulate the fragments instead.

    The only other solution I see is to place ip refragmentation in
    the ip_tunnel code to handle this case.
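
    As a rough model of the decision described above, assuming a simplified
    packet record (the struct and both function names are hypothetical and
    not the kernel's code): before the change, a reassembled packet is only
    refragmented when it exceeds the MTU or when a fragment carried DF;
    afterwards, any packet that arrived fragmented is refragmented.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the per-packet state kept after reassembly. */
struct pkt {
	unsigned int len;           /* size after reassembly */
	unsigned int frag_max_size; /* largest original fragment, 0 if never fragmented */
	bool df_seen;               /* at least one original fragment had DF set */
};

/* Old rule: refragment only if too big, or if DF forces the original sizes. */
bool must_refragment_old(const struct pkt *p, unsigned int mtu)
{
	assert(p && mtu > 0);
	return p->len > mtu || (p->frag_max_size && p->df_seen);
}

/* New rule: anything that was reassembled is split again before output. */
bool must_refragment_new(const struct pkt *p, unsigned int mtu)
{
	assert(p && mtu > 0);
	return p->len > mtu || p->frag_max_size != 0;
}
```

    For a 2000-byte datagram that arrived as fragments of at most 1500 bytes
    without DF, and an output MTU of 9000, the old rule forwards one large
    packet into the tunnel while the new rule hands it the fragments.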

    Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet")
    Reported-by: Christian Perle
    Signed-off-by: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     
  • For some reason ip_tunnel insists on setting the DF bit anyway when the
    inner header has the DF bit set, EVEN if the tunnel was configured with
    'nopmtudisc'.

    This means that the script added in the previous commit
    cannot be made to work by adding the 'nopmtudisc' flag to the
    ip tunnel configuration. Doing so breaks connectivity even for the
    without-conntrack/netfilter scenario.

    When nopmtudisc is set, the tunnel will skip the mtu check, so no
    icmp error is sent to client. Then, because inner header has DF set,
    the outer header gets added with DF bit set as well.

    IP stack then sends an error to itself because the packet exceeds
    the device MTU.

    Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.")
    Cc: Stefano Brivio
    Signed-off-by: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     
  • Route removal is handled by two code paths. The main removal path is via
    fib6_del_route() which will handle purging any PMTU exceptions from the
    cache, removing all per-cpu copies of the DST entry used by the route, and
    releasing the fib6_info struct.

    The second removal location is during fib6_add_rt2node() during a route
    replacement operation. This path also calls fib6_purge_rt() to handle
    cleaning up the per-cpu copies of the DST entries and releasing the
    fib6_info associated with the older route, but it does not flush any PMTU
    exceptions that the older route had. Since the older route is removed from
    the tree during the replacement, we lose any way of accessing it again.

    As these lingering DSTs and the fib6_info struct are holding references to
    the underlying netdevice struct as well, unregistering that device from the
    kernel can never complete.

    Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache")
    Signed-off-by: Sean Tranchetti
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
    Signed-off-by: Jakub Kicinski

    Sean Tranchetti
     

06 Jan, 2021

4 commits

  • A null-ptr-deref bug is reported by Hulk Robot like this:
    --------------
    KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f]
    Call Trace:
    qrtr_ns_remove+0x22/0x40 [ns]
    qrtr_proto_fini+0xa/0x31 [qrtr]
    __x64_sys_delete_module+0x337/0x4e0
    do_syscall_64+0x34/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x468ded
    --------------

    When qrtr_ns_init fails in qrtr_proto_init, the later call to
    qrtr_ns_remove raises a null-ptr-deref because qrtr_ns.workqueue
    has been destroyed.

    Fix it by making qrtr_ns_init have a return value and adding a check in
    qrtr_proto_init.

    Reported-by: Hulk Robot
    Signed-off-by: Qinglang Miao
    Signed-off-by: David S. Miller

    Qinglang Miao
     
  • VLAN checks for NETREG_UNINITIALIZED to distinguish between
    registration failure and unregistration in progress.

    Since commit cb626bf566eb ("net-sysfs: Fix reference count leak")
    registration failure may, however, result in NETREG_UNREGISTERED
    as well as NETREG_UNINITIALIZED.

    This fix is similar to cebb69754f37 ("rtnetlink: Fix
    memory(net_device) leak when ->newlink fails").

    Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Without crc32 support, this fails to link:

    arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
    scan.c:(.text+0x928): undefined reference to `crc32_le'

    Fixes: c8cb5b854b40 ("nl80211/cfg80211: support 6 GHz scanning")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • Pull networking fixes from Jakub Kicinski:
    "Networking fixes, including fixes from netfilter, wireless and bpf
    trees.

    Current release - regressions:

    - mt76: fix NULL pointer dereference in mt76u_status_worker and
    mt76s_process_tx_queue

    - net: ipa: fix interconnect enable bug

    Current release - always broken:

    - netfilter: fixes possible oops in mtype_resize in ipset

    - ath11k: fix number of coding issues found by static analysis tools
    and spurious error messages

    Previous releases - regressions:

    - e1000e: re-enable s0ix power saving flows for systems with the
    Intel i219-LM Ethernet controllers to fix power use regression

    - virtio_net: fix recursive call to cpus_read_lock() to avoid a
    deadlock

    - ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()

    - sysfs: take the rtnl lock around XPS configuration

    - xsk: fix memory leak for failed bind and rollback reservation at
    NETDEV_TX_BUSY

    - r8169: work around power-saving bug on some chip versions

    Previous releases - always broken:

    - dcb: validate netlink message in DCB handler

    - tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
    to prevent unnecessary retries

    - vhost_net: fix ubuf refcount when sendmsg fails

    - bpf: save correct stopping point in file seq iteration

    - ncsi: use real net-device for response handler

    - neighbor: fix div by zero caused by a data race (TOCTOU)

    - bareudp: fix use of incorrect min_headroom size and a false
    positive lockdep splat from the TX lock

    - mvpp2:
       - clear force link UP during port init procedure in case
         bootloader had set it
       - add TCAM entry to drop flow control pause frames
       - fix PPPoE with ipv6 packet parsing
       - fix GoP Networking Complex Control config of port 3
       - fix pkt coalescing IRQ-threshold configuration

    - xsk: fix race in SKB mode transmit with shared cq

    - ionic: account for vlan tag len in rx buffer len

    - stmmac: ignore the second clock input, current clock framework does
    not handle exclusive clock use well, other drivers may reconfigure
    the second clock

    Misc:

    - ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
    existing scheme"

    * tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
    net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
    net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
    net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
    r8169: work around power-saving bug on some chip versions
    net: usb: qmi_wwan: add Quectel EM160R-GL
    selftests: mlxsw: Set headroom size of correct port
    net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
    ibmvnic: fix: NULL pointer dereference.
    docs: networking: packet_mmap: fix old config reference
    docs: networking: packet_mmap: fix formatting for C macros
    vhost_net: fix ubuf refcount incorrectly when sendmsg fails
    bareudp: Fix use of incorrect min_headroom size
    bareudp: set NETIF_F_LLTX flag
    net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
    atlantic: remove architecture depends
    erspan: fix version 1 check in gre_parse_header()
    net: hns: fix return value check in __lb_other_process()
    net: sched: prevent invalid Scell_log shift count
    net: neighbor: fix a crash caused by mod zero
    ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
    ...

    Linus Torvalds
     

05 Jan, 2021

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Missing sanitization of rateest userspace string, bug has been
    triggered by syzbot, patch from Florian Westphal.

    2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise
    error reporting to userspace via EINVAL is misleading since this is
    reserved for malformed netlink requests.

    3) New binaries with old kernels might silently accept several set
    element expressions. New binaries set the NFT_SET_EXPR and
    NFT_DYNSET_F_EXPR flags to request several expressions per
    element, hence old kernels which do not support this bail out
    with EOPNOTSUPP.

    * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
    netfilter: nftables: add set expression flags
    netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
    netfilter: xt_RATEEST: reject non-null terminated string from userspace
    ====================

    Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • In lapb_device_event, lapb_devtostruct is called to get a reference to
    an object of "struct lapb_cb". lapb_devtostruct increases the refcount
    of the object and returns a pointer to it. However, we didn't decrease
    the refcount after we finished using the pointer. This patch fixes this
    problem.

    Fixes: a4989fa91110 ("net/lapb: support netdev events")
    Cc: Martin Schiller
    Signed-off-by: Xie He
    Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com
    Signed-off-by: Jakub Kicinski

    Xie He
     

29 Dec, 2020

12 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-12-28

    The following pull-request contains BPF updates for your *net* tree.

    There is a small merge conflict between bpf tree commit 69ca310f3416
    ("bpf: Save correct stopping point in file seq iteration") and net tree
    commit 66ed594409a1 ("bpf/task_iter: In task_file_seq_get_next use
    task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
    in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
    the error path:

    [...]
            curr_task = task_seq_get_next(ns, &curr_tid, true);
            if (!curr_task) {
                    info->task = NULL;
                    info->tid = curr_tid;
                    return NULL;
            }

            /* set info->task and info->tid */
    [...]

    We've added 10 non-merge commits during the last 9 day(s) which contain
    a total of 11 files changed, 75 insertions(+), 20 deletions(-).

    The main changes are:

    1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
    fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.

    2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
    point to BPF hashtab initialization, from Eric Dumazet.

    3) Fix a splat in task iterator by saving the correct stopping point in the
    seq file iteration, from Jonathan Lemon.

    4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
    errors on update/deletes, from Andrii Nakryiko.

    5) Fix BPF selftest error reporting to something more user friendly if the
    vmlinux BTF cannot be found, from Kamal Mostafa.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
    have an erspan header. So the check in gre_parse_header() is wrong,
    we have to distinguish version 1 from version 0.

    We can just check the gre header length like is_erspan_type1().

    Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
    Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
    Cc: William Tu
    Cc: Lorenzo Bianconi
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Check Scell_log shift size in red_check_params() and modify all callers
    of red_check_params() to pass Scell_log.

    This prevents a shift out-of-bounds as detected by UBSAN:
    UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
    shift exponent 72 is too large for 32-bit type 'int'
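
    A minimal sketch of this kind of check, assuming the shift operand is 32
    bits wide as in the UBSAN report. The function names are hypothetical and
    the consumer is only an illustration of why the bound matters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reject shift counts that would be undefined for a 32-bit operand. */
bool scell_log_valid(uint8_t scell_log)
{
	return scell_log < 32;
}

/* Hypothetical consumer: shifts only after the parameter was validated. */
uint32_t scaled_idle_time(uint32_t idle, uint8_t scell_log)
{
	assert(scell_log_valid(scell_log));
	return idle >> scell_log;
}
```

    The value 72 from the report fails the check, so the shift is never
    evaluated with it.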

    Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values")
    Signed-off-by: Randy Dunlap
    Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
    Cc: Nogah Frankel
    Cc: Jamal Hadi Salim
    Cc: Cong Wang
    Cc: Jiri Pirko
    Cc: netdev@vger.kernel.org
    Cc: "David S. Miller"
    Cc: Jakub Kicinski
    Signed-off-by: David S. Miller

    Randy Dunlap
     
  • pneigh_enqueue() tries to obtain a random delay by taking a random
    value modulo NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
    might be zero at that point because someone could write zero
    to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
    callers check it.

    This patch uses prandom_u32_max() to get a random delay instead
    which avoids potential division by zero.
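
    The idea can be illustrated with a multiply-and-shift bound, which is how
    I understand helpers of this kind to work; the functions below are a
    user-space sketch with hypothetical names, not the kernel helper itself.

```c
#include <assert.h>
#include <stdint.h>

/* Racy pattern: the caller checked delay != 0 earlier, but the value may
 * have been rewritten since, so the modulo can still divide by zero. */
uint32_t modulo_delay(uint32_t rnd, uint32_t delay)
{
	assert(delay != 0);	/* stands in for the crash */
	return rnd % delay;
}

/* Scale a 32-bit random value into [0, ceil) with a multiply and a shift.
 * No division is involved, so ceil == 0 simply yields 0. */
uint32_t bounded_random(uint32_t rnd, uint32_t ceil)
{
	return (uint32_t)(((uint64_t)rnd * ceil) >> 32);
}
```

    With a ceiling of zero the second variant returns zero instead of
    faulting, which makes the earlier check on the sysctl value harmless even
    if it races with a write.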

    Signed-off-by: weichenchen
    Signed-off-by: David S. Miller

    weichenchen
     
  • RT_TOS() only clears one of the ECN bits. Therefore, when
    fib_compute_spec_dst() resorts to a fib lookup, it can return
    different results depending on the value of the second ECN bit.

    For example, ECT(0) and ECT(1) packets could be treated differently.

    $ ip netns add ns0
    $ ip netns add ns1
    $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
    $ ip -netns ns0 link set dev lo up
    $ ip -netns ns1 link set dev lo up
    $ ip -netns ns0 link set dev veth01 up
    $ ip -netns ns1 link set dev veth10 up

    $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
    $ ip -netns ns1 address add 192.0.2.11/24 dev veth10

    $ ip -netns ns1 address add 192.0.2.21/32 dev lo
    $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
    $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0

    With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
    (ping uses -Q to set all TOS and ECN bits):

    $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
    [...]
    64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms

    But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
    because the "tos 4" route isn't matched:

    $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
    [...]
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms

    After this patch the ECN bits don't affect the result anymore:

    $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
    [...]
    64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
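
    The arithmetic behind the two ping results can be checked with a small
    sketch. The two mask values are what I recall from the IPv4 headers and
    should be treated as an assumption; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values: the first keeps one ECN bit, the second clears both. */
#define TOS_MASK_ONE_ECN_BIT 0x1E
#define TOS_MASK_NO_ECN      0x1C

/* Lookup key before the patch: one ECN bit leaks into the key. */
uint8_t lookup_tos_old(uint8_t tos)
{
	return tos & TOS_MASK_ONE_ECN_BIT;
}

/* Lookup key after the patch: both ECN bits are ignored. */
uint8_t lookup_tos_new(uint8_t tos)
{
	return tos & TOS_MASK_NO_ECN;
}

/* A route's TOS selector never carries ECN bits; compare it to the key. */
int route_matches(uint8_t route_tos, uint8_t key)
{
	assert((route_tos & 0x03) == 0);
	return route_tos == key;
}
```

    Under these assumptions, -Q 5 yields key 4 either way and matches the
    "tos 4" route, while -Q 6 yields key 6 with the old mask (no match) and
    key 4 with the new one.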

    Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.")
    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • the following syzkaller reproducer:

    r0 = socket$inet_mptcp(0x2, 0x1, 0x106)
    bind$inet(r0, &(0x7f0000000080)={0x2, 0x4e24, @multicast2}, 0x10)
    connect$inet(r0, &(0x7f0000000480)={0x2, 0x4e24, @local}, 0x10)
    sendto$inet(r0, &(0x7f0000000100)="f6", 0xffffffe7, 0xc000, 0x0, 0x0)

    systematically triggers the following warning:

    WARNING: CPU: 2 PID: 8618 at net/core/stream.c:208 sk_stream_kill_queues+0x3fa/0x580
    Modules linked in:
    CPU: 2 PID: 8618 Comm: syz-executor Not tainted 5.10.0+ #334
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/04
    RIP: 0010:sk_stream_kill_queues+0x3fa/0x580
    Code: df 48 c1 ea 03 0f b6 04 02 84 c0 74 04 3c 03 7e 40 8b ab 20 02 00 00 e9 64 ff ff ff e8 df f0 81 2
    RSP: 0018:ffffc9000290fcb0 EFLAGS: 00010293
    RAX: ffff888011cb8000 RBX: 0000000000000000 RCX: ffffffff86eecf0e
    RDX: 0000000000000000 RSI: ffffffff86eecf6a RDI: 0000000000000005
    RBP: 0000000000000e28 R08: ffff888011cb8000 R09: fffffbfff1f48139
    R10: ffffffff8fa409c7 R11: fffffbfff1f48138 R12: ffff8880215e6220
    R13: ffffffff8fa409c0 R14: ffffc9000290fd30 R15: 1ffff92000521fa2
    FS: 00007f41c78f4800(0000) GS:ffff88802d000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f95c803d088 CR3: 0000000025ed2000 CR4: 00000000000006f0
    Call Trace:
    __mptcp_destroy_sock+0x4f5/0x8e0
    mptcp_close+0x5e2/0x7f0
    inet_release+0x12b/0x270
    __sock_release+0xc8/0x270
    sock_close+0x18/0x20
    __fput+0x272/0x8e0
    task_work_run+0xe0/0x1a0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x19/0x50
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Userspace programs can provide arbitrarily high values of 'len' in
    sendmsg(): this causes an integer overflow of 'amount'. Cap forward
    allocation to 1 megabyte: higher values are not really useful.
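
    A sketch of the capping step, assuming the amount ends up in a signed
    int; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FWD_ALLOC_CAP ((size_t)1 << 20)	/* 1 megabyte */

/* Clamp a user-supplied length before it is used for reservation. */
size_t cap_forward_alloc(size_t len)
{
	return len < FWD_ALLOC_CAP ? len : FWD_ALLOC_CAP;
}

/* The capped value always fits in an int, so the conversion cannot wrap. */
int reserve_amount(size_t len)
{
	size_t capped = cap_forward_alloc(len);

	assert(capped <= FWD_ALLOC_CAP);
	return (int)capped;
}
```

    The length 0xffffffe7 from the reproducer is reduced to 1048576 here,
    whereas small lengths pass through unchanged.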

    Suggested-by: Paolo Abeni
    Fixes: e93da92896bc ("mptcp: implement wmem reservation")
    Signed-off-by: Davide Caratti
    Link: https://lore.kernel.org/r/3334d00d8b2faecafdfab9aa593efcbf61442756.1608584474.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti
     
  • Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps rxqs, resulting in
    various oops and invalid memory accesses:

    1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

    - netif_set_xps_queue uses dev->num_tc as one of the parameters to
    compute the size of new_dev_maps when allocating it. dev->num_tc is
    also used to access the map, and the compiler may generate code to
    retrieve this field multiple times in the function.

    - netdev_set_num_tc sets dev->num_tc.

    If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
    is set to a higher value through netdev_set_num_tc, later accesses to
    new_dev_maps in netif_set_xps_queue could lead to accessing memory
    outside of new_dev_maps, triggering an oops.

    2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

    2.1. netdev_set_num_tc starts by resetting the xps queues,
    dev->num_tc isn't updated yet.

    2.2. netif_set_xps_queue is called, setting up the map with the
    *old* dev->num_tc.

    2.3. netdev_set_num_tc updates dev->num_tc.

    2.4. Later accesses to the map lead to out of bound accesses and
    oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_rxqs in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_rxqs_store.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps cpus, resulting in
    various oops and invalid memory accesses:

    1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

    - netif_set_xps_queue uses dev->num_tc as one of the parameters to
    compute the size of new_dev_maps when allocating it. dev->num_tc is
    also used to access the map, and the compiler may generate code to
    retrieve this field multiple times in the function.

    - netdev_set_num_tc sets dev->num_tc.

    If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
    is set to a higher value through netdev_set_num_tc, later accesses to
    new_dev_maps in netif_set_xps_queue could lead to accessing memory
    outside of new_dev_maps, triggering an oops.

    2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

    2.1. netdev_set_num_tc starts by resetting the xps queues,
    dev->num_tc isn't updated yet.

    2.2. netif_set_xps_queue is called, setting up the map with the
    *old* dev->num_tc.

    2.3. netdev_set_num_tc updates dev->num_tc.

    2.4. Later accesses to the map lead to out of bound accesses and
    oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_cpus in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_cpus_store.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
    GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
    It's not a big deal, but at the same time easy to avoid.

    The actual alignment requirement is dynamic, queryable with
    crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
    be stricter than 16 bytes for our algorithms.

    Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • auth_signature frame is 68 bytes in plain mode and 96 bytes in
    secure mode but we are requesting 68 bytes in both modes. By luck,
    this doesn't actually result in any invalid memory accesses because
    the allocation is satisfied out of kmalloc-96 slab and so exactly
    96 bytes are allocated, but KASAN rightfully complains.

    Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
    Reported-by: Luis Henriques
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

28 Dec, 2020

2 commits

  • The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
    supports multiple expressions per set element. In the same
    direction, NFT_DYNSET_F_EXPR specifies that the dynset expression defines
    multiple expressions per set element.

    This allows new userspace software with old kernels to bail out with
    EOPNOTSUPP. This update is similar to ef516e8625dd ("netfilter:
    nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
    needs to be set when the NFTA_SET_EXPRESSIONS attribute is specified.
    The NFT_SET_EXPR flag is not set with NFTA_SET_EXPR to retain
    backward compatibility in old userspace binaries.

    Fixes: 48b0ae046ee9 ("netfilter: nftables: netlink support for several set element expressions")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • If userspace requests a feature which is not available in the original set
    definition, then bail out with EOPNOTSUPP. If userspace sends
    unsupported dynset flags (new feature not supported by this kernel),
    then report EOPNOTSUPP to userspace. EINVAL should be only used to
    report malformed netlink messages from userspace.
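
    The error-code split can be sketched as below. The request struct, the
    flag name and the function are hypothetical; only the distinction between
    the two return values reflects the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_INVERT (1u << 0)	/* hypothetical flag known to this "kernel" */
#define KNOWN_FLAGS FLAG_INVERT

struct request {
	bool well_formed;	/* did the message parse correctly? */
	uint32_t flags;		/* feature flags requested by userspace */
};

int check_request(const struct request *req)
{
	assert(req != NULL);
	if (!req->well_formed)
		return -EINVAL;		/* malformed message */
	if (req->flags & ~KNOWN_FLAGS)
		return -EOPNOTSUPP;	/* valid message, unsupported feature */
	return 0;
}
```

    A well-formed request carrying an unknown flag bit gets -EOPNOTSUPP, so
    newer userspace can tell "kernel too old" apart from "my message is
    broken".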

    Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

27 Dec, 2020

1 commit

  • syzbot reports:
    detected buffer overflow in strlen
    [..]
    Call Trace:
    strlen include/linux/string.h:325 [inline]
    strlcpy include/linux/string.h:348 [inline]
    xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143

    strlcpy assumes src is a C string. Check info->name before it is used.
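
    One way to perform such a check on a fixed-size field, sketched in user
    space with a hypothetical field length and function name. The error code
    is my choice for the sketch and is not taken from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 16	/* hypothetical size of the fixed name field */

/* Accept the field only if a terminating NUL exists within its bounds. */
int check_name(const char *name)
{
	assert(name != NULL);
	if (memchr(name, '\0', NAME_LEN) == NULL)
		return -ENAMETOOLONG;
	return 0;
}
```

    Only after this check passes is it safe to hand the buffer to functions
    that scan for the terminator themselves.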

    Reported-by: syzbot+e86f7c428c8c50db65b4@syzkaller.appspotmail.com
    Fixes: 5859034d7eb8793 ("[NETFILTER]: x_tables: add RATEEST target")
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

24 Dec, 2020

2 commits

  • When aggregating ncsi interfaces and dedicated interfaces to bond
    interfaces, the ncsi response handler will use the wrong net device to
    find ncsi_dev, so that the ncsi interface will not work properly.
    Here, we use the original net device to fix it.

    Fixes: 138635cc27c9 ("net/ncsi: NCSI response packet handler")
    Signed-off-by: John Wang
    Link: https://lore.kernel.org/r/20201223055523.2069-1-wangzhiqiang.bj@bytedance.com
    Signed-off-by: Jakub Kicinski

    John Wang
     
  • DCB uses the same handler function for both RTM_GETDCB and RTM_SETDCB
    messages. dcb_doit() bounces RTM_SETDCB messages if the user does not have
    the CAP_NET_ADMIN capability.

    However, the operation to be performed is not decided from the DCB message
    type, but from the DCB command. Thus DCB_CMD_*_GET commands are used for
    reading DCB objects, the corresponding SET and DEL commands are used for
    manipulation.

    The assumption is that set-like commands will be sent via an RTM_SETDCB
    message, and get-like ones via RTM_GETDCB. However, this assumption is not
    enforced.

    It is therefore possible to manipulate DCB objects without CAP_NET_ADMIN
    capability by sending the corresponding command in an RTM_GETDCB message.
    That is a bug. Fix it by validating the type of the request message against
    the type used for the response.
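
    A simplified model of the validation, with hypothetical message types,
    commands and a lookup table standing in for the real dispatch table:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum msg_type { MSG_GET, MSG_SET };	/* hypothetical message types */
enum cmd { CMD_STATE_GET, CMD_STATE_SET, CMD_COUNT };	/* hypothetical commands */

/* Each command records the message type it is expected to arrive in. */
static const enum msg_type required_type[CMD_COUNT] = {
	[CMD_STATE_GET] = MSG_GET,
	[CMD_STATE_SET] = MSG_SET,
};

int dispatch(enum msg_type type, enum cmd cmd, bool is_admin)
{
	assert(cmd < CMD_COUNT);
	if (type == MSG_SET && !is_admin)
		return -EPERM;	/* privilege check keyed on the message type */
	if (required_type[cmd] != type)
		return -EPERM;	/* e.g. a SET command smuggled in a GET message */
	return 0;
}
```

    Without the second check, an unprivileged caller could pass the first
    one by choosing the GET message type and still run a SET command.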

    Fixes: 2f90b8657ec9 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
    Signed-off-by: Petr Machata
    Link: https://lore.kernel.org/r/a2a9b88418f3a58ef211b718f2970128ef9e3793.1608673640.git.me@pmachata.org
    Signed-off-by: Jakub Kicinski

    Petr Machata
     

22 Dec, 2020

1 commit

  • Pull 9p update from Dominique Martinet:

    - fix long-standing limitation on open-unlink-fop pattern

    - add refcount to p9_fid (fixes the above and will allow for more
    cleanups and simplifications in the future)

    * tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
    9p: Remove unnecessary IS_ERR() check
    9p: Uninitialized variable in v9fs_writeback_fid()
    9p: Fix writeback fid incorrectly being attached to dentry
    9p: apply review requests for fid refcounting
    9p: add refcount to p9_fid struct
    fs/9p: search open fids first
    fs/9p: track open fids
    fs/9p: fix create-unlink-getattr idiom

    Linus Torvalds
     

19 Dec, 2020

3 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    1) Incorrect loop in error path of nft_set_elem_expr_clone(),
    from Colin Ian King.

    2) Missing xt_table_get_private_protected() to access table
    private data in x_tables, from Subash Abhinov Kasiviswanathan.

    3) Possible oops in ipset hash type resize, from Vasily Averin.

    4) Fix shift-out-of-bounds in ipset hash type, also from Vasily.

    * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
    netfilter: ipset: fix shift-out-of-bounds in htable_bits()
    netfilter: ipset: fixes possible oops in mtype_resize
    netfilter: x_tables: Update remaining dereference to RCU
    netfilter: nftables: fix incorrect increment of loop counter
    ====================

    Link: https://lore.kernel.org/r/20201218120409.3659-1-pablo@netfilter.org
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • taprio_graft() can insert a NULL element in the array of child qdiscs. As
    a consequence, taprio_reset() might not reset child qdiscs completely, and
    taprio_destroy() might leak resources. Fix it by ensuring that loops that
    iterate over q->qdiscs[] don't end when they find the first NULL item.
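
    The two loop shapes can be compared with a small sketch over a plain
    pointer array; the function names are hypothetical and the array stands
    in for q->qdiscs[].

```c
#include <assert.h>
#include <stddef.h>

/* Loop that ends at the first NULL slot: later entries are never visited. */
size_t visit_until_null(void *const *slots, size_t n)
{
	size_t visited = 0;

	assert(slots != NULL);
	for (size_t i = 0; i < n && slots[i]; i++)
		visited++;
	return visited;
}

/* Loop that walks every slot and merely skips the NULL ones. */
size_t visit_all(void *const *slots, size_t n)
{
	size_t visited = 0;

	assert(slots != NULL);
	for (size_t i = 0; i < n; i++) {
		if (slots[i])
			visited++;
	}
	return visited;
}
```

    With a NULL in the middle of a three-entry array, the first loop handles
    one child and the second handles both non-NULL children.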

    Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Suggested-by: Jakub Kicinski
    Signed-off-by: Davide Caratti
    Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti
     
  • On 64-bit systems the packet procfs header field names following 'sk'
    are not aligned correctly:

    sk       RefCnt Type Proto  Iface R Rmem   User   Inode
    00000000605d2c64 3      3    0003   7     1 450880 0      16643
    00000000080e9b80 2      2    0000   0     0 0      0      17404
    00000000b23b8a00 2      2    0000   0     0 0      0      17421
    ...

    With this change field names are correctly aligned:

    sk               RefCnt Type Proto  Iface R Rmem   User   Inode
    000000005c3b1d97 3      3    0003   7     1 21568  0      16178
    000000007be55bb7 3      3    fbce   8     1 0      0      16250
    00000000be62127d 3      3    fbcd   8     1 0      0      16254
    ...
    ...
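
    One way to get this kind of alignment is to pad the "sk" column to the
    printed width of a pointer by passing the field width as an argument.
    The following is a minimal userspace sketch under that assumption;
    format_header() is a hypothetical helper, and the actual kernel change
    may be written differently.

```c
#include <assert.h>
#include <stdio.h>

/* Write the header line into buf, padding "sk" so that the next column
 * starts where it would after a printed pointer plus one space.
 * Returns what snprintf() returns. */
int format_header(char *buf, size_t len)
{
    /* Two hex digits per pointer byte plus one separating space.
     * A negative width passed through '*' means left-justify. */
    int width = -(int)(2 * sizeof(void *) + 1);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "%*sRefCnt Type Proto  Iface R Rmem   User   Inode\n",
                    width, "sk");
}
```

    On a system with 8-byte pointers the "sk" field becomes 17 characters
    wide, and with 4-byte pointers it becomes 9.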

    Signed-off-by: Baruch Siach
    Link: https://lore.kernel.org/r/54917251d8433735d9a24e935a6cb8eb88b4058a.1608103684.git.baruch@tkos.co.il
    Signed-off-by: Jakub Kicinski

    Baruch Siach
     

18 Dec, 2020

7 commits

  • Roll back the reservation in the completion ring when we get a
    NETDEV_TX_BUSY. When this error is received from the driver, we are
    supposed to let the user application retry the transmit. And in
    order to do this, we need to roll back the failed send so it can be
    retried. Unfortunately, we did not cancel the reservation we had made
    in the completion ring. By not doing this, we actually make the
    completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
    after enough of these errors the completion ring will be of size zero
    and transmit will stop working.

    Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
    error.
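
    The shrinking effect described above can be modelled with a simple
    counter. This is a rough sketch and not the xsk implementation: struct
    ring, ring_reserve(), ring_cancel() and send_one() are hypothetical
    names, and the driver result is simulated with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical producer-side view of a ring; not the kernel's xsk_queue. */
struct ring {
    uint32_t size;       /* total number of entries */
    uint32_t reserved;   /* entries reserved and not yet released */
};

/* Reserve one entry; fails when no free entry is left. */
bool ring_reserve(struct ring *r)
{
    assert(r->reserved <= r->size);
    if (r->reserved == r->size)
        return false;
    r->reserved++;
    return true;
}

/* Give back one reservation made by ring_reserve(). */
void ring_cancel(struct ring *r)
{
    if (r->reserved > 0)
        r->reserved--;
}

/* Attempt one send. driver_busy simulates the driver reporting
 * NETDEV_TX_BUSY. Returns 0 on success, -1 if the caller should retry. */
int send_one(struct ring *r, bool driver_busy, bool cancel_on_busy)
{
    if (!ring_reserve(r))
        return -1;
    if (driver_busy) {
        if (cancel_on_busy)
            ring_cancel(r);   /* the fix: return the entry to the ring */
        return -1;
    }
    r->reserved--;   /* simulate the completion freeing the entry at once */
    return 0;
}
```

    Without the cancel, each simulated busy error permanently consumes one
    entry, so a ring of size two stops accepting sends after two such errors.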

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix a race when multiple sockets are simultaneously calling sendto()
    when the completion ring is shared in the SKB case. This is the case
    when you share the same netdev and queue id through the
    XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
    be in xsk_generic_xmit() and call the backpressure mechanism in
    xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
    specific scenario, a race might occur since the rings are
    single-producer single-consumer.

    Fix this by moving the tx_completion_lock from the socket to the pool
    as the pool is shared between the sockets that share the completion
    ring. (The pool is not shared when this is not the case.) And then
    protect the accesses to xskq_prod_reserve() with this lock. The
    tx_completion_lock is renamed cq_lock to better reflect that it
    protects accesses to the potentially shared completion ring.
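
    The idea of keeping the lock next to the shared resource can be sketched
    in userspace with a POSIX mutex. This is an illustration only: struct
    pool, struct xsock, pool_init() and sock_reserve_cq() are hypothetical
    names, and a pthread mutex stands in for whatever locking primitive the
    kernel code uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared pool: owns the completion ring state and its lock. */
struct pool {
    pthread_mutex_t cq_lock;   /* protects cq_reserved */
    uint32_t cq_size;
    uint32_t cq_reserved;
};

/* Hypothetical socket: sockets sharing a completion ring share one pool. */
struct xsock {
    struct pool *pool;
};

void pool_init(struct pool *p, uint32_t cq_size)
{
    pthread_mutex_init(&p->cq_lock, NULL);
    p->cq_size = cq_size;
    p->cq_reserved = 0;
}

/* Reserve one completion entry on behalf of a socket. The lock lives in
 * the pool, so all sockets using that pool serialize on the same lock. */
bool sock_reserve_cq(struct xsock *s)
{
    struct pool *p;
    bool ok = false;

    assert(s != NULL && s->pool != NULL);
    p = s->pool;
    pthread_mutex_lock(&p->cq_lock);
    if (p->cq_reserved < p->cq_size) {
        p->cq_reserved++;
        ok = true;
    }
    pthread_mutex_unlock(&p->cq_lock);
    return ok;
}
```

    A per-socket lock would not help here, because two sockets would take
    two different locks while updating the same counter.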

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix a possible memory leak when a bind of an AF_XDP socket fails. When
    the fill and completion rings are created, they are tied to the
    socket. But when the buffer pool is later created at bind time, the
    ownership of these two rings is transferred to the buffer pool as
    they might be shared between sockets (and the buffer pool cannot be
    created until we know what we are binding to). So, before the buffer
    pool is created, these two rings are cleaned up with the socket, and
    after they have been transferred they are cleaned up together with
    the buffer pool.

    The problem is that ownership was transferred before it was absolutely
    certain that the buffer pool could be created and initialized
    correctly, and when one of these errors occurred, the fill and
    completion rings belonged to neither the socket nor the pool and
    were therefore leaked. Solve this by moving the ownership transfer
    to the point where the buffer pool has been completely set up and
    there is no way it can fail.
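
    The ordering described above, where ownership moves only after every
    fallible step has succeeded, can be sketched as follows. This is a
    simplified userspace model: struct sock_state, struct pool_state and
    bind_sketch() are hypothetical, and pool setup failure is simulated
    with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct ring { int unused; };

/* Hypothetical owners of the fill (fq) and completion (cq) rings. */
struct sock_state { struct ring *fq; struct ring *cq; };
struct pool_state { struct ring *fq; struct ring *cq; };

/* setup_ok simulates whether pool creation and initialization succeed.
 * Returns the new pool, or NULL on failure. */
struct pool_state *bind_sketch(struct sock_state *s, bool setup_ok)
{
    struct pool_state *p;

    assert(s != NULL);
    p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    if (!setup_ok) {
        /* Failure before the hand-over: the socket still owns both
         * rings, so its normal teardown path can free them. */
        free(p);
        return NULL;
    }
    /* Nothing can fail past this point, so transfer ownership now. */
    p->fq = s->fq;
    p->cq = s->cq;
    s->fq = NULL;
    s->cq = NULL;
    return p;
}
```

    At every exit from the function exactly one of the two structures holds
    the ring pointers, which is the property the leak violated.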

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Pull networking fixes from Jakub Kicinski:
    "Current release - always broken:

    - net/smc: fix access to parent of an ib device

    - devlink: use _BITUL() macro instead of BIT() in the UAPI header

    - handful of mptcp fixes

    Previous release - regressions:

    - intel: AF_XDP: clear the status bits for the next_to_use descriptor

    - dpaa2-eth: fix the size of the mapped SGT buffer

    Previous release - always broken:

    - mptcp: fix security context on server socket

    - ethtool: fix string set id check

    - ethtool: fix error paths in ethnl_set_channels()

    - lan743x: fix rx_napi_poll/interrupt ping-pong

    - qca: ar9331: fix sleeping function called from invalid context bug"

    * tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
    net/sched: sch_taprio: reset child qdiscs before freeing them
    nfp: move indirect block cleanup to flower app stop callback
    octeontx2-af: Fix undetected unmap PF error check
    net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments"
    qlcnic: Fix error code in probe
    mptcp: fix pending data accounting
    mptcp: push pending frames when subflow has free space
    mptcp: properly annotate nested lock
    mptcp: fix security context on server socket
    net/mlx5: Fix compilation warning for 32-bit platform
    mptcp: clear use_ack and use_map when dropping other suboptions
    devlink: use _BITUL() macro instead of BIT() in the UAPI header
    net: korina: fix return value
    net/smc: fix access to parent of an ib device
    ethtool: fix error paths in ethnl_set_channels()
    nfc: s3fwrn5: Remove unused NCI prop commands
    nfc: s3fwrn5: Remove the delay for NFC sleep
    phy: fix kdoc warning
    tipc: do sanity check payload of a netlink message
    use __netdev_notify_peers in hyperv
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Features:

    - NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
    support

    - A series of patches to improve readdir performance, particularly
    with large directories

    - Basic support for using NFS/RDMA with the pNFS files and flexfiles
    drivers

    - Micro-optimisations for RDMA

    - RDMA tracing improvements

    Bugfixes:

    - Fix a long standing bug with xs_read_xdr_buf() when receiving
    partial pages (Dan Aloni)

    - Various fixes for getxattr and listxattr, when used over non-TCP
    transports

    - Fixes for containerised NFS from Sargun Dhillon

    - switch nfsiod to be an UNBOUND workqueue (Neil Brown)

    - READDIR should not ask for security label information if there is
    no LSM policy (Olga Kornievskaia)

    - Avoid using interval-based rebinding with TCP in lockd (Calum
    Mackay)

    - A series of RPC and NFS layer fixes to support the NFSv4.2
    READ_PLUS code

    - A couple of fixes for pnfs/flexfiles read failover

    Cleanups:

    - Various cleanups for the SUNRPC xdr code in conjunction with the
    READ_PLUS fixes"

    * tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
    NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
    pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
    NFSv4/pnfs: Add tracing for the deviceid cache
    fs/lockd: convert comma to semicolon
    NFSv4.2: fix error return on memory allocation failure
    NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
    NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
    NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
    NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
    NFSv4.2: decode_read_plus_hole() needs to check the extent offset
    NFSv4.2: decode_read_plus_data() must skip padding after data segment
    NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
    SUNRPC: When expanding the buffer, we may need grow the sparse pages
    SUNRPC: Cleanup - constify a number of xdr_buf helpers
    SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
    SUNRPC: _copy_to/from_pages() now check for zero length
    SUNRPC: Cleanup xdr_shrink_bufhead()
    SUNRPC: Fix xdr_expand_hole()
    SUNRPC: Fixes for xdr_align_data()
    SUNRPC: _shift_data_left/right_pages should check the shift length
    ...

    Linus Torvalds
     
  • Pull ceph updates from Ilya Dryomov:
    "The big ticket item here is support for msgr2 on-wire protocol, which
    adds the option of full in-transit encryption using AES-GCM algorithm
    (myself).

    On top of that we have a series to avoid intermittent errors during
    recovery with recover_session=clean and some MDS request encoding work
    from Jeff, a cap handling fix and assorted observability improvements
    from Luis and Xiubo and a good number of cleanups.

    Luis also ran into a corner case with quotas which sadly means that we
    are back to denying cross-quota-realm renames"

    * tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
    libceph: drop ceph_auth_{create,update}_authorizer()
    libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
    libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
    libceph: introduce connection modes and ms_mode option
    libceph, rbd: ignore addr->type while comparing in some cases
    libceph, ceph: get and handle cluster maps with addrvecs
    libceph: factor out finish_auth()
    libceph: drop ac->ops->name field
    libceph: amend cephx init_protocol() and build_request()
    libceph, ceph: incorporate nautilus cephx changes
    libceph: safer en/decoding of cephx requests and replies
    libceph: more insight into ticket expiry and invalidation
    libceph: move msgr1 protocol specific fields to its own struct
    libceph: move msgr1 protocol implementation to its own file
    libceph: separate msgr1 protocol implementation
    libceph: export remaining protocol independent infrastructure
    libceph: export zero_page
    libceph: rename and export con->flags bits
    libceph: rename and export con->state states
    libceph: make con->state an int
    ...

    Linus Torvalds
     
  • syzkaller shows that packets can still be dequeued while taprio_destroy()
    is running. Let sch_taprio use the reset() function to cancel the advance
    timer and drop all skbs from the child qdiscs.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
    Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
    Signed-off-by: Davide Caratti
    Acked-by: Vinicius Costa Gomes
    Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti