20 Nov, 2020

1 commit

  • Currently a single header contains declarations for both SHA-1 and
    SHA-2, and a separate header contains declarations for SHA-3.

    This organization is inconsistent, but more importantly SHA-1 is no
    longer considered to be cryptographically secure. So to the extent
    possible, SHA-1 shouldn't be grouped together with any of the other SHA
    versions, and usage of it should be phased out.

    Therefore, split the combined header into two headers, one for SHA-1
    (sha1.h) and one for SHA-2, and make everyone explicitly specify whether
    they want the declarations for SHA-1, SHA-2, or both.

    This avoids making the SHA-1 declarations visible to files that don't
    want anything to do with SHA-1. It also prepares for potentially moving
    sha1.h into a new insecure/ or dangerous/ directory.

    Signed-off-by: Eric Biggers
    Acked-by: Ard Biesheuvel
    Acked-by: Jason A. Donenfeld
    Signed-off-by: Herbert Xu

    Eric Biggers
     

26 Oct, 2020

1 commit

  • Commit 453431a54934 ("mm, treewide: rename kzfree() to
    kfree_sensitive()") renamed kzfree() to kfree_sensitive(),
    but it left a compatibility definition of kzfree() to avoid
    being too disruptive.

    Since then a few more instances of kzfree() have slipped in.

    Just get rid of them and remove the compatibility definition
    once and for all.

    Signed-off-by: Eric Biggers
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

25 Oct, 2020

1 commit

  • With the removal of the interrupt perturbations in previous random32
    change (random32: make prandom_u32() output unpredictable), the PRNG
    has become 100% deterministic again. While SipHash is expected to be
    way more robust against brute force than the previous Tausworthe LFSR,
    there's still the risk that whoever has even one temporary access to
    the PRNG's internal state is able to predict all subsequent draws till
    the next reseed (roughly every minute). This may happen through a side
    channel attack or any data leak.

    This patch restores the spirit of commit f227e3ec3b5c ("random32: update
    the net random state on interrupt and activity") in that it will perturb
    the internal PRNG's state using externally collected noise, except that
    it will not pick that noise from the random pool's bits nor upon
    interrupt, but will rather combine a few elements along the Tx path
    that are collectively hard to predict, such as dev, skb and txq
    pointers, packet length and jiffies values. These ones are combined
    using a single round of SipHash into a single long variable that is
    mixed with the net_rand_state upon each invocation.

    The operation was inlined because it produces very small and efficient
    code, typically 3 xor, 2 add and 2 rol. The performance was measured
    to be the same as (even very slightly better than) before the switch to
    SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
    (i40e), the connection rate dropped from 556k/s to 555k/s while the
    SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

    Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
    Cc: George Spelvin
    Cc: Amit Klein
    Cc: Eric Dumazet
    Cc: "Jason A. Donenfeld"
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: tytso@mit.edu
    Cc: Florian Westphal
    Cc: Marc Plumb
    Tested-by: Sedat Dilek
    Signed-off-by: Willy Tarreau

    Willy Tarreau
     

24 Oct, 2020

1 commit

  • Pull networking fixes from Jakub Kicinski:
    "Cross-tree/merge window issues:

    - rtl8150: don't incorrectly assign random MAC addresses; fix late in
    the 5.9 cycle started depending on a return code from a function
    which changed with the 5.10 PR from the usb subsystem

    Current release regressions:

    - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
    crashes at probe when control vq was not negotiated/available

    Previous release regressions:

    - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
    bus, only first device would be probed correctly

    - nexthop: Fix performance regression in nexthop deletion by
    effectively switching from recently added synchronize_rcu() to
    synchronize_rcu_expedited()

    - netsec: ignore 'phy-mode' device property on ACPI systems; the
    property is not populated correctly by the firmware, but firmware
    configures the PHY so just keep boot settings

    Previous releases - always broken:

    - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
    bulk transfers getting "stuck"

    - icmp: randomize the global rate limiter to prevent attackers from
    getting useful signal

    - r8169: fix operation under forced interrupt threading, make the
    driver always use hard irqs, even on RT, given the handler is light
    and only wants to schedule napi (and do so through a _irqoff()
    variant, preferably)

    - bpf: Enforce pointer id generation for all may-be-null register
    type to avoid pointers erroneously getting marked as null-checked

    - tipc: re-configure queue limit for broadcast link

    - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
    tunnels

    - fix various issues in chelsio inline tls driver

    Misc:

    - bpf: improve just-added bpf_redirect_neigh() helper api to support
    supplying nexthop by the caller - in case BPF program has already
    done a lookup we can avoid doing another one

    - remove unnecessary break statements

    - make MPTCP not select IPV6, but rather depend on it"

    * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
    tcp: fix to update snd_wl1 in bulk receiver fast path
    net: Properly typecast int values to set sk_max_pacing_rate
    netfilter: nf_fwd_netdev: clear timestamp in forwarding path
    ibmvnic: save changed mac address to adapter->mac_addr
    selftests: mptcp: depends on built-in IPv6
    Revert "virtio-net: ethtool configurable RXCSUM"
    rtnetlink: fix data overflow in rtnl_calcit()
    net: ethernet: mtk-star-emac: select REGMAP_MMIO
    net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
    net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
    bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
    bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
    bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
    mptcp: depends on IPV6 but not as a module
    sfc: move initialisation of efx->filter_sem to efx_init_struct()
    mpls: load mpls_gso after mpls_iptunnel
    net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
    net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
    net: dsa: bcm_sf2: make const array static, makes object smaller
    mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
    ...

    Linus Torvalds
     

23 Oct, 2020

6 commits

  • In the header prediction fast path for a bulk data receiver, if no
    data is newly acknowledged then we do not call tcp_ack() and do not
    call tcp_ack_update_window(). This means that a bulk receiver that
    receives large amounts of data can have the incoming sequence numbers
    wrap, so that the check in tcp_may_update_window fails:
    after(ack_seq, tp->snd_wl1)

    If the incoming receive windows are zero in this state, and then the
    connection that was a bulk data receiver later wants to send data,
    that connection can find itself persistently rejecting the window
    updates in incoming ACKs. This means the connection can persistently
    fail to discover that the receive window has opened, which in turn
    means that the connection is unable to send anything, and the
    connection's sending process can get permanently "stuck".

    The fix is to update snd_wl1 in the header prediction fast path for a
    bulk data receiver, so that it keeps up and does not see wrapping
    problems.
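
    The wrap problem described above can be sketched in user-space C. This
    is a minimal illustration, not the kernel code: seq_before() and
    seq_after() are modelled on the usual wrap-safe 32-bit sequence
    comparison, and window_update_accepted() is a hypothetical helper that
    models only the after(ack_seq, tp->snd_wl1) leg of the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is before b" test on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Wrap-safe "a is after b" test, the shape of the after() check. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/*
 * Hypothetical helper: would a window update carrying ack_seq pass the
 * after(ack_seq, snd_wl1) test, given the last recorded snd_wl1?
 */
static bool window_update_accepted(uint32_t ack_seq, uint32_t snd_wl1)
{
	return seq_after(ack_seq, snd_wl1);
}
```

    With a stale snd_wl1 of 0, an incoming sequence number just below 2^31
    still compares as "after", but one just above 2^31 does not, which is
    the state in which window updates start being rejected. Keeping
    snd_wl1 close to the incoming sequence number avoids this.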

    This fix is based on a very nice and thorough analysis and diagnosis
    by Apollon Oikonomopoulos (see link below).

    This is a stable candidate but there is no Fixes tag here since the
    bug predates current git history. Just for fun: looks like the bug
    dates back to when header prediction was added in Linux v2.1.8 in Nov
    1996. In that version tcp_rcv_established() was added, and the code
    only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
    receiver" code path it does not call tcp_ack(). This fix seems to
    apply cleanly at least as far back as v3.2.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Neal Cardwell
    Reported-by: Apollon Oikonomopoulos
    Tested-by: Apollon Oikonomopoulos
    Link: https://www.spinics.net/lists/netdev/msg692430.html
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Neal Cardwell
     
  • In setsockopt(SO_MAX_PACING_RATE) on 64-bit systems, sk_max_pacing_rate,
    after being extended from 'u32' to 'unsigned long', takes an
    unintentionally inflated value whenever it is assigned from an 'int'
    value with MSB=1, due to binary sign extension when promoting s32 to
    u64, e.g. 0x80000000 becomes 0xFFFFFFFF80000000.

    The inflated sk_max_pacing_rate causes a subsequent getsockopt to return
    ~0U unexpectedly. It may also result in an increased pacing rate.

    Fix by explicitly casting the 'int' value to 'unsigned int' before
    assigning it to sk_max_pacing_rate, for zero extension to happen.
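
    The difference between the two conversions can be shown in a small
    user-space sketch. The function names are hypothetical, and uint64_t
    stands in for 'unsigned long' on a 64-bit system; this is not the
    kernel code itself.

```c
#include <stdint.h>

/* Buggy shape: a signed 32-bit value widened directly, so the sign bit
 * is replicated into the upper 32 bits. */
static uint64_t widen_direct(int32_t val)
{
	return (uint64_t)val;
}

/* Fixed shape: convert to unsigned 32-bit first, so the upper 32 bits
 * of the widened result are zero. */
static uint64_t widen_via_u32(int32_t val)
{
	return (uint64_t)(uint32_t)val;
}
```

    For values with the MSB clear, both conversions give the same result;
    they differ only when the MSB is set.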

    Fixes: 76a9ebe811fb ("net: extend sk_pacing_rate to unsigned long")
    Signed-off-by: Ji Li
    Signed-off-by: Ke Li
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com
    Signed-off-by: Jakub Kicinski

    Ke Li
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    1) Update debugging in IPVS tcp protocol handler to make it easier
    to understand, from longguang.yue

    2) Update TCP tracker to deal with keepalive packet after
    re-registration, from Francesco Ruggeri.

    3) Missing IP6SKB_FRAGMENTED from netfilter fragment reassembly,
    from Georg Kohmann.

    4) Fix bogus packet drop in ebtables nat extensions, from
    Timothée Cocault.

    5) Fix typo in flowtable documentation.

    6) Reset skb timestamp in nft_fwd_netdev.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-10-22

    1) Fix enforcing NULL check in verifier for new helper return types of
    RET_PTR_TO_{BTF_ID,MEM_OR_BTF_ID}_OR_NULL, from Martin KaFai Lau.

    2) Fix bpf_redirect_neigh() helper API before it becomes frozen by adding
    nexthop information as argument, from Toke Høiland-Jørgensen.

    3) Guard & fix compilation of bpf_tail_call_static() when __bpf__ arch is
    not defined by compiler or clang too old, from Daniel Borkmann.

    4) Remove misplaced break after return in attach_type_to_prog_type(), from
    Tom Rix.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Pull nfsd updates from Bruce Fields:
    "The one new feature this time, from Anna Schumaker, is READ_PLUS,
    which has the same arguments as READ but allows the server to return
    an array of data and hole extents.

    Otherwise it's a lot of cleanup and bugfixes"

    * tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits)
    NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy
    SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()
    sunrpc: raise kernel RPC channel buffer size
    svcrdma: fix bounce buffers for unaligned offsets and multiple pages
    nfsd: remove unneeded break
    net/sunrpc: Fix return value for sysctl sunrpc.transports
    NFSD: Encode a full READ_PLUS reply
    NFSD: Return both a hole and a data segment
    NFSD: Add READ_PLUS hole segment encoding
    NFSD: Add READ_PLUS data support
    NFSD: Hoist status code encoding into XDR encoder functions
    NFSD: Map nfserr_wrongsec outside of nfsd_dispatch
    NFSD: Remove the RETURN_STATUS() macro
    NFSD: Call NFSv2 encoders on error returns
    NFSD: Fix .pc_release method for NFSv2
    NFSD: Remove vestigial typedefs
    NFSD: Refactor nfsd_dispatch() error paths
    NFSD: Clean up nfsd_dispatch() variables
    NFSD: Clean up stale comments in nfsd_dispatch()
    NFSD: Clean up switch statement in nfsd_dispatch()
    ...

    Linus Torvalds
     
  • Pull 9p updates from Dominique Martinet:
    "A couple of small fixes (loff_t overflow on 32bit, syzbot
    uninitialized variable warning) and code cleanup (xen)"

    * tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux:
    net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
    9p/xen: Fix format argument warning
    9P: Cast to loff_t before multiplying

    Linus Torvalds
     

22 Oct, 2020

4 commits

  • Similar to 7980d2eabde8 ("ipvs: clear skb->tstamp in forwarding path").
    fq qdisc requires tstamp to be cleared in forwarding path.

    Fixes: 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths")
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • The "ip addr show" command fails when a physical network card has a
    large number of VFs.

    The return value of if_nlmsg_size() in rtnl_calcit() exceeds the range
    of the u16 data type when any network card has a large number of VFs,
    because rtnl_vfinfo_size() significantly increases the needed dump size
    as num_vfs grows.

    The overflow yields a wrong value of min_ifinfo_dump_size, which decides
    the memory size needed by the netlink dump, and netlink_dump() then
    returns -EMSGSIZE because not enough memory was allocated.

    So fix it by promoting the min_dump_alloc data type to u32 to avoid
    overflow of the whole netlink message size. This also aligns it with
    the data type of struct netlink_callback{}.min_dump_alloc, which is
    assigned from the return value of rtnl_calcit().
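
    The truncation can be illustrated with a short user-space sketch. All
    names and sizes here are hypothetical and chosen only to make the
    arithmetic visible; they are not the real netlink message sizes.

```c
#include <stdint.h>

/* Hypothetical size model: a fixed base plus a per-VF cost. */
static uint32_t needed_size(uint32_t base, uint32_t num_vfs, uint32_t per_vf)
{
	return base + num_vfs * per_vf;
}

/* Old shape: the size is kept in 16 bits and silently wraps. */
static uint16_t dump_alloc_u16(uint32_t needed)
{
	return (uint16_t)needed;
}

/* New shape: the size is kept in 32 bits and survives intact. */
static uint32_t dump_alloc_u32(uint32_t needed)
{
	return needed;
}
```

    With a base of 1000 bytes and 128 VFs at 600 bytes each, the needed
    size is 77800 bytes, which no longer fits in 16 bits, so the 16-bit
    variant reports a far smaller buffer than is required.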

    Signed-off-by: Di Zhu
    Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com
    Signed-off-by: Jakub Kicinski

    Di Zhu
     
  • Based on the discussion in [0], update the bpf_redirect_neigh() helper to
    accept an optional parameter specifying the nexthop information. This makes
    it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
    incurring a duplicate FIB lookup - since the FIB lookup helper will return
    the nexthop information even if no neighbour is present, this can simply
    be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
    BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.

    [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Daniel Borkmann
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • Pull ceph updates from Ilya Dryomov:

    - a patch that removes crush_workspace_mutex (myself). CRUSH
    computations are no longer serialized and can run in parallel.

    - a couple new filesystem client metrics for "ceph fs top" command
    (Xiubo Li)

    - a fix for a very old messenger bug that affected the filesystem,
    marked for stable (myself)

    - assorted fixups and cleanups throughout the codebase from Jeff and
    others.

    * tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits)
    libceph: clear con->out_msg on Policy::stateful_server faults
    libceph: format ceph_entity_addr nonces as unsigned
    libceph: fix ENTITY_NAME format suggestion
    libceph: move a dout in queue_con_delay()
    ceph: comment cleanups and clarifications
    ceph: break up send_cap_msg
    ceph: drop separate mdsc argument from __send_cap
    ceph: promote to unsigned long long before shifting
    ceph: don't SetPageError on readpage errors
    ceph: mark ceph_fmt_xattr() as printf-like for better type checking
    ceph: fold ceph_update_writeable_page into ceph_write_begin
    ceph: fold ceph_sync_writepages into writepage_nounlock
    ceph: fold ceph_sync_readpages into ceph_readpage
    ceph: don't call ceph_update_writeable_page from page_mkwrite
    ceph: break out writeback of incompatible snap context to separate function
    ceph: add a note explaining session reject error string
    libceph: switch to the new "osd blocklist add" command
    libceph, rbd, ceph: "blacklist" -> "blocklist"
    ceph: have ceph_writepages_start call pagevec_lookup_range_tag
    ceph: use kill_anon_super helper
    ...

    Linus Torvalds
     

21 Oct, 2020

13 commits

  • Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP's IPv6
    support also depends on CONFIG_IPV6. But not all functions from IPv6
    code are exported.

    To simplify the code and reduce modifications outside MPTCP, it was
    decided from the beginning to support MPTCP with IPv6 only if
    CONFIG_IPV6 was built-in. That's also why CONFIG_MPTCP_IPV6 was
    created. More modifications are needed to support CONFIG_IPV6=m.

    Even if it was not explicit, until recently, we were forcing CONFIG_IPV6
    to be built-in because we had "select IPV6" in Kconfig. Now that we have
    "depends on IPV6", we have to explicitly set "IPV6=y" to force
    CONFIG_IPV6 not to be built as a module.

    In other words, we can now only have CONFIG_MPTCP_IPV6=y if
    CONFIG_IPV6=y.

    Note that the new dependency might hide the fact IPv6 is not supported
    in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did
    before was forcing it to be built-in while it was maybe not what the
    user wants.

    Reported-by: Geert Uytterhoeven
    Fixes: 010b430d5df5 ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it")
    Signed-off-by: Matthieu Baerts
    Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.net
    Signed-off-by: Jakub Kicinski

    Matthieu Baerts
     
  • mpls_iptunnel is used only for MPLS encapsulation, and if the encapsulated
    packet is larger than the MTU we need mpls_gso for segmentation.

    Signed-off-by: Alexander Ovechkin
    Acked-by: Dmitry Yakunin
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20201020114333.26866-1-ovov@yandex-team.ru
    Signed-off-by: Jakub Kicinski

    Alexander Ovechkin
     
  • the following command

    # tc action add action tunnel_key \
    > set src_ip 2001:db8::1 dst_ip 2001:db8::2 id 10 erspan_opts 1:6789:0:0

    generates the following splat:

    BUG: KASAN: slab-out-of-bounds in tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
    Write of size 4 at addr ffff88813f5f1cc8 by task tc/873

    CPU: 2 PID: 873 Comm: tc Not tainted 5.9.0+ #282
    Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
    Call Trace:
    dump_stack+0x99/0xcb
    print_address_description.constprop.7+0x1e/0x230
    kasan_report.cold.13+0x37/0x7c
    tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
    tunnel_key_init+0x160c/0x1f40 [act_tunnel_key]
    tcf_action_init_1+0x5b5/0x850
    tcf_action_init+0x15d/0x370
    tcf_action_add+0xd9/0x2f0
    tc_ctl_action+0x29b/0x3a0
    rtnetlink_rcv_msg+0x341/0x8d0
    netlink_rcv_skb+0x120/0x380
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x719/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5ba/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f872a96b338
    Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
    RSP: 002b:00007ffffe367518 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 000000005f8f5aed RCX: 00007f872a96b338
    RDX: 0000000000000000 RSI: 00007ffffe367580 RDI: 0000000000000003
    RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000001c
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001
    R13: 0000000000686760 R14: 0000000000000601 R15: 0000000000000000

    Allocated by task 873:
    kasan_save_stack+0x19/0x40
    __kasan_kmalloc.constprop.7+0xc1/0xd0
    __kmalloc+0x151/0x310
    metadata_dst_alloc+0x20/0x40
    tunnel_key_init+0xfff/0x1f40 [act_tunnel_key]
    tcf_action_init_1+0x5b5/0x850
    tcf_action_init+0x15d/0x370
    tcf_action_add+0xd9/0x2f0
    tc_ctl_action+0x29b/0x3a0
    rtnetlink_rcv_msg+0x341/0x8d0
    netlink_rcv_skb+0x120/0x380
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x719/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5ba/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The buggy address belongs to the object at ffff88813f5f1c00
    which belongs to the cache kmalloc-256 of size 256
    The buggy address is located 200 bytes inside of
    256-byte region [ffff88813f5f1c00, ffff88813f5f1d00)
    The buggy address belongs to the page:
    page:0000000011b48a19 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13f5f0
    head:0000000011b48a19 order:1 compound_mapcount:0
    flags: 0x17ffffc0010200(slab|head)
    raw: 0017ffffc0010200 0000000000000000 0000000d00000001 ffff888107c43400
    raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88813f5f1b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88813f5f1c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff88813f5f1c80: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
    ^
    ffff88813f5f1d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88813f5f1d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

    When using IPv6 tunnels, act_tunnel_key allocates a fixed amount of memory for
    the tunnel metadata, but then it expects additional bytes to store tunnel
    specific metadata with tunnel_key_copy_opts().

    Fix the arguments of __ipv6_tun_set_dst(), so that 'md_size' contains the
    size previously computed by tunnel_key_get_opts_len(), like it's done for
    IPv4 tunnels.
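
    The allocation issue can be sketched in user space as follows. struct
    tun_md and both helpers are hypothetical stand-ins for the tunnel
    metadata and its option area, not the kernel structures; the point is
    only that the size passed at allocation time must cover the option
    bytes copied later.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for tunnel metadata with trailing option bytes. */
struct tun_md {
	size_t opts_cap;	/* bytes reserved for options at allocation */
	size_t opts_len;	/* bytes of options actually stored */
	uint8_t opts[];		/* option bytes follow the fixed part */
};

/* Allocate metadata with room for md_size bytes of options. */
static struct tun_md *tun_md_alloc(size_t md_size)
{
	struct tun_md *md = calloc(1, sizeof(*md) + md_size);

	if (md)
		md->opts_cap = md_size;
	return md;
}

/* Copy options; returns 0 on success, -1 if they would not fit. */
static int tun_md_copy_opts(struct tun_md *md, const void *opts, size_t len)
{
	assert(md != NULL);
	if (len > md->opts_cap)
		return -1;
	memcpy(md->opts, opts, len);
	md->opts_len = len;
	return 0;
}
```

    Allocating with a size of 0 and then copying 8 bytes of options is the
    shape of the bug; the sketch rejects the copy, whereas an unchecked
    copy would write past the allocation as in the KASAN report.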

    Fixes: 0ed5269f9e41 ("net/sched: add tunnel option support to act_tunnel_key")
    Reported-by: Shuang Li
    Signed-off-by: Davide Caratti
    Acked-by: Cong Wang
    Link: https://lore.kernel.org/r/36ebe969f6d13ff59912d6464a4356fe6f103766.1603231100.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti
     
  • We need to jump to the "err_out_locked" label when
    tcf_gate_get_entries() fails. Otherwise, tc_setup_flow_action() exits
    with ->tcfa_lock still held.
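
    The error-path pattern can be sketched as below. struct fake_lock and
    setup_action() are hypothetical and merely model a lock with a flag;
    the only name taken from the text above is the err_out_locked label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock that only records whether it is held. */
struct fake_lock {
	bool held;
};

static void fake_lock_acquire(struct fake_lock *l) { l->held = true; }
static void fake_lock_release(struct fake_lock *l) { l->held = false; }

/*
 * get_entries_rc models the result of a step that can fail while the
 * lock is held. Every exit taken after the lock is acquired goes
 * through the label that releases it.
 */
static int setup_action(struct fake_lock *lock, int get_entries_rc)
{
	int err;

	assert(lock != NULL);
	fake_lock_acquire(lock);

	err = get_entries_rc;
	if (err)
		goto err_out_locked;

	/* Further setup that needs the lock would go here. */

err_out_locked:
	fake_lock_release(lock);
	return err;
}
```

    Returning directly from the failing step instead of jumping to the
    label would leave the lock held, which is the bug being fixed.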

    Fixes: d29bdd69ecdd ("net: schedule: add action gate offloading")
    Signed-off-by: Guillaume Nault
    Acked-by: Cong Wang
    Link: https://lore.kernel.org/r/12f60e385584c52c22863701c0185e40ab08a7a7.1603207948.git.gnault@redhat.com
    Signed-off-by: Jakub Kicinski

    Guillaume Nault
     
  • MPTCP_IPV6 selects IPV6, thus enabling an optional feature the user may
    not want to enable. Fix this by making MPTCP_IPV6 depend on IPV6, like
    is done for all other IPv6 features.

    Fixes: f870fa0b5768842c ("mptcp: Add MPTCP socket stubs")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Matthieu Baerts
    Link: https://lore.kernel.org/r/20201020073839.29226-1-geert@linux-m68k.org
    Signed-off-by: Jakub Kicinski

    Geert Uytterhoeven
     
  • Check that the NFC_ATTR_FIRMWARE_NAME attributes are provided by
    the netlink client prior to accessing them. This prevents potential
    unhandled NULL pointer dereference exceptions which can be triggered
    by malicious user-mode programs, if they omit one or both of these
    attributes.
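
    A minimal sketch of the presence check, in user-space C. The attribute
    table, the enum values and fw_download_request() are hypothetical; in
    particular, treating a device index as the second required attribute
    is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute indices; ATTR_DEVICE_INDEX is an assumption. */
enum { ATTR_DEVICE_INDEX, ATTR_FIRMWARE_NAME, ATTR_MAX };

/*
 * attrs[i] is NULL when the client did not send attribute i. Both
 * required attributes are checked before either one is dereferenced.
 */
static int fw_download_request(const char *const attrs[ATTR_MAX],
			       const char **fw_name)
{
	assert(attrs != NULL && fw_name != NULL);

	if (!attrs[ATTR_DEVICE_INDEX] || !attrs[ATTR_FIRMWARE_NAME])
		return -EINVAL;

	*fw_name = attrs[ATTR_FIRMWARE_NAME];
	return 0;
}
```

    A request that omits either attribute is rejected with -EINVAL before
    any pointer is followed.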

    Similar to commit a0323b979f81 ("nfc: Ensure presence of required attributes in the activate_target handler").

    Fixes: 9674da8759df ("NFC: Add firmware upload netlink command")
    Signed-off-by: Defang Bo
    Link: https://lore.kernel.org/r/1603107538-4744-1-git-send-email-bodefang@126.com
    Signed-off-by: Jakub Kicinski

    Defang Bo
     
  • MPTCP_KUNIT_TESTS selects MPTCP, thus enabling an optional feature the
    user may not want to enable. Fix this by making the test depend on
    MPTCP instead.

    Fixes: a00a582203dbc43e ("mptcp: move crypto test to KUNIT")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Matthieu Baerts
    Link: https://lore.kernel.org/r/20201019113240.11516-1-geert@linux-m68k.org
    Signed-off-by: Jakub Kicinski

    Geert Uytterhoeven
     
  • Move mptcp_options_received's port initialization from
    mptcp_parse_option to mptcp_get_options, put it together with
    the other fields initializations of mptcp_options_received.

    Signed-off-by: Geliang Tang
    Reviewed-by: Matthieu Baerts
    Signed-off-by: Jakub Kicinski

    Geliang Tang
     
  • Initialize mptcp_options_received's ahmac to zero, otherwise it
    will be a random number when receiving ADD_ADDR suboption with echo-flag=1.

    Fixes: 3df523ab582c5 ("mptcp: Add ADD_ADDR handling")
    Signed-off-by: Geliang Tang
    Reviewed-by: Matthieu Baerts
    Signed-off-by: Jakub Kicinski

    Geliang Tang
     
  • Need to use the udp header type and not tcp.

    Fixes: 9c26ba9b1f45 ("net/sched: act_ct: Instantiate flow table entry actions")
    Signed-off-by: Roi Dayan
    Reviewed-by: Paul Blakey
    Link: https://lore.kernel.org/r/20201019090244.3015186-1-roid@nvidia.com
    Signed-off-by: Jakub Kicinski

    Roi Dayan
     
  • Pull NFS client updates from Anna Schumaker:
    "Stable Fixes:
    - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
    - Fix nfs_path in case of a rename retry
    - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag

    New features and improvements:
    - Replace dprintk() calls with tracepoints
    - Make cache consistency bitmap dynamic
    - Added support for the NFS v4.2 READ_PLUS operation
    - Improvements to net namespace uniquifier

    Other bugfixes and cleanups:
    - Remove redundant clnt pointer
    - Don't update timeout values on connection resets
    - Remove redundant tracepoints
    - Various cleanups to comments
    - Fix oops when trying to use copy_file_range with v4.0 source server
    - Improvements to flexfiles mirrors
    - Add missing 'local_lock=posix' mount option"

    * tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits)
    NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
    NFSv4: Fix up RCU annotations for struct nfs_netns_client
    NFS: Only reference user namespace from nfs4idmap struct instead of cred
    nfs: add missing "posix" local_lock constant table definition
    NFSv4: Use the net namespace uniquifier if it is set
    NFSv4: Clean up initialisation of uniquified client id strings
    NFS: Decode a full READ_PLUS reply
    SUNRPC: Add an xdr_align_data() function
    NFS: Add READ_PLUS hole segment decoding
    SUNRPC: Add the ability to expand holes in data pages
    SUNRPC: Split out _shift_data_right_tail()
    SUNRPC: Split out xdr_realign_pages() from xdr_align_pages()
    NFS: Add READ_PLUS data segment support
    NFS: Use xdr_page_pos() in NFSv4 decode_getacl()
    SUNRPC: Implement a xdr_page_pos() function
    SUNRPC: Split out a function for setting current page
    NFS: fix nfs_path in case of a rename retry
    fs: nfs: return per memcg count for xattr shrinkers
    NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE
    nfs: remove incorrect fallthrough label
    ...

    Linus Torvalds
     
  • When the passed token is longer than 4032 bytes, the remaining part
    of the token must be copied from the rqstp->rq_arg.pages. But the
    copy must make sure it happens in a consecutive way.

    With the existing code, the first memcpy copies 'length' bytes from
    argv->iobase, but since the header is in front, this never fills the
    whole first page of in_token->pages.

    The memcpy in the loop copies the following bytes, but starts writing at
    the next page of in_token->pages. This leaves the last bytes of page 0
    unwritten.
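
    Copying "in a consecutive way" can be sketched as follows.
    copy_to_pages() is a hypothetical user-space helper, not the SUNRPC
    code: it writes into an array of equally sized pages starting at a
    running byte offset, so a second copy continues inside page 0 instead
    of starting at the next page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes from src into pages[], starting at byte position
 * offset counted from the start of pages[0]. The caller must provide
 * enough pages to hold offset + len bytes.
 */
static void copy_to_pages(uint8_t **pages, size_t page_size, size_t offset,
			  const uint8_t *src, size_t len)
{
	assert(page_size > 0);

	while (len > 0) {
		size_t pg = offset / page_size;		/* destination page */
		size_t off = offset % page_size;	/* offset inside it */
		size_t chunk = page_size - off;		/* room left in page */

		if (chunk > len)
			chunk = len;
		memcpy(pages[pg] + off, src, chunk);
		src += chunk;
		offset += chunk;
		len -= chunk;
	}
}
```

    Passing the number of bytes already written as the offset of the
    second call is what keeps the tail of page 0 from being left
    unwritten.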

    Symptoms were that users with many groups were not able to access NFS
    exports, when using Active Directory as the KDC.

    Signed-off-by: Martijn de Gouw
    Fixes: 5866efa8cbfb "SUNRPC: Fix svcauth_gss_proxy_init()"
    Signed-off-by: J. Bruce Fields

    Martijn de Gouw
     
  • It's possible that, when using AUTH_SYS with the mountd manage-gids
    option, a user may hit the 8k RPC channel buffer limit. This has been
    observed in the field, causing unanswered RPCs on clients after mountd
    fails to write to the channel:

    rpc.mountd[11231]: auth_unix_gid: error writing reply

    Userland nfs-utils uses a buffer size of 32k (RPC_CHAN_BUF_SIZE), so
    let's match those two.

    Signed-off-by: Roberto Bergantinos Corpas
    Signed-off-by: J. Bruce Fields

    Roberto Bergantinos Corpas
     

20 Oct, 2020

7 commits

  • This patch fixes the issue due to:

    BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
    net/netfilter/nf_tables_offload.c:40
    Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244

    The error happens when expr->ops is accessed early on, before the
    boundary check is performed and after nft_expr_next() has moved expr
    out of bounds.

    This patch checks the boundary condition before accessing expr->ops,
    which fixes the slab-out-of-bounds read.

    Add nft_expr_more() and use it to fix this problem.
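
    The ordering of the two checks can be sketched in user-space C. struct
    expr, expr_more() and count_exprs() are hypothetical simplifications
    (fixed-size records, an integer in place of the ops pointer); they only
    illustrate testing the boundary before touching the element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record; ops == 0 models a missing ops pointer. */
struct expr {
	int ops;
};

/*
 * True if cur is a valid expression to visit. The boundary comparison
 * comes first, so cur->ops is never read once cur has reached end.
 */
static bool expr_more(const struct expr *cur, const struct expr *end)
{
	return cur != end && cur->ops != 0;
}

/* Walk the expressions in [first, end) and count the usable ones. */
static size_t count_exprs(const struct expr *first, const struct expr *end)
{
	size_t n = 0;

	assert(first != NULL && end != NULL);
	for (const struct expr *cur = first; expr_more(cur, end); cur++)
		n++;
	return n;
}
```

    Swapping the two operands of && in expr_more() would reproduce the
    shape of the bug: the element one past the end would be read before
    the boundary test could stop the loop.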

    Signed-off-by: Saeed Mirzamohammadi
    Signed-off-by: Pablo Neira Ayuso

    Saeed Mirzamohammadi
     
  • Fixes an error causing small packets to get dropped. skb_ensure_writable
    expects the second parameter to be a length in the ethernet payload.
    If we want to write the ethernet header (src, dst), we should pass 0.
    Otherwise, packets with small payloads (< ETH_ALEN) will get dropped.

    Fixes: c1a831167901 ("netfilter: bridge: convert skb_make_writable to skb_ensure_writable")
    Signed-off-by: Timothée COCAULT
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Timothée COCAULT
     
  • Fragmented ndisc packets assembled in netfilter are not dropped as specified
    in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
    Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

    Fix this by setting the IP6SKB_FRAGMENTED flag during reassembly.

    References: commit b800c3b966bc ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
    Signed-off-by: Georg Kohmann
    Signed-off-by: Pablo Neira Ayuso

    Georg Kohmann
     
  • If the first packet conntrack sees after a re-register is an outgoing
    keepalive packet with no data (SEG.SEQ = SND.NXT-1), td_end is set to
    SND.NXT-1.
    When the peer correctly acknowledges SND.NXT, tcp_in_window fails
    check III (Upper bound for valid (s)ack: sack _nfct = 0 and in later conntrack iptables rules not matching.
    In cases where iptables is dropping packets that do not match
    conntrack rules this can cause idle TCP connections to time out.

    v2: adjust td_end when getting the reply rather than when sending out
    the keepalive packet.

    Fixes: f94e63801ab2 ("netfilter: conntrack: reset tcp maxwin on re-register")
    Signed-off-by: Francesco Ruggeri
    Signed-off-by: Pablo Neira Ayuso

    Francesco Ruggeri
     
  • Output the client, virtual and destination addresses when the TCP
    state changes, which makes connection debugging clearer.

    Signed-off-by: longguang.yue
    Acked-by: Julian Anastasov
    Signed-off-by: Pablo Neira Ayuso

    longguang.yue
     
  • While insertion of 16k nexthops all using the same netdev ('dummy10')
    takes less than a second, deletion takes about 130 seconds:

    # time -p ip -b nexthop.batch
    real 0.29
    user 0.01
    sys 0.15

    # time -p ip link set dev dummy10 down
    real 131.03
    user 0.06
    sys 0.52

    This is because of repeated calls to synchronize_rcu() whenever a
    nexthop is removed from a nexthop group:

    # /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
    ...
    b'finish_task_switch'
    b'schedule'
    b'schedule_timeout'
    b'wait_for_completion'
    b'__wait_rcu_gp'
    b'synchronize_rcu.part.0'
    b'synchronize_rcu'
    b'__remove_nexthop'
    b'remove_nexthop'
    b'nexthop_flush_dev'
    b'nh_netdev_event'
    b'raw_notifier_call_chain'
    b'call_netdevice_notifiers_info'
    b'__dev_notify_flags'
    b'dev_change_flags'
    b'do_setlink'
    b'__rtnl_newlink'
    b'rtnl_newlink'
    b'rtnetlink_rcv_msg'
    b'netlink_rcv_skb'
    b'rtnetlink_rcv'
    b'netlink_unicast'
    b'netlink_sendmsg'
    b'____sys_sendmsg'
    b'___sys_sendmsg'
    b'__sys_sendmsg'
    b'__x64_sys_sendmsg'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    - ip (277)
    126554955

    Since nexthops are always deleted under RTNL, synchronize_net() can be
    used instead. It will call synchronize_rcu_expedited() which only blocks
    for several microseconds as opposed to multiple milliseconds like
    synchronize_rcu().

    With this patch deletion of 16k nexthops takes less than a second:

    # time -p ip link set dev dummy10 down
    real 0.12
    user 0.00
    sys 0.04

    Tested with fib_nexthops.sh which includes torture tests that prompted
    the initial change:

    # ./fib_nexthops.sh
    ...
    Tests passed: 134
    Tests failed: 0

    Fixes: 90f33bffa382 ("nexthops: don't modify published nexthop groups")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jesse Brandeburg
    Reviewed-by: David Ahern
    Acked-by: Nikolay Aleksandrov
    Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
    Signed-off-by: Jakub Kicinski

    Ido Schimmel
     
  • The Marvell 88E6060 uses tag_trailer.c and the KSZ8795, KSZ9477 and
    KSZ9893 switches also use tail tags.

    Fixes: 7a6ffe764be3 ("net: dsa: point out the tail taggers")
    Signed-off-by: Christian Eggers
    Reviewed-by: Florian Fainelli
    Link: https://lore.kernel.org/r/20201016171603.10587-1-ceggers@arri.de
    Signed-off-by: Jakub Kicinski

    Christian Eggers
     

19 Oct, 2020

2 commits

  • dev->unlink_list is reused unless dev is deleted.
    So, list_del() should not be used.
    Due to using list_del(), dev->unlink_list can't be reused so that
    dev->nested_level update logic doesn't work.
    In order to fix this bug, list_del_init() should be used instead
    of list_del().

    Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond0 master bond1
    ip link set bond0 nomaster
    ip link set bond1 master bond0
    ip link set bond1 nomaster

    Splat looks like:
    [ 255.750458][ T1030] ============================================
    [ 255.751967][ T1030] WARNING: possible recursive locking detected
    [ 255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted
    [ 255.754553][ T1030] --------------------------------------------
    [ 255.756047][ T1030] ip/1030 is trying to acquire lock:
    [ 255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150
    [ 255.760056][ T1030]
    [ 255.760056][ T1030] but task is already holding lock:
    [ 255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
    [ 255.764581][ T1030]
    [ 255.764581][ T1030] other info that might help us debug this:
    [ 255.766645][ T1030] Possible unsafe locking scenario:
    [ 255.766645][ T1030]
    [ 255.768566][ T1030] CPU0
    [ 255.769415][ T1030] ----
    [ 255.770259][ T1030] lock(&dev_addr_list_lock_key/1);
    [ 255.771629][ T1030] lock(&dev_addr_list_lock_key/1);
    [ 255.772994][ T1030]
    [ 255.772994][ T1030] *** DEADLOCK ***
    [ 255.772994][ T1030]
    [ 255.775091][ T1030] May be due to missing lock nesting notation
    [ 255.775091][ T1030]
    [ 255.777182][ T1030] 2 locks held by ip/1030:
    [ 255.778299][ T1030] #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0
    [ 255.780600][ T1030] #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
    [ 255.783411][ T1030]
    [ 255.783411][ T1030] stack backtrace:
    [ 255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772
    [ 255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    [ 255.789030][ T1030] Call Trace:
    [ 255.789850][ T1030] dump_stack+0x99/0xd0
    [ 255.790882][ T1030] __lock_acquire.cold.71+0x166/0x3cc
    [ 255.792285][ T1030] ? register_lock_class+0x1a30/0x1a30
    [ 255.793619][ T1030] ? rcu_read_lock_sched_held+0x91/0xc0
    [ 255.794963][ T1030] ? rcu_read_lock_bh_held+0xa0/0xa0
    [ 255.796246][ T1030] lock_acquire+0x1b8/0x850
    [ 255.797332][ T1030] ? dev_mc_sync_multiple+0xc2/0x150
    [ 255.798624][ T1030] ? bond_enslave+0x3d4d/0x43e0 [bonding]
    [ 255.800039][ T1030] ? check_flags+0x50/0x50
    [ 255.801143][ T1030] ? lock_contended+0xd80/0xd80
    [ 255.802341][ T1030] _raw_spin_lock_nested+0x2e/0x70
    [ 255.803592][ T1030] ? dev_mc_sync_multiple+0xc2/0x150
    [ 255.804897][ T1030] dev_mc_sync_multiple+0xc2/0x150
    [ 255.806168][ T1030] bond_enslave+0x3d58/0x43e0 [bonding]
    [ 255.807542][ T1030] ? __lock_acquire+0xe53/0x51b0
    [ 255.808824][ T1030] ? bond_update_slave_arr+0xdc0/0xdc0 [bonding]
    [ 255.810451][ T1030] ? check_chain_key+0x236/0x5e0
    [ 255.811742][ T1030] ? mutex_is_locked+0x13/0x50
    [ 255.812910][ T1030] ? rtnl_is_locked+0x11/0x20
    [ 255.814061][ T1030] ? netdev_master_upper_dev_get+0xf/0x120
    [ 255.815553][ T1030] do_setlink+0x94c/0x3040
    [ ... ]

    Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com
    Fixes: 1fc70edb7d7b ("net: core: add nested_level variable in net_device")
    Signed-off-by: Taehee Yoo
    Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.com
    Signed-off-by: Jakub Kicinski

    Taehee Yoo
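
The difference between the two removal styles discussed above can be shown with a user-space sketch of a circular doubly linked list. This is an illustration only, not the kernel's list.h: `struct list_node` and the `node_*` helpers are hypothetical names, and `node_del()` clears the pointers to NULL where the kernel's list_del() stores poison values.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_node {
    struct list_node *next, *prev;
};

/* A node that points at itself is "unlinked" and safe to queue again. */
static void node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

static bool node_unlinked(const struct list_node *n)
{
    return n->next == n;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Removal in the style of list_del(): the node is left in an invalid
 * state, so a later "is this node queued?" test gives the wrong answer. */
static void node_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = NULL;
    n->prev = NULL;
}

/* Removal in the style of list_del_init(): the node is re-initialised,
 * so it reports itself as unlinked and can be reused. */
static void node_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}
```

After `node_del_init()` the node passes the `node_unlinked()` test and can be added to a list again. After `node_del()` the same test fails, so code that only queues unlinked nodes would skip it from then on.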
     
  • The flow_lookup() function uses per-CPU variables, which must be
    accessed with BH disabled. This is fine in the general NAPI use case,
    where the local BH is already disabled, but the function is also called
    from the netlink context. This patch makes sure that BH is disabled in
    the netlink path as well.

    In addition, u64_stats_update_begin() requires a lock to ensure a single
    writer, which is not ensured here. Making the stats per-CPU and disabling
    NAPI (softirq) ensures that there is always only one writer.

    Fixes: eac87c413bf9 ("net: openvswitch: reorder masks array based on usage")
    Reported-by: Juri Lelli
    Signed-off-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/160295903253.7789.826736662555102345.stgit@ebuild
    Signed-off-by: Jakub Kicinski

    Eelco Chaudron
     

17 Oct, 2020

4 commits

  • Keyu Man reported that the ICMP rate limiter could be used
    by attackers to get useful signal. Details will be provided
    in an upcoming academic publication.

    Our solution is to add some noise, so that the attackers
    no longer can get help from the predictable token bucket limiter.

    Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
    Signed-off-by: Eric Dumazet
    Reported-by: Keyu Man
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
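
The idea of adding noise to a token bucket can be sketched in user space as follows. This is a rough illustration under stated assumptions, not the kernel's ICMP limiter: `struct bucket`, `noise()`, `bucket_refill()` and `bucket_allow()` are hypothetical names, and the choice of a random cost between 0 and 2 credits (one on average) is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical token bucket: credit never exceeds burst. */
struct bucket {
    uint32_t credit;
    uint32_t burst;
};

/* Assumed noise source for the sketch: a value in [0, bound).
 * rand() is a placeholder, not a secure generator. */
static uint32_t noise(uint32_t bound)
{
    return (uint32_t)rand() % bound;
}

/* Add tokens, saturating at the burst size. */
static void bucket_refill(struct bucket *b, uint32_t tokens)
{
    uint32_t room = b->burst - b->credit;

    b->credit += tokens < room ? tokens : room;
}

/* Allow a message while any credit remains, but charge a randomised
 * cost (0, 1 or 2) so the remaining credit is harder for an observer
 * to infer from how many messages were allowed. */
static bool bucket_allow(struct bucket *b)
{
    uint32_t cost;

    if (b->credit == 0)
        return false;

    cost = noise(3);
    b->credit = b->credit > cost ? b->credit - cost : 0;
    return true;
}
```

With a fixed cost of one credit per message, the number of replies in a burst reveals the bucket state exactly. With a randomised cost the number of allowed messages per burst varies from run to run, while the average rate stays roughly the same.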
     
  • In commit 16ad3f4022bb
    ("tipc: introduce variable window congestion control"), we applied
    the algorithm to select the window size between the minimum window and
    the configured maximum window for unicast links. We also chose to keep
    the window size for the broadcast link unchanged and fixed (i.e. a
    fixed window of 50).

    However, when the maximum window variable was set via command, the
    window variable was re-initialized to an unexpected value (i.e. 32).

    We fix this by updating the fixed window for broadcast as stated above.

    Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Huu Le
    Signed-off-by: Jakub Kicinski

    Hoang Huu Le
     
  • The queue limit of the broadcast link is calculated based on the
    initial MTU. However, when the MTU value changes (e.g. manually changing
    the MTU on the NIC device, MTU negotiation, etc.), we do not re-calculate
    the queue limit. As a result, throughput does not reflect the change.

    So fix it by calling the function to re-calculate queue limit of the
    broadcast link.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Huu Le
    Signed-off-by: Jakub Kicinski

    Hoang Huu Le
     
  • This was discovered using O_DIRECT at the client side, with small
    unaligned file offsets or IOs that span multiple file pages.

    Fixes: e248aa7be86 ("svcrdma: Remove max_sge check at connect time")
    Signed-off-by: Dan Aloni
    Signed-off-by: J. Bruce Fields

    Dan Aloni