12 Dec, 2020

7 commits

  • We hit this issue in our internal test. When generic kasan is enabled, a
    kfree()'d object is first put into the per-cpu quarantine. If the cpu goes
    offline, the object still remains in the per-cpu quarantine. If we then call
    kmem_cache_destroy(), slub reports an "Objects remaining" error.

    =============================================================================
    BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
    CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2b0
    show_stack+0x18/0x68
    dump_stack+0xfc/0x168
    slab_err+0xac/0xd4
    __kmem_cache_shutdown+0x1e4/0x3c8
    kmem_cache_destroy+0x68/0x130
    test_version_show+0x84/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    do_el0_svc+0x38/0xa0
    el0_sync_handler+0x170/0x178
    el0_sync+0x174/0x180
    INFO: Object 0x(____ptrval____) @offset=15848
    INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
    stack_trace_save+0x9c/0xd0
    set_track+0x64/0xf0
    alloc_debug_processing+0x104/0x1a0
    ___slab_alloc+0x628/0x648
    __slab_alloc.isra.0+0x2c/0x58
    kmem_cache_alloc+0x560/0x588
    test_version_show+0x98/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    kmem_cache_destroy test_module_slab: Slab cache still has objects

    Register a cpu hotplug function to remove all objects in the offline
    per-cpu quarantine when the cpu is going offline. Set a per-cpu variable to
    indicate that this cpu is offline.

    [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
    Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

    Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee
    Signed-off-by: Zqiang
    Suggested-by: Dmitry Vyukov
    Reported-by: Guangye Yang
    Reviewed-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Matthias Brugger
    Cc: Nicholas Tang
    Cc: Miles Chen
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuan-Ying Lee
     
  • kernel/elfcore.c only contains weak symbols, which triggers a bug with
    clang in combination with recordmcount:

    Cannot find symbol for section 2: .text.
    kernel/elfcore.o: failed

    Move the empty stubs into linux/elfcore.h as inline functions. As only
    two architectures use these, just use the architecture specific Kconfig
    symbols to key off the declaration.

    Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Cc: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Barret Rhoden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • There is only one function in init/initramfs.c that is in the .text
    section, and it is marked __weak. When building with clang-12 and the
    integrated assembler, this leads to a bug with recordmcount:

    ./scripts/recordmcount "init/initramfs.o"
    Cannot find symbol for section 2: .text.
    init/initramfs.o: failed

    I'm not quite sure what exactly goes wrong, but I notice that this
    function is only ever called from an __init function, and normally
    inlined. Marking it __init as well is clearly correct and it leads to
    recordmcount no longer complaining.

    Link: https://lkml.kernel.org/r/20201204165742.3815221-1-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Cc: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Barret Rhoden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • genksyms does not know or care about the _Static_assert() built-in, and
    sometimes falls back to ignoring the later symbols, which causes
    undefined behavior such as

    WARNING: modpost: EXPORT symbol "ethtool_set_ethtool_phy_ops" [vmlinux] version generation failed, symbol will not be versioned.
    ld: net/ethtool/common.o: relocation R_AARCH64_ABS32 against `__crc_ethtool_set_ethtool_phy_ops' can not be used when making a shared object
    net/ethtool/common.o:(_ftrace_annotated_branch+0x0): dangerous relocation: unsupported relocation

    Redefine static_assert for genksyms to avoid that.

    Link: https://lkml.kernel.org/r/20201203230955.1482058-1-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Suggested-by: Ard Biesheuvel
    Cc: Masahiro Yamada
    Cc: Michal Marek
    Cc: Kees Cook
    Cc: Rikard Falkeborn
    Cc: Marco Elver
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • With extra warnings enabled, clang complains about the redundant
    -mhard-float argument:

    clang: error: argument unused during compilation: '-mhard-float' [-Werror,-Wunused-command-line-argument]

    Move this into the gcc-only part of the Makefile.

    Link: https://lkml.kernel.org/r/20201203223652.1320700-1-arnd@kernel.org
    Fixes: 4185b3b92792 ("selftests/fpu: Add an FPU selftest")
    Signed-off-by: Arnd Bergmann
    Cc: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Petteri Aimonen
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • When we try to visit the pagemap of a tagged userspace pointer, we find
    that the start_vaddr is not correct because of the tag.
    To fix it, we should untag the userspace pointers in pagemap_read().

    I tested with 5.10-rc4 and the issue remains.

    Explanation from Catalin in [1]:

    "Arguably, that's a user-space bug since tagged file offsets were never
    supported. In this case it's not even a tag at bit 56 as per the arm64
    tagged address ABI but rather down to bit 47. You could say that the
    problem is caused by the C library (malloc()) or whoever created the
    tagged vaddr and passed it to this function. It's not a kernel
    regression as we've never supported it.

    Now, pagemap is a special case where the offset is usually not
    generated as a classic file offset but rather derived by shifting a
    user virtual address. I guess we can make a concession for pagemap
    (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

    My test code is based on [2]:

    A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

    userspace program:

    uint64 OsLayer::VirtualToPhysical(void *vaddr) {
        uint64 frame, paddr, pfnmask, pagemask;
        int pagesize = sysconf(_SC_PAGESIZE);
        // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
        off64_t off = ((uintptr_t)vaddr) / pagesize * 8;
        int fd = open(kPagemapPath, O_RDONLY);
        ...

        if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
            int err = errno;
            string errtxt = ErrorString(err);
            if (fd >= 0)
                close(fd);
            return 0;
        }
        ...
    }

    kernel fs/proc/task_mmu.c:

    static ssize_t pagemap_read(struct file *file, char __user *buf,
                                size_t count, loff_t *ppos)
    {
        ...
        src = *ppos;
        svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
        start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
        end_vaddr = mm->task_size;

        /* watch out for wraparound */
        // svpfn == 0xb400007662f54
        // (mm->task_size >> PAGE_SHIFT) == 0x8000000
        if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
            start_vaddr = end_vaddr;

        ret = 0;
        while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
            int len;
            unsigned long end;
            ...
        }
        ...
    }
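
    The arithmetic in the two excerpts above can be replayed in user space.
    The following is a minimal sketch, not the kernel patch itself: it
    assumes 4 KiB pages, and untag_addr() is a hypothetical stand-in that
    mimics arm64-style untagging by sign-extending from bit 55.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12  /* assumption: 4 KiB pages */
#define PM_ENTRY_BYTES 8   /* one 64-bit pagemap entry per page */

/* Hypothetical stand-in for untagging: sign-extend the address from bit 55. */
static uint64_t untag_addr(uint64_t addr)
{
    const uint64_t top_byte = 0xffULL << 56;

    if (addr & (1ULL << 55))
        return addr | top_byte;   /* kernel-half address: fill top byte with 1s */
    return addr & ~top_byte;      /* user-half address: clear the tag byte */
}

/* File offset that a user-space caller derives from a virtual address. */
static uint64_t pagemap_offset(uint64_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) * PM_ENTRY_BYTES;
}

/* Start address recovered from the file offset, with the tag removed. */
static uint64_t pagemap_start_vaddr(uint64_t offset)
{
    uint64_t svpfn = offset / PM_ENTRY_BYTES;

    return untag_addr(svpfn << PAGE_SHIFT);
}

/* Replays the numbers quoted in the commit message. */
void pagemap_demo(void)
{
    uint64_t off = pagemap_offset(0xb400007662f541c8ULL);

    assert(off == 0x5a00003b317aa0ULL);
    /* Without untagging, off / 8 << 12 would be 0xb400007662f54000. */
    assert(pagemap_start_vaddr(off) == 0x0000007662f54000ULL);
}
```

    With the tag removed, the page frame number (0x7662f54) is below the
    0x8000000 limit quoted above, so the wraparound check no longer trips.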

    [1] https://lore.kernel.org/patchwork/patch/1343258/
    [2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

    Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Cc: Alexey Dobriyan
    Cc: Andrey Konovalov
    Cc: Alexander Potapenko
    Cc: Vincenzo Frascino
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Marco Elver
    Cc: Will Deacon
    Cc: Eric W. Biederman
    Cc: Song Bao Hua (Barry Song)
    Cc: [5.4-]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • Revert commit 3351b16af494 ("mm/filemap: add static for function
    __add_to_page_cache_locked") due to incompatibility with
    ALLOW_ERROR_INJECTION, which results in build errors.

    Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
    Tested-by: Justin Forbes
    Tested-by: Greg Thelen
    Acked-by: Alexei Starovoitov
    Cc: Michal Kubecek
    Cc: Alex Shi
    Cc: Souptick Joarder
    Cc: Daniel Borkmann
    Cc: Josef Bacik
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Dec, 2020

16 commits

  • Pull ktest fix from Steven Rostedt:
    "Fix issues with grub2bls in ktest.pl

    ktest.pl did not know about grub2bls that was introduced in Fedora 30,
    and now it does"

    * tag 'ktest-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
    ktest.pl: Fix incorrect reboot for grub2bls

    Linus Torvalds
     
  • Pull powerpc fix from Michael Ellerman:
    "One commit to implement copy_from_kernel_nofault_allowed(), otherwise
    copy_from_kernel_nofault() can trigger warnings when accessing bad
    addresses in some configurations.

    Thanks to Christophe Leroy and Qian Cai"

    * tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/mm: Fix KUAP warning by providing copy_from_kernel_nofault_allowed()

    Linus Torvalds
     
  • Pull namespaced fscaps fix from James Morris:
    "Fix namespaced fscaps when !CONFIG_SECURITY (Serge Hallyn)"

    * tag 'fixes-v5.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    [SECURITY] fix namespaced fscaps when !CONFIG_SECURITY

    Linus Torvalds
     
  • Pull NFS client fixes from Anna Schumaker:
    "Here are a handful more bugfixes for 5.10.

    Unfortunately, we found some problems with the new READ_PLUS operation
    that aren't easy to fix. We've decided to disable this codepath
    through a Kconfig option for now, but a series of patches going into
    5.11 will clean up the code and fix the issues at the same time. This
    seemed like the best way to go about it.

    Summary:

    - Fix array overflow when flexfiles mirroring is enabled

    - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS

    - Fix 5 second delay when doing inter-server copy

    - Disable READ_PLUS by default"

    * tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    NFS: Disable READ_PLUS by default
    NFSv4.2: Fix 5 seconds delay when doing inter server copy
    NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
    pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) IPsec compat fixes, from Dmitry Safonov.

    2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.

    3) Fix polling in xsk sockets by using sk_poll_wait() instead of
    datagram_poll() which keys off of sk_wmem_alloc and such which xsk
    sockets do not update. From Xuan Zhuo.

    4) Missing init of rekey_data in cfg80211, from Sara Sharon.

    5) Fix destroy of timer before init, from Davide Caratti.

    6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
    Bergmann.

    7) Missing error return in rtm_to_fib_config() switch case, from Zhang
    Changzhong.

    8) Fix some src/dest address handling in vrf and add a testcase. From
    Stephen Suryaputra.

    9) Fix multicast handling in Seville switches driven by mscc-ocelot
    driver. From Vladimir Oltean.

    10) Fix proto value passed to skb delivery demux in udp, from Xin Long.

    11) HW pkt counters not reported correctly in enetc driver, from Claudiu
    Manoil.

    12) Fix deadlock in bridge, from Joseph Huang.

    13) Missing of_node_put() in dpaa2 driver, from Christophe JAILLET.

    14) Fix pid fetching in bpftool when there are a lot of results, from
    Andrii Nakryiko.

    15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.

    16) Various stmmac fixes, from Fugang Duan.

    17) Fix null deref in tipc, from Cengiz Can.

    18) When mss is big, choose more reasonable rcvq_space in tcp, from Eric
    Dumazet.

    19) Revert a geneve change that likely isn't necessary, from Jakub
    Kicinski.

    20) Avoid premature rx buffer reuse in various Intel drivers, from Björn
    Töpel.

    21) Retain ECT bits during TOS reflection in tcp, from Wei Wang.

    22) Fix TSO deferral wrt. cwnd limiting in tcp, from Neal Cardwell.

    23) MPLS_OPT_LSE_LABEL attribute is 32 not 8 bits, from Guillaume Nault.

    24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
    cases, from Alexei Starovoitov.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
    selftests: fix poll error in udpgro.sh
    selftests/bpf: Fix "dubious pointer arithmetic" test
    selftests/bpf: Fix array access with signed variable test
    selftests/bpf: Add test for signed 32-bit bound check bug
    bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
    MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
    net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
    net/mlx4_en: Handle TX error CQE
    net/mlx4_en: Avoid scheduling restart task if it is already running
    tcp: fix cwnd-limited bug for TSO deferral where we send nothing
    net: flow_offload: Fix memory leak for indirect flow block
    tcp: Retain ECT bits for tos reflection
    ethtool: fix stack overflow in ethnl_parse_bitset()
    e1000e: fix S0ix flow to allow S0i3.2 subset entry
    ice: avoid premature Rx buffer reuse
    ixgbe: avoid premature Rx buffer reuse
    i40e: avoid premature Rx buffer reuse
    igb: avoid transmit queue timeout in xdp path
    igb: use xdp_do_flush
    igb: skb add metasize for xdp
    ...

    Linus Torvalds
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-12-10

    The following pull-request contains BPF updates for your *net* tree.

    We've added 21 non-merge commits during the last 12 day(s) which contain
    a total of 21 files changed, 163 insertions(+), 88 deletions(-).

    The main changes are:

    1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

    2) Fix ring_buffer__poll() return value, from Andrii.

    3) Fix race in lwt_bpf, from Cong.

    4) Fix test_offload, from Toke.

    5) Various xsk fixes.

    Please consider pulling these changes from:

    git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

    Thanks a lot!

    Also thanks to reporters, reviewers and testers of commits in this pull-request:

    Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
    Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We've been seeing failures with xfstests generic/091 and generic/263
    when using READ_PLUS. I've made some progress on these issues, and the
    tests fail later on but still don't pass. Let's disable READ_PLUS by
    default until we can work out what is going on.

    Signed-off-by: Anna Schumaker

    Anna Schumaker
     
  • Since commit b4868b44c5628 ("NFSv4: Wait for stateid updates after
    CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers a 5
    second delay regardless of the size of the copy. The delay is from
    nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
    fails because the seqid in both nfs4_state and nfs4_stateid are 0.

    Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
    until after the call to update_open_stateid, to indicate this is the 1st
    open. This fix is part of a 2-patch series; the other patch is the fix in the
    source server to return the stateid for COPY_NOTIFY request with seqid 1
    instead of 0.

    Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy")
    Signed-off-by: Dai Ngo
    Signed-off-by: Anna Schumaker

    Dai Ngo
     
  • By switching to an XFS-backed export, I am able to reproduce the
    ibcomp worker crash on my client with xfstests generic/013.

    For the failing LISTXATTRS operation, xdr_inline_pages() is called
    with page_len=12 and buflen=128.

    - When ->send_request() is called, rpcrdma_marshal_req() does not
    set up a Reply chunk because buflen is smaller than the inline
    threshold. Thus rpcrdma_convert_iovs() does not get invoked at
    all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
    on the receive buffer.

    - During reply processing, rpcrdma_inline_fixup() tries to copy
    received data into rq_rcv_buf->pages because page_len is positive.
    But there are no receive pages because rpcrdma_marshal_req() never
    allocated them.

    The result is that the ibcomp worker faults and dies. Sometimes that
    causes a visible crash, and sometimes it results in a transport hang
    without other symptoms.

    RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
    should eventually be fixed or replaced. However, my preference is
    that upper-layer operations should explicitly allocate their receive
    buffers (using GFP_KERNEL) when possible, rather than relying on
    XDRBUF_SPARSE_PAGES.

    Reported-by: Olga kornievskaia
    Suggested-by: Olga kornievskaia
    Fixes: c10a75145feb ("NFSv4.2: add the extended attribute proc functions.")
    Signed-off-by: Chuck Lever
    Reviewed-by: Olga kornievskaia
    Reviewed-by: Frank van der Linden
    Tested-by: Olga kornievskaia
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The test program udpgso_bench_rx always invokes the poll()
    syscall with a timeout of 10ms. If a larger timeout is specified
    via the command line, udpgso_bench_rx is supposed to do multiple
    poll() calls till the timeout is expired or an event is received.

    Currently the poll() loop errors out after the first invocation with
    no events, and may cause self-test failures like:

    failed
    GRO with custom segment size ./udpgso_bench_rx: poll: 0x0 expected 0x1

    This change addresses the issue allowing the poll() loop to consume
    all the configured timeout.
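
    The intended loop behavior can be sketched as follows. This is an
    illustrative user-space helper, not the code from udpgso_bench_rx:
    poll_until() and POLL_SLICE_MS are hypothetical names, and elapsed time
    is approximated by counting 10 ms slices rather than reading a clock.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define POLL_SLICE_MS 10  /* per-call timeout, matching the 10ms described above */

/*
 * Poll fd for input in 10 ms slices until an event arrives or roughly
 * timeout_ms has been consumed. Returns >0 on an event, 0 on timeout,
 * -1 on a poll() error.
 */
static int poll_until(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int waited = 0;

    assert(timeout_ms >= 0);

    for (;;) {
        int ret = poll(&pfd, 1, POLL_SLICE_MS);

        if (ret != 0)
            return ret;          /* event or error: stop immediately */

        /* No events in this slice: keep going instead of failing. */
        waited += POLL_SLICE_MS;
        if (waited >= timeout_ms)
            return 0;            /* whole budget consumed */
    }
}
```

    The key point is that a zero return from a single poll() call is treated
    as "keep waiting" until the configured budget has been used up.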

    Fixes: ada641ff6ed3 ("selftests: fixes for UDP GRO")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • The verifier trace changed following a bugfix. After checking the 64-bit
    sign, only the upper bit mask is known, not bit 31. Update the test
    accordingly.

    Signed-off-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Jean-Philippe Brucker
     
  • The test fails because of a recent fix to the verifier, even though this
    program is valid. In detail, what happens is:

    7: (61) r1 = *(u32 *)(r0 +0)

    Load a 32-bit value, with signed bounds [S32_MIN, S32_MAX]. The bounds
    of the 64-bit value are [0, U32_MAX]...

    8: (65) if r1 s> 0xffffffff goto pc+1

    ... therefore this is always true (the operand is sign-extended).

    10: (b4) w2 = 11
    11: (6d) if r2 s> r1 goto pc+1

    When true, the 64-bit bounds become [0, 10]. The 32-bit bounds are still
    [S32_MIN, 10].

    13: (64) w1 <
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Jean-Philippe Brucker
     
  • After a 32-bit load followed by a branch, the verifier would reduce the
    maximum bound of the register to 0x7fffffff, allowing a user to bypass
    bound checks. Ensure such a program is rejected.

    In the second test, the 64-bit compare should not be sufficient to
    determine whether the signed 32-bit lower bound is 0, so the verifier
    should reject the second branch.

    Signed-off-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Jean-Philippe Brucker
     
  • The 64-bit signed bounds should not affect 32-bit signed bounds unless the
    verifier knows that upper 32-bits are either all 1s or all 0s. For example the
    register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
    since smax_value could be larger than 32-bit subregister can hold.
    The verifier refines the smax/s32_max return value from certain helpers in
    do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
    value is also bounded. When both smin and smax bounds fit into 32-bit
    subregister the verifier can propagate those bounds.
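
    The propagation rule described above can be sketched in plain C. This is
    a simplified model, not the verifier's code: struct reg_bounds and
    propagate_s32_bounds() are hypothetical, and resetting the 32-bit bounds
    to "unknown" in the else branch is a conservative choice made for this
    sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of a register's tracked signed bounds. */
struct reg_bounds {
    int64_t smin_value, smax_value;        /* 64-bit signed bounds */
    int32_t s32_min_value, s32_max_value;  /* bounds of the low 32 bits */
};

static bool fits_s32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Derive the 32-bit signed bounds from the 64-bit ones only when both fit. */
static void propagate_s32_bounds(struct reg_bounds *reg)
{
    assert(reg->smin_value <= reg->smax_value);

    if (fits_s32(reg->smin_value) && fits_s32(reg->smax_value)) {
        reg->s32_min_value = (int32_t)reg->smin_value;
        reg->s32_max_value = (int32_t)reg->smax_value;
    } else {
        /* The low 32 bits may hold anything, e.g. 0x100000000 has low word 0. */
        reg->s32_min_value = INT32_MIN;
        reg->s32_max_value = INT32_MAX;
    }
}
```

    For a register with smin_value == 1 and smax_value == 1 << 32, the sketch
    leaves the 32-bit bounds unconstrained, which matches the example in the
    text: the value 0x100000000 satisfies the 64-bit bounds while its low
    32 bits are 0.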

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     
  • Pull rdma fixes from Jason Gunthorpe:
    "Two user-triggerable crashers and some EFA-related regressions:

    - Syzkaller found a bug in CM

    - Restore access to the GID table and fix modify_qp for EFA

    - Crasher in qedr"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
    RDMA/core: Fix empty gid table for non IB/RoCE devices
    RDMA/efa: Use the correct current and new states in modify QP
    RDMA/qedr: iWARP invalid(zero) doorbell address fix

    Linus Torvalds
     
  • Pull media fixes from Mauro Carvalho Chehab:
    "A couple of fixes:

    - videobuf2: fix a DMABUF bug that prevented it from properly handling cache
    sync/flush

    - vidtv: a use after free and a few sparse/smatch warning fixes

    - pulse8-cec: a duplicate free and a bug related to new firmware
    usage

    - mtk-cir: fix a regression on a clock setting"

    * tag 'media/v5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    media: vidtv: fix some warnings
    media: vidtv: fix kernel-doc markups
    media: [next] media: vidtv: fix a read from an object after it has been freed
    media: vb2: set cache sync hints when init buffers
    media: pulse8-cec: add support for FW v10 and up
    media: pulse8-cec: fix duplicate free at disconnect or probe error
    media: mtk-cir: fix calculation of chk period

    Linus Torvalds
     

10 Dec, 2020

17 commits

  • Add maintainers info for new Marvell Prestera Ethernet switch driver.

    Signed-off-by: Mickey Rachamim
    Signed-off-by: David S. Miller

    Mickey Rachamim
     
  • TCA_FLOWER_KEY_MPLS_OPT_LSE_LABEL is a u32 attribute (MPLS label is
    20 bits long).

    Fixes the following bug:

    $ tc filter add dev ethX ingress protocol mpls_uc \
    flower mpls lse depth 2 label 256 \
    action drop

    $ tc filter show dev ethX ingress
    filter protocol mpls_uc pref 49152 flower chain 0
    filter protocol mpls_uc pref 49152 flower chain 0 handle 0x1
    eth_type 8847
    mpls
    lse depth 2 label 0
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
    from Subash Abhinov Kasiviswanathan.

    2) Fix netlink dump of dynset timeouts later than 23 days.

    3) Add comment for the indirect serialization of the nft commit mutex
    with rtnl_mutex.

    4) Remove bogus check for confirmed conntrack when matching on the
    conntrack ID, from Brett Mastbergen.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Tony Nguyen says:

    ====================
    Intel Wired LAN Driver Updates 2020-12-09

    This series contains updates to igb, ixgbe, i40e, and ice drivers.

    Sven Auhagen fixes issues with igb XDP: return correct error value in XDP
    xmit back, increase header padding to include space for double VLAN, add
    an extack error when Rx buffer is too small for frame size, set metasize if
    it is set in xdp, change xdp_do_flush_map to xdp_do_flush, and update
    trans_start to avoid possible Tx timeout.

    Björn fixes an issue where an Rx buffer can be reused prematurely with
    XDP redirect for ixgbe, i40e, and ice drivers.

    The following are changes since commit 323a391a220c4a234cb1e678689d7f4c3b73f863:
    can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
    and are available in the git repository at:
    git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 1GbE
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Tariq Toukan says:

    ====================
    mlx4_en fixes

    This patchset by Moshe contains fixes to the mlx4 Eth driver,
    addressing issues in restart flow.

    Patch 1 protects the restart task from being rescheduled while active.
    Please queue for -stable >= v2.6.
    Patch 2 reconstructs SQs stuck in error state, and adds prints for improved
    debuggability.
    Please queue for -stable >= v3.12.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If an error CQE is found while polling the TX CQ, the QP is in error
    state and all posted WQEs will generate error CQEs without any data
    being transmitted. Fix it by reopening the channels, via the same method
    used for TX timeout handling.

    In addition, add some more info on error CQE and WQE for debug.

    Fixes: bd2f631d7c60 ("net/mlx4_en: Notify user when TX ring in error state")
    Signed-off-by: Moshe Shemesh
    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Moshe Shemesh
     
  • Add restarting state flag to avoid scheduling another restart task while
    such task is already running. Change task name from watchdog_task to
    restart_task to better fit the task role.

    Fixes: 1e338db56e5a ("mlx4_en: Fix a race at restart task")
    Signed-off-by: Moshe Shemesh
    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Moshe Shemesh
     
  • When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
    into persistent scenarios where we have the following sequence:

    (1) ACK for full-sized skb of N*MSS arrives
    -> tcp_write_xmit() transmit full-sized skb with N*MSS
    -> move pacing release time forward
    -> exit tcp_write_xmit() because pacing time is in the future

    (2) TSQ callback or TCP internal pacing timer fires
    -> try to transmit next skb, but TSO deferral finds remainder of
    available cwnd is not big enough to trigger an immediate send
    now, so we defer sending until the next ACK.

    (3) repeat...

    So we can get into a case where we never mark ourselves as
    cwnd-limited for many seconds at a time, even with
    bulk/infinite-backlog senders, because:

    o In case (1) above, every time in tcp_write_xmit() we have enough
    cwnd to send a full-sized skb, we are not fully using the cwnd
    (because cwnd is not a multiple of the TSO skb size). So every time we
    send data, we are not cwnd limited, and so in the cwnd-limited
    tracking code in tcp_cwnd_validate() we mark ourselves as not
    cwnd-limited.

    o In case (2) above, every time in tcp_write_xmit() that we try to
    transmit the "remainder" of the cwnd but defer, we set the local
    variable is_cwnd_limited to true, but we do not send any packets, so
    sent_pkts is zero, so we don't call the cwnd-limited logic to update
    tp->is_cwnd_limited.

    Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
    Reported-by: Ingemar Johansson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Neal Cardwell
     
  • The offending commit introduces a cleanup callback that is invoked
    when the driver module is removed to clean up the tunnel device
    flow block. But it returns on the first iteration of the for loop.
    The remaining indirect flow blocks will never be freed.

    Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure")
    CC: Pablo Neira Ayuso
    Signed-off-by: Chris Mi
    Reviewed-by: Roi Dayan

    Chris Mi
     
  • For DCTCP, we have to retain the ECT bits set by the congestion control
    algorithm on the socket when reflecting syn TOS in syn-ack, in order to
    make ECN work properly.

    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Reported-by: Alexander Duyck
    Signed-off-by: Wei Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     
  • Syzbot reported a stack overflow in bitmap_from_arr32() called from
    ethnl_parse_bitset() when bitset from netlink message is longer than
    target bitmap length. While ethnl_compact_sanity_checks() makes sure that
    trailing part is all zeros (i.e. the request does not try to touch bits
    kernel does not recognize), we also need to cap change_bits to nbits so
    that we don't try to write past the prepared bitmaps.
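
    The capping idea can be illustrated with a small stand-alone helper. This
    is a sketch only: copy_bits_capped() is a hypothetical function that works
    on arrays of 32-bit words, and it is not the kernel's bitmap_from_arr32()
    or the actual ethtool fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITS_PER_WORD 32

/*
 * Copy bits from src into dst, never touching more than dst_nbits bits
 * of the destination, however long the source claims to be.
 */
static void copy_bits_capped(uint32_t *dst, size_t dst_nbits,
                             const uint32_t *src, size_t src_nbits)
{
    /* The cap: use the smaller of the two lengths. */
    size_t nbits = src_nbits < dst_nbits ? src_nbits : dst_nbits;
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    size_t dst_words = (dst_nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

    assert(dst && src);

    memset(dst, 0, dst_words * sizeof(*dst));
    memcpy(dst, src, words * sizeof(*dst));

    /* Clear any bits above nbits in the last word that was copied. */
    if (nbits % BITS_PER_WORD)
        dst[words - 1] &= (1U << (nbits % BITS_PER_WORD)) - 1;
}
```

    With a 40-bit destination and a 128-bit source, only two destination
    words are written, and memory after the destination bitmap is left alone.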

    Fixes: 88db6d1e4f62 ("ethtool: add ethnl_parse_bitset() helper")
    Reported-by: syzbot+9d39fa49d4df294aab93@syzkaller.appspotmail.com
    Signed-off-by: Michal Kubecek
    Link: https://lore.kernel.org/r/3487ee3a98e14cd526f55b6caaa959d2dcbcad9f.1607465316.git.mkubecek@suse.cz
    Signed-off-by: Jakub Kicinski

    Michal Kubecek
     
  • Changed a configuration in the flows to align with
    architecture requirements to achieve S0i3.2 substate.

    This helps both i219V and i219LM configurations.

    Also fixed a typo in the previous commit 632fbd5eb5b0
    ("e1000e: fix S0ix flows for cable connected case").

    Fixes: 632fbd5eb5b0 ("e1000e: fix S0ix flows for cable connected case").
    Signed-off-by: Vitaly Lifshits
    Tested-by: Aaron Brown
    Signed-off-by: Tony Nguyen
    Reviewed-by: Alexander Duyck
    Signed-off-by: Mario Limonciello
    Link: https://lore.kernel.org/r/20201208185632.151052-1-mario.limonciello@dell.com
    Signed-off-by: Jakub Kicinski

    Vitaly Lifshits
     
  • The page recycle code incorrectly relied on the assumption that a page
    fragment could not be freed inside xdp_do_redirect(). As a result, page
    fragments still used by the stack/XDP redirect could be reused and
    overwritten.

    To avoid this, store the page count prior to invoking xdp_do_redirect().

    Fixes: efc2214b6047 ("ice: Add support for XDP")
    Reported-and-analyzed-by: Li RongQing
    Signed-off-by: Björn Töpel
    Tested-by: George Kuruvinakunnel
    Signed-off-by: Tony Nguyen

    Björn Töpel
     
  • The page recycle code incorrectly relied on the assumption that a page
    fragment could not be freed inside xdp_do_redirect(). As a result, page
    fragments still used by the stack/XDP redirect could be reused and
    overwritten.

    To avoid this, store the page count prior to invoking xdp_do_redirect().

    Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect")
    Reported-and-analyzed-by: Li RongQing
    Signed-off-by: Björn Töpel
    Tested-by: Sandeep Penigalapati
    Signed-off-by: Tony Nguyen

    Björn Töpel
     
  • The page recycle code incorrectly relied on the assumption that a page
    fragment could not be freed inside xdp_do_redirect(). As a result, page
    fragments still used by the stack/XDP redirect could be reused and
    overwritten.

    To avoid this, store the page count prior to invoking xdp_do_redirect().

    Longer explanation:

    Intel NICs have a recycle mechanism. The main idea is that a page is
    split into two parts. One part is owned by the driver, one part might
    be owned by someone else, such as the stack.

    t0: Page is allocated, and put on the Rx ring
                  +---------------
    used by NIC ->| upper buffer
    (rx_buffer)   +---------------
                  | lower buffer
                  +---------------
      page count  == USHRT_MAX
      rx_buffer->pagecnt_bias == USHRT_MAX

    t1: Buffer is received, and passed to the stack (e.g.)
                  +---------------
                  | upper buff (skb)
                  +---------------
    used by NIC ->| lower buffer
    (rx_buffer)   +---------------
      page count  == USHRT_MAX
      rx_buffer->pagecnt_bias == USHRT_MAX - 1

    t2: Buffer is received, and redirected
                  +---------------
                  | upper buff (skb)
                  +---------------
    used by NIC ->| lower buffer
    (rx_buffer)   +---------------

    Now, prior to calling xdp_do_redirect():
      page count  == USHRT_MAX
      rx_buffer->pagecnt_bias == USHRT_MAX - 2

    This means that buffer *cannot* be flipped/reused, because the skb is
    still using it.

    The problem arises when xdp_do_redirect() actually frees the
    segment. Then we get:
    page count == USHRT_MAX - 1
    rx_buffer->pagecnt_bias == USHRT_MAX - 2

    From a recycle perspective, the buffer can be flipped and reused,
    which means that the skb data area is passed to the Rx HW ring!

    To work around this, the page count is stored prior to calling
    xdp_do_redirect().

    Note that this is not optimal, since the NIC could actually reuse the
    "lower buffer" again. However, then we need to track whether
    XDP_REDIRECT consumed the buffer or not.
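
    The counter arithmetic above can be replayed with a toy model. This is a
    sketch under stated assumptions: struct rx_buffer_model and can_reuse()
    are hypothetical, and the "at most one outside reference" threshold is
    inferred from the numbers in the explanation rather than copied from the
    drivers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified model of an Rx buffer; not the drivers' struct. */
struct rx_buffer_model {
    int page_count;    /* references currently held on the page */
    int pagecnt_bias;  /* references the driver still owns */
};

/* Reuse is allowed only if at most one reference is held outside the driver. */
static bool can_reuse(int page_count, int pagecnt_bias)
{
    return (page_count - pagecnt_bias) <= 1;
}

/* Replays the t2 scenario: skb still holds the upper half, XDP redirects. */
void recycle_demo(void)
{
    struct rx_buffer_model buf = { USHRT_MAX, USHRT_MAX - 2 };
    int count_before = buf.page_count;  /* stored prior to the redirect */

    /* Before the redirect, two outside references exist: no reuse. */
    assert(!can_reuse(buf.page_count, buf.pagecnt_bias));

    buf.page_count -= 1;  /* the redirect path frees its fragment */

    /* Checking the live count now wrongly allows reuse of skb memory. */
    assert(can_reuse(buf.page_count, buf.pagecnt_bias));

    /* Checking the stored count keeps the buffer from being flipped. */
    assert(!can_reuse(count_before, buf.pagecnt_bias));
}
```

    As the note above says, deciding on the stored count is conservative: it
    refuses reuse even in cases where the lower buffer would have been safe.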

    Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
    Reported-and-analyzed-by: Li RongQing
    Signed-off-by: Björn Töpel
    Tested-by: George Kuruvinakunnel
    Signed-off-by: Tony Nguyen

    Björn Töpel
     
  • Since we share the transmit queue with the network stack,
    it is possible that we run into a transmit queue timeout.
    This will reset the queue.
    This happens under high load when XDP is using the
    transmit queue pretty much exclusively.

    netdev_start_xmit() sets the trans_start variable of the
    transmit queue to jiffies which is later utilized by dev_watchdog(),
    so to avoid timeout, let stack know that XDP xmit happened by
    bumping the trans_start within XDP Tx routines to jiffies.
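
    The interaction between trans_start and the watchdog can be modeled in a
    few lines. This is a simplified sketch, not the igb or core networking
    code: struct txq_model, txq_timed_out() and xdp_xmit_done() are
    hypothetical names, and the real watchdog checks more conditions than
    elapsed time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transmit queue. */
struct txq_model {
    unsigned long trans_start;  /* tick of the last recorded transmit */
};

/* Watchdog check: has more than 'timeout' ticks passed since trans_start? */
static bool txq_timed_out(const struct txq_model *q, unsigned long now,
                          unsigned long timeout)
{
    assert(q);
    /* Signed difference keeps the comparison valid across counter wrap. */
    return (long)(now - (q->trans_start + timeout)) > 0;
}

/* What an XDP Tx routine does in this model: record that a transmit happened. */
static void xdp_xmit_done(struct txq_model *q, unsigned long now)
{
    assert(q);
    q->trans_start = now;
}
```

    If only the regular stack path updates trans_start, a queue kept busy by
    XDP alone looks idle to the watchdog; bumping it from the XDP path avoids
    the false timeout.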

    Fixes: 9cbc948b5a20 ("igb: add XDP support")
    Acked-by: Maciej Fijalkowski
    Signed-off-by: Sven Auhagen
    Tested-by: Sandeep Penigalapati
    Signed-off-by: Tony Nguyen

    Sven Auhagen
     
  • Since it is a new XDP implementation, change xdp_do_flush_map
    to xdp_do_flush.

    Fixes: 9cbc948b5a20 ("igb: add XDP support")
    Suggested-by: Maciej Fijalkowski
    Reviewed-by: Maciej Fijalkowski
    Acked-by: Maciej Fijalkowski
    Signed-off-by: Sven Auhagen
    Tested-by: Sandeep Penigalapati
    Signed-off-by: Tony Nguyen

    Sven Auhagen