14 Jun, 2020

6 commits

  • Pull networking fixes from David Miller:

    1) Fix cfg80211 deadlock, from Johannes Berg.

    2) RXRPC fails to send notifications, from David Howells.

    3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

    4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

    5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

    6) Fix hashtable memory leak in dccp, from Wang Hai.

    7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

    8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

    9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

    10) Fix link speed reporting in iavf driver, from Brett Creeley.

    11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

    12) Fix memory leak in genetlink, from Cong Wang.

    13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
    net: ethernet: ti: ale: fix allmulti for nu type ale
    net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
    net: atm: Remove the error message according to the atomic context
    bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
    libbpf: Support pre-initializing .bss global variables
    tools/bpftool: Fix skeleton codegen
    bpf: Fix memlock accounting for sock_hash
    bpf: sockmap: Don't attach programs to UDP sockets
    bpf: tcp: Recv() should return 0 when the peer socket is closed
    ibmvnic: Flush existing work items before device removal
    genetlink: clean up family attributes allocations
    net: ipa: header pad field only valid for AP->modem endpoint
    net: ipa: program upper nibbles of sequencer type
    net: ipa: fix modem LAN RX endpoint id
    net: ipa: program metadata mask differently
    ionic: add pcie_print_link_status
    rxrpc: Fix race between incoming ACK parser and retransmitter
    net/mlx5: E-Switch, Fix some error pointer dereferences
    net/mlx5: Don't fail driver on failure to create debugfs
    net/mlx5e: CT: Fix ipv6 nat header rewrite actions
    ...

    Linus Torvalds
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-06-12

    The following pull-request contains BPF updates for your *net* tree.

    We've added 26 non-merge commits during the last 10 day(s) which contain
    a total of 27 files changed, 348 insertions(+), 93 deletions(-).

    The main changes are:

    1) sock_hash accounting fix, from Andrey.

    2) libbpf fix and probe_mem sanitizing, from Andrii.

    3) sock_hash fixes, from Jakub.

    4) devmap_val fix, from Jesper.

    5) load_bytes_relative fix, from YiFei.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Looking at the context (atomic!), the error message should be dropped.

    Signed-off-by: Liao Pingfang
    Signed-off-by: David S. Miller

    Liao Pingfang
     
  • Pull more Kbuild updates from Masahiro Yamada:

    - fix build rules in binderfs sample

    - fix build errors when Kbuild recurses to the top Makefile

    - convert '---help---' in Kconfig to 'help'

    * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    treewide: replace '---help---' in Kconfig files with 'help'
    kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
    samples: binderfs: really compile this sample and fix build issues

    Linus Torvalds
     
  • Pull 9p update from Dominique Martinet:
    "Another very quiet cycle... Only one commit: increase the size of the
    ring used for xen transport"

    * tag '9p-for-5.8' of git://github.com/martinetd/linux:
    9p/xen: increase XEN_9PFS_RING_ORDER

    Linus Torvalds
     
  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
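
    As a rough illustration of what that substitution does, here is a
    minimal Python sketch that applies an equivalent pattern to the seven
    indentation styles listed above. The function name convert_help is
    made up for this example. The pattern uses [ \t] instead of
    [[:space:]] so that, when run over a multi-line string, it cannot
    match across line breaks.

```python
import re

# Equivalent of the sed expression: leading blanks, then '---help---'.
# [ \t] is used instead of \s so a match never spans a newline.
HELP_RE = re.compile(r"^[ \t]*---help---", flags=re.MULTILINE)


def convert_help(text: str) -> str:
    """Rewrite every '---help---' line start to one tab + 'help'."""
    return HELP_RE.sub("\thelp", text)


# The seven indentation styles (a) to (g) from the list above.
variants = [
    "    ---help---",       # a) 4 spaces
    "       ---help---",    # b) 7 spaces
    "        ---help---",   # c) 8 spaces
    " \t---help---",        # d) 1 space + 1 tab
    "\t---help---",         # e) 1 tab
    "\t ---help---",        # f) 1 tab + 1 space
    "\t  ---help---",       # g) 1 tab + 2 spaces
]
converted = [convert_help(v) for v in variants]
```

    Every entry in converted should come out as one tab followed by
    'help', whichever style it started from.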

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

13 Jun, 2020

4 commits

  • Add missed bpf_map_charge_init() in sock_hash_alloc() and
    correspondingly bpf_map_charge_finish() on ENOMEM.

    It was found accidentally while working on unrelated selftest that
    checks "map->memory.pages > 0" is true for all map types.

    Before:
    # bpftool m l
    ...
    3692: sockhash name m_sockhash flags 0x0
    key 4B value 4B max_entries 8 memlock 0B

    After:
    # bpftool m l
    ...
    84: sockmap name m_sockmap flags 0x0
    key 4B value 4B max_entries 8 memlock 4096B

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com

    Andrey Ignatov
     
  • The stream parser infrastructure isn't set up to deal with UDP
    sockets, so we mustn't try to attach programs to them.

    I remember making this change at some point, but I must have lost
    it while rebasing or something similar.

    Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200611172520.327602-1-lmb@cloudflare.com

    Lorenz Bauer
     
  • If the peer is closed, we will never get more data, so
    tcp_bpf_wait_data will get stuck forever. In case we passed
    MSG_DONTWAIT to recv(), we get EAGAIN but we should actually get
    0.

    From man 2 recv:

    RETURN VALUE

    When a stream socket peer has performed an orderly shutdown, the
    return value will be 0 (the traditional "end-of-file" return).

    This patch makes tcp_bpf_wait_data always return 1 when the peer
    socket has been shutdown. Either we have data available, and it would
    have returned 1 anyway, or there isn't, in which case we'll call
    tcp_recvmsg which does the right thing in this situation.
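
    The user-space behaviour this restores can be seen on an ordinary
    stream socket. The sketch below is only an illustration: it uses an
    AF_UNIX socketpair on Linux rather than a TCP socket attached to a
    sockmap, and the helper name recv_nonblocking is invented for the
    example.

```python
import socket


def recv_nonblocking(sock: socket.socket, size: int = 16):
    """Return received bytes, or None if the call would block (EAGAIN)."""
    try:
        return sock.recv(size, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None


a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

before_close = recv_nonblocking(a)  # peer still open, no data: would block
b.close()                           # the peer shuts down
after_close = recv_nonblocking(a)   # end-of-file: empty bytes (return value 0)
a.close()
```

    While the peer is open the non-blocking read reports EAGAIN; once the
    peer has closed, the same call reports end-of-file instead.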

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/26038a28c21fea5d04d4bd4744c5686d3f2e5504.1591784177.git.sd@queasysnail.net

    Sabrina Dubroca
     
  • genl_family_rcv_msg_attrs_parse() and genl_family_rcv_msg_attrs_free()
    take a boolean parameter to determine whether to allocate/free the family
    attrs. This is unnecessary as we can just check family->parallel_ops.
    More importantly, callers would not need to worry about pairing these
    parameters correctly after this patch.
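
    The idea can be modelled in a few lines of Python. This is a
    hypothetical sketch, not the kernel code: the Family class, the
    shared attrbuf field and the live_allocs counter are assumptions made
    for illustration. The point is that both helpers derive the decision
    from the same field, so a caller cannot pair them incorrectly.

```python
class Family:
    """Toy stand-in for a genetlink family (fields are illustrative)."""

    def __init__(self, parallel_ops: bool, maxattr: int):
        self.parallel_ops = parallel_ops
        self.maxattr = maxattr
        # Assumed shared buffer used when ops are not run in parallel.
        self.attrbuf = [None] * (maxattr + 1)
        # Bookkeeping for this sketch only: outstanding private buffers.
        self.live_allocs = 0


def attrs_parse(family: Family):
    """Return an attribute buffer; allocate only for parallel families."""
    if family.parallel_ops:
        family.live_allocs += 1
        return [None] * (family.maxattr + 1)
    return family.attrbuf


def attrs_free(family: Family, attrbuf) -> None:
    """Release the buffer; the same field decides, with no extra flag."""
    if family.parallel_ops:
        family.live_allocs -= 1
```

    With a caller-supplied boolean, a mismatched pair of calls would leak
    the private buffer, which is the kind of leak the second paragraph
    describes.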

    And this fixes a memory leak, as after commit c36f05559104
    ("genetlink: fix memory leaks in genl_family_rcv_msg_dumpit()")
    we call genl_family_rcv_msg_attrs_parse() for both parallel and
    non-parallel cases.

    Fixes: c36f05559104 ("genetlink: fix memory leaks in genl_family_rcv_msg_dumpit()")
    Reported-by: Ido Schimmel
    Signed-off-by: Cong Wang
    Reviewed-by: Ido Schimmel
    Tested-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Cong Wang
     

12 Jun, 2020

17 commits

  • There's a race between the retransmission code and the received ACK parser.
    The problem is that the retransmission loop has to drop the lock under
    which it is iterating through the transmission buffer in order to transmit
    a packet, but whilst the lock is dropped, the ACK parser can crank the Tx
    window round and discard the packets from the buffer.

    The retransmission code then updated the annotations for the wrong packet
    and a later retransmission thought it had to retransmit a packet that
    wasn't there, leading to a NULL pointer dereference.

    Fix this by:

    (1) Moving the annotation change to before we drop the lock prior to
    transmission. This means we can't vary the annotation depending on
    the outcome of the transmission, but that's fine - we'll retransmit
    again later if it failed now.

    (2) Skipping the packet if the skb pointer is NULL.
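
    A single-threaded Python model of the two changes, under heavy
    simplification: the buffer is a list, a discarded packet is None, and
    the transmit callback stands in for the window during which the lock
    is dropped. All names here (resend, RETRANS, UNACK) are invented for
    the sketch and do not mirror the rxrpc source.

```python
from typing import Callable, List, Optional

RETRANS = "retrans"  # slot is due for retransmission
UNACK = "unack"      # slot was sent and awaits an ACK


def resend(
    buffer: List[Optional[bytes]],
    annotations: List[str],
    transmit: Callable[[int, bytes], None],
) -> List[int]:
    """Retransmit every slot marked RETRANS; return the slots sent."""
    sent = []
    for slot in range(len(buffer)):
        if annotations[slot] != RETRANS:
            continue
        packet = buffer[slot]
        if packet is None:
            # Change (2): the ACK parser already discarded this packet.
            continue
        # Change (1): update the annotation while the "lock" is held.
        annotations[slot] = UNACK
        # The real code drops the lock around this call, so the ACK
        # parser may modify the buffer while transmit() runs.
        transmit(slot, packet)
        sent.append(slot)
    return sent
```

    If the callback discards a later packet while slot 0 is being sent,
    the loop skips that slot instead of dereferencing a missing packet.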

    The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 000000000000002d
    Workqueue: krxrpcd rxrpc_process_call
    RIP: 0010:rxrpc_get_skb+0x14/0x8a
    ...
    Call Trace:
    rxrpc_resend+0x331/0x41e
    ? get_vtime_delta+0x13/0x20
    rxrpc_process_call+0x3c0/0x4ac
    process_one_work+0x18f/0x27f
    worker_thread+0x1a3/0x247
    ? create_worker+0x17d/0x17d
    kthread+0xe6/0xeb
    ? kthread_delayed_work_timer_fn+0x83/0x83
    ret_from_fork+0x1f/0x30

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Propagate the sock_alloc_send_skb() error code when skb allocation
    fails instead of unconditionally setting it to EAGAIN, which might
    cause user space to loop unnecessarily.
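
    To see why the error code matters to callers, consider a typical
    user-space retry loop, sketched here in Python. Everything in it is
    hypothetical: send_with_retry and failing_send are made-up helpers,
    and ENOBUFS is chosen only as an example of a non-retryable error.

```python
import errno


def send_with_retry(send_once, max_retries: int = 5) -> int:
    """Call send_once() until it succeeds, retrying only on EAGAIN.

    Returns the number of attempts made. Any other OSError is raised
    immediately so the caller can react to the real failure.
    """
    for attempt in range(1, max_retries + 1):
        try:
            send_once()
            return attempt
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise
    raise RuntimeError(f"still EAGAIN after {max_retries} attempts")


def failing_send(code: int):
    """Build a fake send function that always fails with `code`."""
    calls = []

    def send_once():
        calls.append(code)
        raise OSError(code, "simulated send failure")

    return send_once, calls
```

    A failure reported as EAGAIN makes this loop spin through every
    retry, while the same failure reported with its own error code stops
    it on the first attempt.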

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Li RongQing
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1591852266-24017-1-git-send-email-lirongqing@baidu.com

    Li RongQing
     
  • When a bearer is enabled, we create a 'tipc_discoverer' object to store
    the bearer related data along with a timer and a preformatted discovery
    message buffer for later probing... However, this is only carried out
    after the bearer has been set 'up', which leaves a race condition that
    results in a kernel panic.

    It occurs when a discovery message from a peer node is received and
    processed in bottom half (since the bearer is 'up' already) just before
    the discoverer object is created but is now accessed in order to update
    the preformatted buffer (with a new trial address, ...) so leads to the
    NULL pointer dereference.

    We solve the problem by simply moving the bearer 'up' setting to later,
    so that everything is ready before any message is received.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • syzbot found the following issue:

    WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 check_copy_size include/linux/thread_info.h:150 [inline]
    WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 copy_from_iter include/linux/uio.h:144 [inline]
    WARNING: CPU: 0 PID: 6808 at include/linux/thread_info.h:150 tipc_msg_append+0x49a/0x5e0 net/tipc/msg.c:242
    Kernel panic - not syncing: panic_on_warn set ...

    This happens after commit 5e9eeccc58f3 ("tipc: fix NULL pointer
    dereference in streaming") that tried to build at least one buffer even
    when the message data length is zero... However, it now exposes another
    bug that the 'mss' can be zero and the 'cpy' will be negative, thus the
    above kernel WARNING will appear!
    The zero value of 'mss' is never expected because it means Nagle is not
    enabled for the socket (actually the socket type was 'SOCK_SEQPACKET'),
    so the function 'tipc_msg_append()' must not be called at all. But that
    was in this particular case since the message data length was zero, and
    the 'send
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • Pull NFS client updates from Anna Schumaker:
    "New features and improvements:
    - Sunrpc receive buffer sizes only change when establishing a GSS credential
    - Add more sunrpc tracepoints
    - Improve on tracepoints to capture internal NFS I/O errors

    Other bugfixes and cleanups:
    - Move a dprintk() to after a call to nfs_alloc_fattr()
    - Fix off-by-one issues in rpc_ntop6
    - Fix a few coccicheck warnings
    - Use the correct SPDX license identifiers
    - Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
    - Replace zero-length array with flexible array
    - Remove duplicate headers
    - Set invalid blocks after NFSv4 writes to update space_used attribute
    - Fix direct WRITE throughput regression"

    * tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
    NFS: Fix direct WRITE throughput regression
    SUNRPC: rpc_xprt lifetime events should record xprt->state
    xprtrdma: Make xprt_rdma_slot_table_entries static
    nfs: set invalid blocks after NFSv4 writes
    NFS: remove redundant initialization of variable result
    sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs
    NFS: Add a tracepoint in nfs_set_pgio_error()
    NFS: Trace short NFS READs
    NFS: nfs_xdr_status should record the procedure name
    SUNRPC: Set SOFTCONN when destroying GSS contexts
    SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT
    SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS
    SUNRPC: trace RPC client lifetime events
    SUNRPC: Trace transport lifetime events
    SUNRPC: Split the xdr_buf event class
    SUNRPC: Add tracepoint to rpc_call_rpcerror()
    SUNRPC: Update the RPC_SHOW_SOCKET() macro
    SUNRPC: Update the rpc_show_task_flags() macro
    SUNRPC: Trace GSS context lifetimes
    SUNRPC: receive buffer size estimation values almost never change
    ...

    Linus Torvalds
     
  • Fix the following sparse warning:

    net/sunrpc/xprtrdma/transport.c:71:14: warning: symbol 'xprt_rdma_slot_table_entries'
    was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Reviewed-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Zou Wei
     
  • When I cat parameter
    '/sys/module/sunrpc/parameters/auth_hashtable_size', it displays as
    follows. It is better to add a newline for easy reading.

    [root@hulk-202 ~]# cat /sys/module/sunrpc/parameters/auth_hashtable_size
    16[root@hulk-202 ~]#

    Signed-off-by: Xiongfeng Wang
    Signed-off-by: Anna Schumaker

    Xiongfeng Wang
     
  • Move the RPC_TASK_SOFTCONN flag into rpc_call_null_helper(). The
    only minor behavior change is that it is now also set when
    destroying GSS contexts.

    This gives a better guarantee that gss_send_destroy_context() will
    not hang for long if a connection cannot be established.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up.

    All of rpc_call_null_helper() call sites assert RPC_TASK_SOFT, so
    move that setting into rpc_call_null_helper() itself.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up.

    Commit a52458b48af1 ("NFS/NFSD/SUNRPC: replace generic creds with
    'struct cred'.") made rpc_call_null_helper() set RPC_TASK_NULLCREDS
    unconditionally. Therefore there's no need for
    rpc_call_null_helper()'s call sites to set RPC_TASK_NULLCREDS.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The "create" tracepoint records parts of the rpc_create arguments,
    and the shutdown tracepoint records when the rpc_clnt is about to
    signal pending tasks and destroy auths.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Refactor: Hoist create/destroy/disconnect tracepoints out of
    xprtrdma and into the generic RPC client. Some benefits include:

    - Enable tracing of xprt lifetime events for the socket transport
    types

    - Expose the different types of disconnect to help run down
    issues with lingering connections

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • To help tie the recorded xdr_buf to a particular RPC transaction,
    the client side version of this class should display task ID
    information and the server side one should show the request's XID.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Add a tracepoint in another common exit point for failing RPCs.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Avoid unnecessary cache sloshing by placing the buffer size
    estimation update logic behind an atomic bit flag.

    The size of GSS information included in each wrapped Reply does
    not change during the lifetime of a GSS context. Therefore, the
    au_rslack and au_ralign fields need to be updated only once after
    establishing a fresh GSS credential.

    Thus a slack size update must occur after a cred is created,
    duplicated, renewed, or expires. I'm not sure I have this exactly
    right. A trace point is introduced to track updates to these
    variables to enable troubleshooting the problem if I missed a spot.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - Keep nfsd clients from unnecessarily breaking their own
    delegations.

    Note this requires a small kthreadd addition. The result is Tejun
    Heo's suggestion (see link), and he was OK with this going through
    my tree.

    - Patch nfsd/clients/ to display filenames, and to fix byte-order
    when displaying stateid's.

    - fix a module loading/unloading bug, from Neil Brown.

    - A big series from Chuck Lever with RPC/RDMA and tracing
    improvements, and lay some groundwork for RPC-over-TLS"

    Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com

    * tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits)
    sunrpc: use kmemdup_nul() in gssp_stringify()
    nfsd: safer handling of corrupted c_type
    nfsd4: make drc_slab global, not per-net
    SUNRPC: Remove unreachable error condition in rpcb_getport_async()
    nfsd: Fix svc_xprt refcnt leak when setup callback client failed
    sunrpc: clean up properly in gss_mech_unregister()
    sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
    sunrpc: check that domain table is empty at module unload.
    NFSD: Fix improperly-formatted Doxygen comments
    NFSD: Squash an annoying compiler warning
    SUNRPC: Clean up request deferral tracepoints
    NFSD: Add tracepoints for monitoring NFSD callbacks
    NFSD: Add tracepoints to the NFSD state management code
    NFSD: Add tracepoints to NFSD's duplicate reply cache
    SUNRPC: svc_show_status() macro should have enum definitions
    SUNRPC: Restructure svc_udp_recvfrom()
    SUNRPC: Refactor svc_recvfrom()
    SUNRPC: Clean up svc_release_skb() functions
    SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives
    SUNRPC: Replace dprintk() call sites in TCP receive path
    ...

    Linus Torvalds
     

11 Jun, 2020

6 commits

  • Added a check in the switch case on start_header that checks for
    the existence of the header. If the MAC header is not set and the
    caller requests MAC, the result is -EFAULT. If the caller requests
    NET, the MAC header's existence is ignored entirely.

    There is no function to check NET header's existence and as far
    as cgroup_skb/egress is concerned it should always be set.

    Removed for ptr >= the start of header, considering offset is
    bounded unsigned and should always be true. len
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com

    YiFei Zhu
     
  • If a listening MPTCP socket has unaccepted sockets at close
    time, the related msks are freed via mptcp_sock_destruct(),
    which in turn does not invoke the proto->destroy() method
    nor the mptcp_token_destroy() function.

    Due to the above, the child msk socket is not removed from
    the token container, leading to later UaF.

    Address the issue by explicitly removing the token even in the
    above error path.

    Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Pull sysctl fixes from Al Viro:
    "Fixups to regressions in sysctl series"

    * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sysctl: reject gigantic reads/write to sysctl files
    cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
    trace: fix an incorrect __user annotation on stack_trace_sysctl
    random: fix an incorrect __user annotation on proc_do_entropy
    net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
    net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl

    Linus Torvalds
     
  • Pull READ/WRITE_ONCE rework from Will Deacon:
    "This is the READ_ONCE rework I've been working on for a while, which
    bumps the minimum GCC version and improves code-gen on arm64 when
    stack protector is enabled"

    [ Side note: I'm _really_ tempted to raise the minimum gcc version to
    4.9, so that we can just say that we require _Generic() support.

    That would allow us to more cleanly handle a lot of the cases where we
    depend on very complex macros with 'sizeof' or __builtin_choose_expr()
    with __builtin_types_compatible_p() etc.

    This branch has a workaround for sparse not handling _Generic(),
    either, but that was already fixed in the sparse development branch,
    so it's really just gcc-4.9 that we'd require. - Linus ]

    * 'rwonce/rework' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux:
    compiler_types.h: Use unoptimized __unqual_scalar_typeof for sparse
    compiler_types.h: Optimize __unqual_scalar_typeof compilation time
    compiler.h: Enforce that READ_ONCE_NOCHECK() access size is sizeof(long)
    compiler-types.h: Include naked type in __pick_integer_type() match
    READ_ONCE: Fix comment describing 2x32-bit atomicity
    gcov: Remove old GCC 3.4 support
    arm64: barrier: Use '__unqual_scalar_typeof' for acquire/release macros
    locking/barriers: Use '__unqual_scalar_typeof' for load-acquire macros
    READ_ONCE: Drop pointer qualifiers when reading from scalar types
    READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses
    READ_ONCE: Simplify implementations of {READ,WRITE}_ONCE()
    arm64: csum: Disable KASAN for do_csum()
    fault_inject: Don't rely on "return value" from WRITE_ONCE()
    net: tls: Avoid assigning 'const' pointer to non-const pointer
    netfilter: Avoid assigning 'const' pointer to non-const pointer
    compiler/gcc: Raise minimum GCC version for kernel builds to 4.8

    Linus Torvalds
     
  • The msk sk_shutdown flag is set by a workqueue, possibly
    introducing some delay in user-space notification. If the last
    subflow carries some data with the fin packet, the user space
    can wake-up before RCV_SHUTDOWN is set. If it executes unblocking
    recvmsg(), it may return with an error instead of eof.

    Address the issue by explicitly checking for eof in recvmsg() when
    no data is found.

    Fixes: 59832e246515 ("mptcp: subflow: check parent mptcp socket on subflow state change")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • fdb nexthops are marked with a flag. For standalone nexthops, a flag was
    added to the nh_info struct. For groups that flag was added to struct
    nexthop when it should have been added to the group information. Fix
    by removing the flag from the nexthop struct and adding a flag to nh_group
    that mirrors nh_info and is really only a caching of the individual types.
    Add a helper, nexthop_is_fdb, for use by the vxlan code and fixup the
    internal code to use the flag from either nh_info or nh_group.
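
    The shape of such a helper can be sketched in Python. This is a toy
    model under stated assumptions: the dataclasses and the is_group,
    nh_info and nh_grp field names are chosen for illustration and are
    not copied from the kernel headers; only fdb_nh and nexthop_is_fdb
    come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NhInfo:
    """Per-nexthop data for a standalone nexthop."""
    fdb_nh: bool = False


@dataclass
class NhGroup:
    """Group data; fdb_nh caches the type of the group's members."""
    fdb_nh: bool = False


@dataclass
class Nexthop:
    """A nexthop is either standalone (nh_info) or a group (nh_grp)."""
    is_group: bool
    nh_info: Optional[NhInfo] = None
    nh_grp: Optional[NhGroup] = None


def nexthop_is_fdb(nh: Nexthop) -> bool:
    """Read the fdb flag from whichever structure backs this nexthop."""
    if nh.is_group:
        return nh.nh_grp.fdb_nh
    return nh.nh_info.fdb_nh
```

    Callers such as the vxlan code then ask one question and never need
    to know where the flag is stored.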

    v2
    - propagate fdb_nh in remove_nh_grp_entry

    Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
    Cc: Roopa Prabhu
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

10 Jun, 2020

7 commits

  • There are some memory leaks in dccp_init() and dccp_fini().

    In dccp_fini() and the error handling path in dccp_init(), free lhash2
    is missing. Add inet_hashinfo2_free_mod() to do it.

    If inet_hashinfo2_init_mod() failed in dccp_init(),
    percpu_counter_destroy() should be called to destroy dccp_orphan_count.
    It needs to goto out_free_percpu when inet_hashinfo2_init_mod() failed.
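
    The unwinding rule behind this fix is that every failure path and the
    exit path must release exactly what has been set up so far. A toy
    Python model follows, assuming only the two resources named above;
    the function names and the later_step_ok parameter are hypothetical.

```python
def dccp_init_sketch(hashinfo2_ok: bool = True, later_step_ok: bool = True):
    """Return (success, resources still allocated afterwards)."""
    live = ["dccp_orphan_count"]  # per-CPU counter is set up first
    if not hashinfo2_ok:
        # inet_hashinfo2_init_mod() failed: the out_free_percpu path
        # must destroy the counter.
        live.remove("dccp_orphan_count")
        return False, live
    live.append("lhash2")
    if not later_step_ok:
        # A later failure must also free lhash2, which is the job of
        # inet_hashinfo2_free_mod().
        live.remove("lhash2")
        live.remove("dccp_orphan_count")
        return False, live
    return True, live


def dccp_fini_sketch(live):
    """Module exit: release everything in reverse order of setup."""
    for name in reversed(list(live)):
        live.remove(name)
    return live
```

    In this model a leak shows up as a non-empty list after a failed
    init or after fini.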

    Fixes: c92c81df93df ("net: dccp: fix kernel crash on module load")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Signed-off-by: David S. Miller

    Wang Hai
     
  • Since the quiesce/activate rework, __netdev_watchdog_up() is directly
    called in the ucc_geth driver.

    Unfortunately, this function is not available for modules and thus
    ucc_geth cannot be built as a module anymore. Fix it by exporting
    __netdev_watchdog_up().

    Since the commit introducing the regression was backported to stable
    branches, this one should ideally be as well.

    Fixes: 79dde73cf9bc ("net/ethernet/freescale: rework quiesce/activate for ucc_geth")
    Signed-off-by: Valentin Longchamp
    Signed-off-by: David S. Miller

    Valentin Longchamp
     
  • The dynamic key update for addr_list_lock still causes troubles,
    for example the following race condition still exists:

    CPU 0: CPU 1:
    (RCU read lock) (RTNL lock)
    dev_mc_seq_show() netdev_update_lockdep_key()
    -> lockdep_unregister_key()
    -> netif_addr_lock_bh()

    because lockdep doesn't provide an API to update it atomically.
    Therefore, we have to move it back to static keys and use subclass
    for nest locking like before.

    In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key
    changes"), I already reverted most parts of commit ab92d68fc22f
    ("net: core: add generic lockdep keys").

    This patch reverts the rest and also part of commit f3b0a18bb6cb
    ("net: remove unnecessary variables and callback"). After this
    patch, addr_list_lock changes back to using static keys and
    subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
    not have to change back to ->ndo_get_lock_subclass().

    And hopefully this reduces some syzbot lockdep noise too.

    Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
    Cc: Taehee Yoo
    Cc: Dmitry Vyukov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • We can end up modifying the sockhash bucket list from two CPUs when a
    sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
    that is in the sockhash is unlinking itself from it on another CPU
    (sock_hash_delete_from_link).

    This results in accessing a list element that is in an undefined state as
    reported by KASAN:

    | ==================================================================
    | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
    | Write of size 8 at addr dead000000000122 by task kworker/2:1/95
    |
    | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x97/0xe0
    | ? sock_hash_free+0x13c/0x280
    | __kasan_report.cold+0x5/0x40
    | ? mark_lock+0xbc1/0xc00
    | ? sock_hash_free+0x13c/0x280
    | kasan_report+0x38/0x50
    | ? sock_hash_free+0x152/0x280
    | sock_hash_free+0x13c/0x280
    | bpf_map_free_deferred+0xb2/0xd0
    | ? bpf_map_charge_finish+0x50/0x50
    | ? rcu_read_lock_sched_held+0x81/0xb0
    | ? rcu_read_lock_bh_held+0x90/0x90
    | process_one_work+0x59a/0xac0
    | ? lock_release+0x3b0/0x3b0
    | ? pwq_dec_nr_in_flight+0x110/0x110
    | ? rwlock_bug.part.0+0x60/0x60
    | worker_thread+0x7a/0x680
    | ? _raw_spin_unlock_irqrestore+0x4c/0x60
    | kthread+0x1cc/0x220
    | ? process_one_work+0xac0/0xac0
    | ? kthread_create_on_node+0xa0/0xa0
    | ret_from_fork+0x24/0x30
    | ==================================================================

    Fix it by reintroducing spin-lock protected critical section around the
    code that removes the elements from the bucket on sockhash free.

    To do that we also need to defer processing of removed elements, until out
    of atomic context so that we can unlink the socket from the map when
    holding the sock lock.
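
    The pattern described here, detaching elements inside a short
    critical section and processing them afterwards, can be sketched in
    Python. This is an analogy rather than the kernel code: Bucket,
    free_all and the unlink callback are invented names, and
    threading.Lock stands in for the bucket spin-lock.

```python
import threading
from typing import Callable, List


class Bucket:
    """One hash bucket: a lock and the elements it protects."""

    def __init__(self, elems):
        self.lock = threading.Lock()
        self.elems = list(elems)


def free_all(buckets: List[Bucket], unlink: Callable[[str], None]) -> int:
    """Empty every bucket; return how many elements were processed."""
    freed = 0
    for bucket in buckets:
        # Critical section: only detach the elements from the bucket.
        with bucket.lock:
            unlinked, bucket.elems = bucket.elems, []
        # Outside the lock: work that may block, such as taking the
        # per-socket lock in the real code.
        for elem in unlinked:
            unlink(elem)
            freed += 1
    return freed
```

    Concurrent deleters that take the same bucket lock see either the
    full list or an empty one, never a half-removed element.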

    Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
    Reported-by: Eric Dumazet
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com

    Jakub Sitnicki
     
  • When sockhash gets destroyed while sockets are still linked to it, we will
    walk the bucket lists and delete the links. However, we are not freeing the
    list elements after processing them, leaking the memory.

    The leak can be triggered by close()'ing a sockhash map when it still
    contains sockets, and observed with kmemleak:

    unreferenced object 0xffff888116e86f00 (size 64):
    comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/.....
    backtrace:
    [] sock_hash_update_common+0x4ca/0x760
    [] sock_hash_update_elem+0x1d2/0x200
    [] __do_sys_bpf+0x2046/0x2990
    [] do_syscall_64+0xad/0x9a0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Fix it by freeing the list element when we're done with it.

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com

    Jakub Sitnicki
     
  • When a user application reads data from a bpf sockmap socket with the
    MSG_PEEK flag set, a kernel panic happens at
    __tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, an sk_msg is not
    removed from the ingress_msg queue after it has been read. Because the
    code does not check whether that sk_msg is the last msg in the
    ingress_msg queue, the next sk_msg may be the head of the ingress_msg
    queue, whose sg page address is invalid. So it's necessary to add a
    check to prevent this problem.
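
    For readers unfamiliar with MSG_PEEK, the following Python sketch
    shows the user-visible semantics on an ordinary socket: peeked data
    stays queued and is returned again by the next read. It uses an
    AF_UNIX socketpair on Linux for simplicity and does not involve
    sockmap, so it cannot reproduce the crash.

```python
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
b.sendall(b"hello")

peeked = a.recv(5, socket.MSG_PEEK)        # look at the data, leave it queued
peeked_again = a.recv(5, socket.MSG_PEEK)  # the same bytes are still there
consumed = a.recv(5)                       # a normal read removes them

a.close()
b.close()
```

    All three reads see the same five bytes, which is why the receive
    path has to walk the queue without removing entries when MSG_PEEK is
    set.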

    [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
    [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
    [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
    [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
    [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
    [20759.276099] tcp_bpf_recvmsg+0x113/0x370
    [20759.281137] inet_recvmsg+0x55/0xc0
    [20759.285734] __sys_recvfrom+0xc8/0x130
    [20759.290566] ? __audit_syscall_entry+0x103/0x130
    [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
    [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
    [20759.307235] __x64_sys_recvfrom+0x24/0x30
    [20759.312226] do_syscall_64+0x55/0x1b0
    [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: dihu
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com

    dihu
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse