10 Feb, 2020

1 commit

  • Pull more Kbuild updates from Masahiro Yamada:

    - fix randconfig to generate a sane .config

    - rename hostprogs-y / always to hostprogs / always-y, which are more
    natural syntax.

    - optimize scripts/kallsyms

    - fix yes2modconfig and mod2yesconfig

    - make multiple directory targets ('make foo/ bar/') work

    * tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: make multiple directory targets work
    kconfig: Invalidate all symbols after changing to y or m.
    kallsyms: fix type of kallsyms_token_table[]
    scripts/kallsyms: change table to store (strcut sym_entry *)
    scripts/kallsyms: rename local variables in read_symbol()
    kbuild: rename hostprogs-y/always to hostprogs/always-y
    kbuild: fix the document to use extra-y for vmlinux.lds
    kconfig: fix broken dependency in randconfig-generated .config

    Linus Torvalds
     

09 Feb, 2020

2 commits

  • Pull networking fixes from David Miller:

    1) Unbalanced locking in mwifiex_process_country_ie, from Brian Norris.

    2) Fix thermal zone registration in iwlwifi, from Andrei
    Otcheretianski.

    3) Fix double free_irq in sgi ioc3 eth, from Thomas Bogendoerfer.

    4) Use after free in mptcp, from Florian Westphal.

    5) Use after free in wireguard's root_remove_peer_lists, from Eric
    Dumazet.

    6) Properly access packets heads in bonding alb code, from Eric
    Dumazet.

    7) Fix data race in skb_queue_len(), from Qian Cai.

    8) Fix regression in r8169 on some chips, from Heiner Kallweit.

    9) Fix XDP program ref counting in hv_netvsc, from Haiyang Zhang.

    10) Certain kinds of set link netlink operations can cause a NULL deref
    in the ipv6 addrconf code. Fix from Eric Dumazet.

    11) Don't cancel uninitialized work queue in drop monitor, from Ido
    Schimmel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
    net: thunderx: use proper interface type for RGMII
    mt76: mt7615: fix max_nss in mt7615_eeprom_parse_hw_cap
    bpf: Improve bucket_log calculation logic
    selftests/bpf: Test freeing sockmap/sockhash with a socket in it
    bpf, sockhash: Synchronize_rcu before free'ing map
    bpf, sockmap: Don't sleep while holding RCU lock on tear-down
    bpftool: Don't crash on missing xlated program instructions
    bpf, sockmap: Check update requirements after locking
    drop_monitor: Do not cancel uninitialized work item
    mlxsw: spectrum_dpipe: Add missing error path
    mlxsw: core: Add validation of hardware device types for MGPIR register
    mlxsw: spectrum_router: Clear offload indication from IPv6 nexthops on abort
    selftests: mlxsw: Add test cases for local table route replacement
    mlxsw: spectrum_router: Prevent incorrect replacement of local table routes
    net: dsa: microchip: enable module autoprobe
    ipv6/addrconf: fix potential NULL deref in inet6_set_link_af()
    dpaa_eth: support all modes with rate adapting PHYs
    net: stmmac: update pci platform data to use phy_interface
    net: stmmac: xgmac: fix missing IFF_MULTICAST checki in dwxgmac2_set_filter
    net: stmmac: fix missing IFF_MULTICAST check in dwmac4_set_filter
    ...

    Linus Torvalds
     
  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"
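    As a rough user-space illustration of the idea (not the kernel code), the
    sketch below models syntax types as plain functions held in a per-filesystem
    table. Every name in it (param_is_u32, param_is_timeout, EXAMPLE_SPECS,
    parse_param) is hypothetical; only the "15s or 2h" timeout syntax comes from
    the text above.

```python
import re


def param_is_u32(value):
    """Return the parsed integer, or None if value is not a valid u32."""
    if value is None or not (value.isascii() and value.isdigit()):
        return None
    number = int(value)
    return number if number < 2**32 else None


_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def param_is_timeout(value):
    """Filesystem-private syntax type: '15s', '2h', ... -> seconds, else None."""
    match = re.fullmatch(r"([0-9]+)([smh])", value or "")
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


# Each filesystem owns its table; a new syntax type is just another function.
EXAMPLE_SPECS = {"retries": param_is_u32, "timeo": param_is_timeout}


def parse_param(specs, key, value):
    """Look up the checker for key and apply it; the parser has no type switch."""
    check = specs.get(key)
    if check is None:
        return None  # unknown parameter, left to the caller
    result = check(value)
    if result is None:
        raise ValueError(f"bad value for {key!r}: {value!r}")
    return result
```

    Adding the timeout type required no change to parse_param(), which is the
    property the pull message describes for fs_parse().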

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

15 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-02-07

    The following pull-request contains BPF updates for your *net* tree.

    We've added 15 non-merge commits during the last 10 day(s) which contain
    a total of 12 files changed, 114 insertions(+), 31 deletions(-).

    The main changes are:

    1) Various BPF sockmap fixes related to RCU handling in the map's tear-
    down code, from Jakub Sitnicki.

    2) Fix macro state explosion in BPF sk_storage map when calculating its
    bucket_log on allocation, from Martin KaFai Lau.

    3) Fix potential BPF sockmap update race by rechecking socket's established
    state under lock, from Lorenz Bauer.

    4) Fix crash in bpftool on missing xlated instructions when kptr_restrict
    sysctl is set, from Toke Høiland-Jørgensen.

    5) Fix i40e's XSK wakeup code to return proper error in busy state and
    various misc fixes in xdpsock BPF sample code, from Maciej Fijalkowski.

    6) Fix the way modifiers are skipped in BTF in the verifier while walking
    pointers to avoid program rejection, from Alexei Starovoitov.

    7) Fix Makefile for runqslower BPF tool to i) rebuild on libbpf changes and
    ii) to fix undefined reference linker errors for older gcc version due to
    order of passed gcc parameters, from Yulia Kartseva and Song Liu.

    8) Fix a trampoline_count BPF kselftest warning about missing braces around
    initializer, from Andrii Nakryiko.

    9) Fix up redundant "HAVE" prefix from large INSN limit kernel probe in
    bpftool, from Michal Rostecki.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - Server-to-server copy code from Olga.

    To use it, client and both servers must have support, the target
    server must be able to access the source server over NFSv4.2, and
    the target server must have the inter_copy_offload_enable module
    parameter set.

    - Improvements and bugfixes for the new filehandle cache, especially
    in the container case, from Trond

    - Also from Trond, better reporting of write errors.

    - Y2038 work from Arnd"

    * tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
    sunrpc: expiry_time should be seconds not timeval
    nfsd: make nfsd_filecache_wq variable static
    nfsd4: fix double free in nfsd4_do_async_copy()
    nfsd: convert file cache to use over/underflow safe refcount
    nfsd: Define the file access mode enum for tracing
    nfsd: Fix a perf warning
    nfsd: Ensure sampling of the write verifier is atomic with the write
    nfsd: Ensure sampling of the commit verifier is atomic with the commit
    sunrpc: clean up cache entry add/remove from hashtable
    sunrpc: Fix potential leaks in sunrpc_cache_unhash()
    nfsd: Ensure exclusion between CLONE and WRITE errors
    nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
    nfsd: Update the boot verifier on stable writes too.
    nfsd: Fix stable writes
    nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
    nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
    nfsd: Reduce the number of calls to nfsd_file_gc()
    nfsd: Schedule the laundrette regularly irrespective of file errors
    nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
    nfsd: Containerise filecache laundrette
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Anna Schumaker:
    "Stable bugfixes:
    - Fix memory leaks and corruption in readdir # v2.6.37+
    - Directory page cache needs to be locked when read # v2.6.37+

    New features:
    - Convert NFS to use the new mount API
    - Add "softreval" mount option to let clients use cache if server goes down
    - Add a config option to compile without UDP support
    - Limit the number of inactive delegations the client can cache at once
    - Improved readdir concurrency using iterate_shared()

    Other bugfixes and cleanups:
    - More 64-bit time conversions
    - Add additional diagnostic tracepoints
    - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
    - Various xprtrdma cleanups to prepare for 5.7's changes
    - Several fixes for NFS writeback and commit handling
    - Fix acls over krb5i/krb5p mounts
    - Recover from premature loss of openstateids
    - Fix NFS v3 chacl and chmod bug
    - Compare creds using cred_fscmp()
    - Use kmemdup_nul() in more places
    - Optimize readdir cache page invalidation
    - Lease renewal and recovery fixes"

    * tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
    NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
    NFSv4: try lease recovery on NFS4ERR_EXPIRED
    NFS: Fix memory leaks
    nfs: optimise readdir cache page invalidation
    NFS: Switch readdir to using iterate_shared()
    NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
    NFS: Directory page cache pages need to be locked when read
    NFS: Fix memory leaks and corruption in readdir
    SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
    NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
    NFSv4: Limit the total number of cached delegations
    NFSv4: Add accounting for the number of active delegations held
    NFSv4: Try to return the delegation immediately when marked for return on close
    NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
    NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
    NFS: nfs_find_open_context() should use cred_fscmp()
    NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
    NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
    NFS: remove unused macros
    nfs: Return EINVAL rather than ERANGE for mount parse errors
    ...

    Linus Torvalds
     
  • It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
    exponential effects on the number of states in the sparse checker.

    This patch breaks them up by calculating the "nbuckets" first so that the
    "bucket_log" only needs to take ilog2().

    In addition, Linus mentioned:

    Patch looks good, but I'd like to point out that it's not just sparse.

    You can see it with a simple

    make net/core/bpf_sk_storage.i
    grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc

    and see the end result:

    1 365071 2686974

    That's one line (the assignment line) that is 2,686,974 characters in
    length.

    Now, sparse does happen to react particularly badly to that (I didn't
    look to why, but I suspect it's just that evaluating all the types
    that don't actually ever end up getting used ends up being much more
    expensive than it should be), but I bet it's not good for gcc either.
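    The split described in this patch can be sketched in user space as follows.
    This is only a model: the helper names mirror the kernel macros, while the
    CPU-count input and the floor of two buckets are assumptions made for the
    example.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def ilog2(n: int) -> int:
    """Floor of log2(n), for n >= 1."""
    return n.bit_length() - 1


def bucket_log_for(num_cpus: int) -> int:
    # Step 1: compute "nbuckets" on its own, so the nested macros are
    # expanded once instead of inside each other.
    nbuckets = roundup_pow_of_two(num_cpus)
    nbuckets = max(2, nbuckets)  # assumed floor of two buckets
    # Step 2: "bucket_log" only needs to take ilog2() of that value.
    return ilog2(nbuckets)
```

    In C the benefit is in preprocessing: each macro argument is a simple
    variable rather than another macro expansion.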

    Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
    Reported-by: Randy Dunlap
    Reported-by: Luc Van Oostenryck
    Suggested-by: Linus Torvalds
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Luc Van Oostenryck
    Link: https://lore.kernel.org/bpf/20200207081810.3918919-1-kafai@fb.com

    Martin KaFai Lau
     
  • We need to have a synchronize_rcu before free'ing the sockhash because any
    outstanding psock references will have a pointer to the map and when they
    use it, this could trigger a use after free.

    This sister fix for sockhash comes from a manual audit and follows commit
    2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before free'ing map"), which
    addressed sockmap.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200206111652.694507-3-jakub@cloudflare.com

    Jakub Sitnicki
     
  • rcu_read_lock is needed to protect access to psock inside sock_map_unref
    when tearing down the map. However, we can't afford to sleep in lock_sock
    while in RCU read-side critical section. Grab the RCU lock only after we
    have locked the socket.

    This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
    sockmap/sockhash that contains at least one socket:

    | =============================
    | WARNING: suspicious RCU usage
    | 5.5.0-04005-g8fc91b972b73 #450 Not tainted
    | -----------------------------
    | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
    |
    | other info that might help us debug this:
    |
    |
    | rcu_scheduler_active = 2, debug_locks = 1
    | 4 locks held by kworker/0:1/62:
    | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
    | #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
    |
    | stack backtrace:
    | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x71/0xa0
    | ___might_sleep+0x105/0x190
    | lock_sock_nested+0x28/0x90
    | sock_map_free+0x95/0x170
    | bpf_map_free_deferred+0x58/0x80
    | process_one_work+0x260/0x5e0
    | worker_thread+0x4d/0x3e0
    | kthread+0x108/0x140
    | ? process_one_work+0x5e0/0x5e0
    | ? kthread_park+0x90/0x90
    | ret_from_fork+0x3a/0x50

    | =============================
    | WARNING: suspicious RCU usage
    | 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted
    | -----------------------------
    | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
    |
    | other info that might help us debug this:
    |
    |
    | rcu_scheduler_active = 2, debug_locks = 1
    | 4 locks held by kworker/0:1/62:
    | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
    | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
    | #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
    |
    | stack backtrace:
    | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x71/0xa0
    | ___might_sleep+0x105/0x190
    | lock_sock_nested+0x28/0x90
    | sock_hash_free+0xec/0x1d0
    | bpf_map_free_deferred+0x58/0x80
    | process_one_work+0x260/0x5e0
    | worker_thread+0x4d/0x3e0
    | kthread+0x108/0x140
    | ? process_one_work+0x5e0/0x5e0
    | ? kthread_park+0x90/0x90
    | ret_from_fork+0x3a/0x50
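    The ordering rule behind this fix can be modelled in user space. The sketch
    below is a toy: Ctx, rcu_read_lock(), lock_sock() and the two teardown
    functions are stand-ins written for this example, and "sleeping" is reduced
    to recording a warning when a socket lock is taken inside a read-side
    section.

```python
from contextlib import contextmanager


class Ctx:
    """Tracks RCU read-side nesting and any ordering violations."""

    def __init__(self):
        self.rcu_depth = 0
        self.warnings = []


@contextmanager
def rcu_read_lock(ctx):
    ctx.rcu_depth += 1
    try:
        yield
    finally:
        ctx.rcu_depth -= 1


@contextmanager
def lock_sock(ctx, name):
    # Taking the socket lock may sleep, which is illegal inside an RCU
    # read-side critical section.
    if ctx.rcu_depth > 0:
        ctx.warnings.append(f"illegal context switch while locking {name}")
    yield


def teardown_buggy(ctx, socks):
    with rcu_read_lock(ctx):          # RCU first ...
        for sk in socks:
            with lock_sock(ctx, sk):  # ... then a lock that may sleep
                pass


def teardown_fixed(ctx, socks):
    for sk in socks:
        with lock_sock(ctx, sk):      # lock the socket first
            with rcu_read_lock(ctx):  # only then enter the RCU section
                pass                  # psock access would happen here
```

    Running teardown_buggy() on a non-empty list records one warning per
    socket; teardown_fixed() records none.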

    Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200206111652.694507-2-jakub@cloudflare.com

    Jakub Sitnicki
     
  • It's currently possible to insert sockets in unexpected states into
    a sockmap, due to a TOCTTOU when updating the map from a syscall.
    sock_map_update_elem checks that sk->sk_state == TCP_ESTABLISHED,
    locks the socket and then calls sock_map_update_common. At this
    point, the socket may have transitioned into another state, and
    the earlier assumptions don't hold anymore. Crucially, it's
    conceivable (though very unlikely) that a socket has become unhashed.
    This breaks the sockmap's assumption that it will get a callback
    via sk->sk_prot->unhash.

    Fix this by checking the (fixed) sk_type and sk_protocol without the
    lock, followed by a locked check of sk_state.

    Unfortunately it's not possible to push the check down into
    sock_(map|hash)_update_common, since BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
    runs before the socket has transitioned from TCP_SYN_RECV into
    TCP_ESTABLISHED.
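    A minimal user-space sketch of the check ordering described above: fixed
    fields are tested without the lock, the mutable state only under it. The
    Sock class, update_elem() and the before_lock hook (used to simulate a
    concurrent state change) are hypothetical, and the error value returned is
    an assumption.

```python
import errno
import threading
from dataclasses import dataclass, field

SOCK_STREAM = 1
IPPROTO_TCP = 6
TCP_ESTABLISHED = 1
TCP_CLOSE = 7


@dataclass
class Sock:
    sk_type: int = SOCK_STREAM
    sk_protocol: int = IPPROTO_TCP
    sk_state: int = TCP_ESTABLISHED
    lock: threading.Lock = field(default_factory=threading.Lock)


def update_elem(sockmap, key, sk, before_lock=None):
    # sk_type and sk_protocol never change, so no lock is needed here.
    if sk.sk_type != SOCK_STREAM or sk.sk_protocol != IPPROTO_TCP:
        return -errno.EOPNOTSUPP
    if before_lock is not None:
        before_lock(sk)  # test hook: the state may change in this window
    with sk.lock:
        # sk_state is re-checked under the lock, closing the TOCTTOU window.
        if sk.sk_state != TCP_ESTABLISHED:
            return -errno.EOPNOTSUPP
        sockmap[key] = sk
        return 0
```

    With the state check inside the locked region, a socket that leaves
    TCP_ESTABLISHED before the lock is taken is rejected instead of inserted.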

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200207103713.28175-1-lmb@cloudflare.com

    Lorenz Bauer
     
  • The former contains nothing but a pointer to an array of the latter...

    Signed-off-by: Al Viro

    Al Viro
     
  • Unused now.

    Signed-off-by: Eric Sandeen
    Acked-by: David Howells
    Signed-off-by: Al Viro

    Eric Sandeen
     
  • ... and now errorf() et.al. are never called with NULL fs_context,
    so we can get rid of conditional in those.

    Signed-off-by: Al Viro

    Al Viro
     
  • fs_parse() analogue taking p_log instead of fs_context.
    fs_parse() turned into a wrapper, callers in ceph_common and rbd
    switched to __fs_parse().

    As a result, fs_parse() never gets a NULL fs_context, and neither
    do the fs_context-based logging primitives.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • When upcalling gssproxy, cache_head.expiry_time is set as a
    timeval, not seconds since boot. As such, RPC cache expiry
    logic will not clean expired objects created under
    auth.rpcsec.context cache.

    This has proven to cause kernel memory leaks in the field. Use the
    64-bit variants of getboottime/timespec for the conversion.

    Expiration times have worked this way since 2010's c5b29f885afe "sunrpc:
    use seconds since boot in expiry cache". The gssproxy code introduced
    in 2012 added gss_proxy_save_rsc and introduced the bug. That's a while
    for this to lurk, but it required a bit of an extreme case to make it
    obvious.
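    The unit mismatch can be illustrated with plain arithmetic. The sketch
    below is a model under assumptions: the function names are invented for the
    example, and the upcall is assumed to hand back a wall-clock expiry in
    seconds, which has to be made relative to boot before the cache can
    compare it with uptime.

```python
def to_seconds_since_boot(expiry_wall: int, boot_wall: int) -> int:
    """Convert a wall-clock expiry (seconds) to seconds since boot."""
    return expiry_wall - boot_wall


def cache_is_expired(expiry_time: int, uptime: int) -> bool:
    """The cache compares expiry_time against seconds since boot."""
    return expiry_time < uptime


BOOT_WALL = 1_581_000_000        # example boot time, seconds since the epoch
UPTIME = 3_600                   # the machine has been up for one hour
EXPIRY_WALL = BOOT_WALL + 600    # the context expired 50 minutes ago

# Storing the wall-clock value directly: the entry looks valid for decades.
leaked = not cache_is_expired(EXPIRY_WALL, UPTIME)

# Storing the boot-relative value: the entry is correctly seen as expired.
cleaned = cache_is_expired(to_seconds_since_boot(EXPIRY_WALL, BOOT_WALL), UPTIME)
```

    Because a wall-clock value is far larger than any realistic uptime, the
    unconverted entry never expires, which matches the leak described above.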

    Signed-off-by: Roberto Bergantinos Corpas
    Cc: stable@vger.kernel.org
    Fixes: 030d794bf498 "SUNRPC: Use gssproxy upcall for server..."
    Tested-By: Frank Sorenson
    Signed-off-by: J. Bruce Fields

    Roberto Bergantinos Corpas
     
  • Drop monitor uses a work item that takes care of constructing and
    sending netlink notifications to user space. In case drop monitor never
    started to monitor, then the work item is uninitialized and not
    associated with a function.

    Therefore, a stop command from user space results in canceling an
    uninitialized work item which leads to the following warning [1].

    Fix this by not processing a stop command if drop monitor is not
    currently monitoring.

    [1]
    [ 31.735402] ------------[ cut here ]------------
    [ 31.736470] WARNING: CPU: 0 PID: 143 at kernel/workqueue.c:3032 __flush_work+0x89f/0x9f0
    ...
    [ 31.738120] CPU: 0 PID: 143 Comm: dwdump Not tainted 5.5.0-custom-09491-g16d4077796b8 #727
    [ 31.741968] RIP: 0010:__flush_work+0x89f/0x9f0
    ...
    [ 31.760526] Call Trace:
    [ 31.771689] __cancel_work_timer+0x2a6/0x3b0
    [ 31.776809] net_dm_cmd_trace+0x300/0xef0
    [ 31.777549] genl_rcv_msg+0x5c6/0xd50
    [ 31.781005] netlink_rcv_skb+0x13b/0x3a0
    [ 31.784114] genl_rcv+0x29/0x40
    [ 31.784720] netlink_unicast+0x49f/0x6a0
    [ 31.787148] netlink_sendmsg+0x7cf/0xc80
    [ 31.790426] ____sys_sendmsg+0x620/0x770
    [ 31.793458] ___sys_sendmsg+0xfd/0x170
    [ 31.802216] __sys_sendmsg+0xdf/0x1a0
    [ 31.806195] do_syscall_64+0xa0/0x540
    [ 31.806885] entry_SYSCALL_64_after_hwframe+0x49/0xbe
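    The guard can be sketched as a small state machine. This is a user-space
    toy, not the drop monitor code: WorkItem and DropMonitor are hypothetical
    classes, and returning False for a stop request that has nothing to stop
    is an assumption about how the rejection is reported.

```python
class WorkItem:
    """A work item that must be initialized before it can be cancelled."""

    def __init__(self):
        self.func = None

    def init(self, func):
        self.func = func

    def cancel(self):
        if self.func is None:
            raise RuntimeError("cancel on uninitialized work item")


class DropMonitor:
    def __init__(self):
        self.monitoring = False
        self.work = WorkItem()

    def start(self):
        # The work item only gets its function when monitoring starts.
        self.work.init(lambda: None)
        self.monitoring = True

    def stop(self):
        # The fix: ignore a stop command unless monitoring is active, so the
        # uninitialized work item is never touched.
        if not self.monitoring:
            return False
        self.work.cancel()
        self.monitoring = False
        return True
```

    Without the early return in stop(), a stop issued before any start would
    reach cancel() on an uninitialized item, the situation shown in [1].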

    Fixes: 8e94c3bc922e ("drop_monitor: Allow user to start monitoring hardware drops")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • __in6_dev_get(dev) called from inet6_set_link_af() can return NULL.

    The needed check has been recently removed, let's add it back.

    While do_setlink() does call validate_linkmsg() :
    ...
    err = validate_linkmsg(dev, tb); /* OK at this point */
    ...

    It is possible that the following call happening before the
    ->set_link_af() removes IPv6 if MTU is less than 1280 :

    if (tb[IFLA_MTU]) {
            err = dev_set_mtu_ext(dev, nla_get_u32(tb[IFLA_MTU]), extack);
            if (err < 0)
                    goto errout;
            status |= DO_SETLINK_MODIFIED;
    }
    ...

    if (tb[IFLA_AF_SPEC]) {
            ...
            err = af_ops->set_link_af(dev, af);
                    ->inet6_set_link_af() // CRASH because idev is NULL

    Please note that IPv4 is immune to the bug since inet_set_link_af() does :

    struct in_device *in_dev = __in_dev_get_rcu(dev);
    if (!in_dev)
            return -EAFNOSUPPORT;

    This problem has been mentioned in commit cf7afbfeb8ce ("rtnl: make
    link af-specific updates atomic") changelog :

    This method is not fail proof, while it is currently sufficient
    to make set_link_af() inerrable and thus 100% atomic, the
    validation function method will not be able to detect all error
    scenarios in the future, there will likely always be errors
    depending on states which are f.e. not protected by rtnl_mutex
    and thus may change between validation and setting.
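    The validate-then-apply gap can be reproduced in a few lines. The sketch
    below is a user-space model only: Dev, set_mtu(), set_link_af() and
    do_setlink() are simplified stand-ins, and the error returned by the
    re-added check mirrors the IPv4 snippet quoted above as an assumption.

```python
import errno

IPV6_MIN_MTU = 1280


class Dev:
    def __init__(self, mtu=1500):
        self.mtu = mtu
        self.idev = {"token": None}  # per-device IPv6 state, None if absent


def set_mtu(dev, mtu):
    dev.mtu = mtu
    if mtu < IPV6_MIN_MTU:
        dev.idev = None  # IPv6 is removed from the device


def set_link_af(dev, token):
    idev = dev.idev
    if idev is None:  # the check that was added back
        return -errno.EAFNOSUPPORT
    idev["token"] = token
    return 0


def do_setlink(dev, new_mtu, token):
    if dev.idev is None:  # validation: OK at this point
        return -errno.EINVAL
    if new_mtu is not None:
        set_mtu(dev, new_mtu)  # may invalidate what was just validated
    return set_link_af(dev, token)
```

    A request that both lowers the MTU below 1280 and sets an IPv6 attribute
    passes validation, then finds the IPv6 state gone; with the check in
    set_link_af() it gets an error instead of dereferencing a missing object.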

    IPv6: ADDRCONF(NETDEV_CHANGE): lo: link becomes ready
    general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
    CPU: 0 PID: 9698 Comm: syz-executor712 Not tainted 5.5.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f0494ca0d0 CR3: 000000009e4ac000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    do_setlink+0x2a9f/0x3720 net/core/rtnetlink.c:2754
    rtnl_group_changelink net/core/rtnetlink.c:3103 [inline]
    __rtnl_newlink+0xdd1/0x1790 net/core/rtnetlink.c:3257
    rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
    rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
    netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
    rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
    netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
    netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    ____sys_sendmsg+0x753/0x880 net/socket.c:2343
    ___sys_sendmsg+0x100/0x170 net/socket.c:2397
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
    __do_sys_sendmsg net/socket.c:2439 [inline]
    __se_sys_sendmsg net/socket.c:2437 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4402e9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffd62fbcf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402e9
    RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000008 R09: 00000000004002c8
    R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000401b70
    R13: 0000000000401c00 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace cfa7664b8fdcdff3 ]---
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000004 CR3: 000000009e4ac000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 7dc2bccab0ee ("Validate required parameters in inet6_validate_link_af")
    Signed-off-by: Eric Dumazet
    Bisected-and-reported-by: syzbot
    Cc: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Feb, 2020

8 commits

  • When using taprio offloading together with ETF offloading, configured
    like this, for example:

    $ tc qdisc replace dev $IFACE parent root handle 100 taprio \
          num_tc 4 \
          map 2 2 1 0 3 2 2 2 2 2 2 2 2 2 2 2 \
          queues 1@0 1@1 1@2 1@3 \
          base-time $BASE_TIME \
          sched-entry S 01 1000000 \
          sched-entry S 0e 1000000 \
          flags 0x2

    $ tc qdisc replace dev $IFACE parent 100:1 etf \
          offload delta 300000 clockid CLOCK_TAI

    During enqueue, it turns out that the verification added for the
    "txtime"-assisted mode also runs when using taprio + ETF offloading.
    The only thing missing is initializing the 'next_txtime' of all the
    cycle entries (if 'next_txtime' is not set, all packets from SO_TXTIME
    sockets are dropped).

    Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • When destroying the current taprio instance, which can happen when the
    creation of one fails, we should reset the traffic class configuration
    back to the default state.

    netdev_reset_tc() is a better way because in addition to setting the
    number of traffic classes to zero, it also resets the priority to
    traffic classes mapping to the default value.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • netlink policy validation for the 'flags' argument was missing.

    Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • Because 'q->flags' starts as zero, and zero is a valid value, we
    aren't able to detect the transition from zero to something else
    during "runtime".

    The solution is to initialize 'q->flags' with an invalid value, so we
    can detect if 'q->flags' was set by the user or not.

    To better solidify the behavior, 'flags' handling is moved to a
    separate function. The behavior is:
    - 'flags' if unspecified by the user, is assumed to be zero;
    - 'flags' cannot change during "runtime" (i.e. a change() request
    cannot modify it);

    With this new function we can remove taprio_flags, which should reduce
    the risk of future accidents.
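    The behavior listed above can be sketched with a sentinel value. This is a
    model under assumptions: the function name new_flags(), the sentinel
    constant and the use of an exception to reject a change are chosen for the
    example and are not taken from the patch.

```python
# Sentinel meaning "flags were never set"; zero cannot be used because it is
# a valid flags value.
FLAGS_INVALID = 0xFFFFFFFF


def new_flags(old, requested):
    """Return the flags to store, given the stored value and the request.

    requested is None when the user did not specify 'flags'.
    """
    new = 0 if requested is None else requested
    # Once set, flags cannot change: a change() request must carry the same
    # value (or omit it while the stored value is zero).
    if old != FLAGS_INVALID and old != new:
        raise ValueError("changing 'flags' of a running schedule is not supported")
    return new
```

    Starting from FLAGS_INVALID, the first request always succeeds; every later
    request is accepted only if it resolves to the value already stored.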

    Allowing flags to be changed was causing the following RCU stall:

    [ 1730.558249] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    [ 1730.558258] rcu: 6-...0: (190 ticks this GP) idle=922/0/0x1 softirq=25580/25582 fqs=16250
    [ 1730.558264] (detected by 2, t=65002 jiffies, g=33017, q=81)
    [ 1730.558269] Sending NMI from CPU 2 to CPUs 6:
    [ 1730.559277] NMI backtrace for cpu 6
    [ 1730.559277] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G E 5.5.0-rc6+ #35
    [ 1730.559278] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
    [ 1730.559278] RIP: 0010:__hrtimer_run_queues+0xe2/0x440
    [ 1730.559278] Code: 48 8b 43 28 4c 89 ff 48 8b 75 c0 48 89 45 c8 e8 f4 bb 7c 00 0f 1f 44 00 00 65 8b 05 40 31 f0 68 89 c0 48 0f a3 05 3e 5c 25 01 82 fc 01 00 00 48 8b 45 c8 48 89 df ff d0 89 45 c8 0f 1f 44 00
    [ 1730.559279] RSP: 0018:ffff9970802d8f10 EFLAGS: 00000083
    [ 1730.559279] RAX: 0000000000000006 RBX: ffff8b31645bff38 RCX: 0000000000000000
    [ 1730.559280] RDX: 0000000000000000 RSI: ffffffff9710f2ec RDI: ffffffff978daf0e
    [ 1730.559280] RBP: ffff9970802d8f68 R08: 0000000000000000 R09: 0000000000000000
    [ 1730.559280] R10: 0000018336d7944e R11: 0000000000000001 R12: ffff8b316e39f9c0
    [ 1730.559281] R13: ffff8b316e39f940 R14: ffff8b316e39f998 R15: ffff8b316e39f7c0
    [ 1730.559281] FS: 0000000000000000(0000) GS:ffff8b316e380000(0000) knlGS:0000000000000000
    [ 1730.559281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1730.559281] CR2: 00007f1105303760 CR3: 0000000227210005 CR4: 00000000003606e0
    [ 1730.559282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1730.559282] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1730.559282] Call Trace:
    [ 1730.559282]
    [ 1730.559283] ? taprio_dequeue_soft+0x2d0/0x2d0 [sch_taprio]
    [ 1730.559283] hrtimer_interrupt+0x104/0x220
    [ 1730.559283] ? irqtime_account_irq+0x34/0xa0
    [ 1730.559283] smp_apic_timer_interrupt+0x6d/0x230
    [ 1730.559284] apic_timer_interrupt+0xf/0x20
    [ 1730.559284]
    [ 1730.559284] RIP: 0010:cpu_idle_poll+0x35/0x1a0
    [ 1730.559285] Code: 88 82 ff 65 44 8b 25 12 7d 73 68 0f 1f 44 00 00 e8 90 c3 89 ff fb 65 48 8b 1c 25 c0 7e 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 8b 03 a8 08 75 13 8b 05 be a8 a8 00 85 c0 75 ed e8 75 48 84 ff
    [ 1730.559285] RSP: 0018:ffff997080137ea8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    [ 1730.559285] RAX: 0000000000000001 RBX: ffff8b316bc3c580 RCX: 0000000000000000
    [ 1730.559286] RDX: 0000000000000001 RSI: 000000002819aad9 RDI: ffffffff978da730
    [ 1730.559286] RBP: ffff997080137ec0 R08: 0000018324a6d387 R09: 0000000000000000
    [ 1730.559286] R10: 0000000000000400 R11: 0000000000000001 R12: 0000000000000006
    [ 1730.559286] R13: ffff8b316bc3c580 R14: 0000000000000000 R15: 0000000000000000
    [ 1730.559287] ? cpu_idle_poll+0x20/0x1a0
    [ 1730.559287] ? cpu_idle_poll+0x20/0x1a0
    [ 1730.559287] do_idle+0x4d/0x1f0
    [ 1730.559287] ? complete+0x44/0x50
    [ 1730.559288] cpu_startup_entry+0x1b/0x20
    [ 1730.559288] start_secondary+0x142/0x180
    [ 1730.559288] secondary_startup_64+0xb6/0xc0
    [ 1776.686313] nvme nvme0: I/O 96 QID 1 timeout, completion polled

    Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • If a driver implementing taprio offloading depends on the network
    device's number of traffic classes (dev->num_tc) for whatever reason,
    it would see the value zero, because the value was only set after the
    offloading function had been called.

    So set the number of traffic classes before the offloading function
    is called. This is safe because it only happens when taprio is
    instantiated (we don't allow this configuration to be changed without
    first removing taprio).

    Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
    Reported-by: Po Liu
    Signed-off-by: Vinicius Costa Gomes
    Acked-by: Vladimir Oltean
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
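
    The ordering described in this commit can be sketched in plain C. This is
    an illustration only, not the sch_taprio code: struct fake_dev,
    driver_setup_offload() and qdisc_init() are hypothetical names standing in
    for the net device, the driver's offload hook and the qdisc setup path.

```c
/* Hypothetical stand-in for a network device with an offload hook. */
struct fake_dev {
    int num_tc;                                 /* number of traffic classes */
    int seen_num_tc;                            /* what the driver observed */
    int (*setup_offload)(struct fake_dev *dev); /* driver callback */
};

/* Hypothetical driver callback that depends on dev->num_tc. */
int driver_setup_offload(struct fake_dev *dev)
{
    dev->seen_num_tc = dev->num_tc;
    return 0;
}

/* Publish num_tc first, then call into the driver. */
int qdisc_init(struct fake_dev *dev, int num_tc)
{
    dev->num_tc = num_tc;
    return dev->setup_offload(dev);
}
```

    With the assignment placed after the callback instead, the driver in this
    sketch would record zero for a freshly zeroed device.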
     
  • rxrpc_rcu_destroy_call(), which is called as an RCU callback to clean up a
    put call, calls rxrpc_put_connection() which, deep in its bowels, takes a
    number of spinlocks in a non-BH-safe way, including rxrpc_conn_id_lock and
    local->client_conns_lock. RCU callbacks, however, are normally called from
    softirq context, which can cause lockdep to notice the locking
    inconsistency.

    To get lockdep to detect this, it's necessary to have the connection
    cleaned up on the put at the end of the last of its calls, though normally
    the clean up is deferred. This can be induced, however, by starting a call
    on an AF_RXRPC socket and then closing the socket without reading the
    reply.

    Fix this by having rxrpc_rcu_destroy_call() punt the destruction to a
    workqueue when called in softirq context, deferring it to process context.

    Note that another way to fix this could be to add a bunch of bh-disable
    annotations to the spinlocks concerned - and there might be more than just
    those two - but that means spending more time with BHs disabled.

    Note also that some of these places were covered by bh-disable spinlocks
    belonging to the rxrpc_transport object, but these got removed without the
    _bh annotation being retained on the next lock in.

    Fixes: 999b69f89241 ("rxrpc: Kill the client connection bundle concept")
    Reported-by: syzbot+d82f3ac8d87e7ccbb2c9@syzkaller.appspotmail.com
    Reported-by: syzbot+3f1fd6b8cbf8702d134e@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    cc: Hillf Danton
    Signed-off-by: David S. Miller

    David Howells
     
  • The recent patch that substituted a flag on an rxrpc_call for the
    connection pointer being NULL as an indication that a call was disconnected
    puts the set_bit in the wrong place for service calls. This is only a
    problem if a call is implicitly terminated by a new call coming in on the
    same connection channel instead of a terminating ACK packet.

    In such a case, rxrpc_input_implicit_end_call() calls
    __rxrpc_disconnect_call(), which is now (incorrectly) setting the
    disconnection bit, meaning that when rxrpc_release_call() is later called,
    it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
    the peer's error distribution list and the list gets corrupted.

    KASAN finds the issue as an access after release on a call, but the
    position at which it occurs is confusing as it appears to be related to a
    different call (the call site is where the latter call is being removed
    from the error distribution list and either the next or pprev pointer
    points to a previously released call).

    Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
    to rxrpc_disconnect_call() in the same place that the connection pointer
    was being cleared.

    Fixes: 5273a191dca6 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Don't do a single array; attach them to fsparam_enum() entry
    instead. And don't bother trying to embed the names into those -
    it actually loses memory, with no real speedup worth mentioning.

    Simplifies validation as well.

    Signed-off-by: Al Viro

    Al Viro
     

06 Feb, 2020

4 commits

  • The bug is that we call kfree_skb(skb) and then pass "skb" to
    qdisc_pkt_len(skb) on the next line, which is a use after free.
    Also Cong Wang points out that it's better to delay the actual
    frees until we drop the rtnl lock so we should use rtnl_kfree_skbs()
    instead of kfree_skb().

    Cc: Cong Wang
    Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
    Signed-off-by: Dan Carpenter
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dan Carpenter
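
    The use after free fixed here comes down to the order of two statements.
    A minimal userspace sketch, assuming a hypothetical struct pkt in place of
    sk_buff and free() in place of kfree_skb(); it does not model the deferred
    freeing that rtnl_kfree_skbs() provides:

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; only the length matters here. */
struct pkt {
    unsigned int len;
};

/*
 * Drop a packet and report how many bytes were dropped.
 * The length is copied out before the packet is freed; reading p->len
 * after free(p) would be a use after free.
 */
unsigned int drop_pkt(struct pkt *p)
{
    unsigned int len = p->len;

    free(p);
    return len;
}
```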
     
  • sk_buff_head.qlen can be accessed concurrently, as noticed by KCSAN:

    BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

    read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
    unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
    net/unix/af_unix.c:1761
    ____sys_sendmsg+0x33e/0x370
    ___sys_sendmsg+0xa6/0xf0
    __sys_sendmsg+0x69/0xf0
    __x64_sys_sendmsg+0x51/0x70
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
    __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
    __skb_try_recv_datagram+0xbe/0x220
    unix_dgram_recvmsg+0xee/0x850
    ____sys_recvmsg+0x1fb/0x210
    ___sys_recvmsg+0xa2/0xf0
    __sys_recvmsg+0x66/0xf0
    __x64_sys_recvmsg+0x51/0x70
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Since only the read is lockless, load tearing could introduce a logic
    bug in unix_recvq_full(). Fix it by adding lockless variants of
    skb_queue_len() and unix_recvq_full() that use READ_ONCE() on the read,
    paired with WRITE_ONCE() on the write, similar to
    commit d7d16a89350a ("net: add skb_queue_empty_lockless()").

    Signed-off-by: Qian Cai
    Signed-off-by: David S. Miller

    Qian Cai
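
    The read/write pairing can be approximated in userspace with C11 relaxed
    atomics. This is an analogy under stated assumptions: READ_ONCE() and
    WRITE_ONCE() are volatile accesses in the kernel, not C11 atomics, and
    struct pkt_queue and the three functions below are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sk_buff_head; only qlen is modelled. */
struct pkt_queue {
    atomic_uint qlen;
};

/* Writer side: in the real code this runs with the queue lock held. */
void queue_set_len(struct pkt_queue *q, unsigned int len)
{
    atomic_store_explicit(&q->qlen, len, memory_order_relaxed);
}

/* Reader side: may run without the lock, so the load must not tear. */
unsigned int queue_len_lockless(struct pkt_queue *q)
{
    return atomic_load_explicit(&q->qlen, memory_order_relaxed);
}

/* Analogue of a "receive queue full" check built on the lockless read. */
bool recvq_full_lockless(struct pkt_queue *q, unsigned int max_backlog)
{
    return queue_len_lockless(q) > max_backlog;
}
```

    The value read may still be stale by the time it is used; the point is
    only that it is a value the writer actually stored.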
     
  • Pull ceph fixes from Ilya Dryomov:

    - a set of patches that fixes various corner cases in mount and umount
    code (Xiubo Li). This has to do with choosing an MDS, distinguishing
    between laggy and down MDSes and parsing the server path.

    - inode initialization fixes (Jeff Layton). The one included here
    mostly concerns things like open_by_handle() and there is another one
    that will come through Al.

    - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
    The existing copy-from op turned out to be infeasible for generic
    filesystem use; we disable the copy offload if OSDs don't support
    copy-from2.

    - a patch to link "rbd" and "block" devices together in sysfs (Hannes
    Reinecke)

    ... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

    * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
    rbd: set the 'device' link in sysfs
    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
    ceph: print name of xattr in __ceph_{get,set}xattr() douts
    ceph: print r_direct_hash in hex in __choose_mds() dout
    ceph: use copy-from2 op in copy_file_range
    ceph: close holes in structs ceph_mds_session and ceph_mds_request
    rbd: work around -Wuninitialized warning
    ceph: allocate the correct amount of extra bytes for the session features
    ceph: rename get_session and switch to use ceph_get_mds_session
    ceph: remove the extra slashes in the server path
    ceph: add possible_max_rank and make the code more readable
    ceph: print dentry offset in hex and fix xattr_version type
    ceph: only touch the caps which have the subset mask requested
    ceph: don't clear I_NEW until inode metadata is fully populated
    ceph: retry the same mds later after the new session is opened
    ceph: check availability of mds cluster on mount after wait timeout
    ceph: keep the session state until it is released
    ceph: add __send_request helper
    ceph: ensure we have a new cap before continuing in fill_inode
    ceph: drop unused ttl_from parameter from fill_inode
    ...

    Linus Torvalds
     
  • Turns out that when we accept a new subflow, the newly created
    inet_sk(tcp_sk)->pinet6 points at the ipv6_pinfo structure of the
    listener socket.

    This wasn't caught by the selftest because it closes the accepted fd
    before the listening one.

    Adding a close(listenfd) after accept() returns is enough to trigger it:
    BUG: KASAN: use-after-free in inet6_getname+0x6ba/0x790
    Read of size 1 at addr ffff88810e310866 by task mptcp_connect/2518
    Call Trace:
    inet6_getname+0x6ba/0x790
    __sys_getpeername+0x10b/0x250
    __x64_sys_getpeername+0x6f/0xb0

    Also alter the test program to exercise this.

    Reported-by: Christoph Paasch
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

05 Feb, 2020

3 commits

  • Commit fdd41ec21e15 ("devlink: Return right error code in case of errors
    for region read") modified the region read code to report errors
    properly in unexpected cases.

    In the case where the start_offset and ret_offset match, it unilaterally
    converted this into an error. This causes an issue for the "dump"
    version of the command. In this case, the devlink region dump will
    always report an invalid argument:

    000000000000ffd0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    000000000000ffe0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    devlink answers: Invalid argument
    000000000000fff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

    This occurs because the expected flow for the dump is to return 0 after
    there is no further data.

    The simplest fix would be to stop converting the error code to -EINVAL
    if start_offset == ret_offset. However, avoid unnecessary work by
    checking for when start_offset is larger than the region size and
    returning 0 upfront.

    Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read")
    Signed-off-by: Jacob Keller
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
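
    The early return can be illustrated with a small chunked reader. This is
    a sketch, not the devlink code: region_read_chunk() and its parameters are
    hypothetical, and the ">=" comparison is an assumption based on the
    wording above.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical chunked reader: "reads" up to chunk bytes starting at
 * *offset and advances *offset. Returns 0 on success, including the
 * case where there is no more data, and -EINVAL when a read makes no
 * progress.
 */
int region_read_chunk(size_t region_size, size_t chunk, size_t *offset)
{
    size_t start = *offset;
    size_t n;

    /* A dump ends by asking at or past the end: not an error. */
    if (start >= region_size)
        return 0;

    n = region_size - start;
    if (n > chunk)
        n = chunk;
    *offset = start + n;

    /* Only now does "no progress" indicate a real problem. */
    if (*offset == start)
        return -EINVAL;
    return 0;
}
```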
     
  • Jakub noticed there is a potential resource leak in
    tcindex_set_parms(): when tcindex_filter_result_init() fails,
    it jumps to 'errout1', which doesn't release the memory
    and resources allocated by tcindex_alloc_perfect_hash().

    We should just jump to 'errout_alloc' which calls
    tcindex_free_perfect_hash().

    Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
    Reported-by: Jakub Kicinski
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
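
    The error-path pattern at issue looks roughly like the following
    userspace sketch. struct params, init_result() and set_params() are
    hypothetical; only the label name errout_alloc is borrowed from the
    commit message.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the filter parameters. */
struct params {
    int *perfect;       /* table allocated up front */
    size_t nentries;
};

/* Hypothetical init step; fails on request to exercise the error path. */
int init_result(int should_fail)
{
    return should_fail ? -ENOMEM : 0;
}

int set_params(struct params *p, size_t nentries, int init_fails)
{
    int err;

    p->perfect = calloc(nentries, sizeof(*p->perfect));
    if (!p->perfect)
        return -ENOMEM;
    p->nentries = nentries;

    err = init_result(init_fails);
    if (err < 0)
        goto errout_alloc;  /* must undo the allocation above */

    return 0;

errout_alloc:
    free(p->perfect);
    p->perfect = NULL;
    p->nentries = 0;
    return err;
}
```

    Jumping to a label placed after the free() calls would return the error
    but leave the table allocated, which is the leak described above.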
     
  • When an mptcp socket connects to a tcp peer or when a middlebox interferes
    with tcp options, mptcp needs to fall back to plain tcp.
    Problem is that mptcp is trying to be too clever in this case:

    It attempts to close the mptcp meta sk and transparently replace it with
    the (only) subflow tcp sk.

    Unfortunately, this is racy -- the socket is already exposed to userspace.
    Any parallel calls to send/recv/setsockopt etc. can cause use-after-free:

    BUG: KASAN: use-after-free in atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
    CPU: 1 PID: 2083 Comm: syz-executor.1 Not tainted 5.5.0 #2
    atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
    queued_spin_lock include/asm-generic/qspinlock.h:78 [inline]
    do_raw_spin_lock include/linux/spinlock.h:181 [inline]
    __raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
    _raw_spin_lock_bh+0x71/0xd0 kernel/locking/spinlock.c:175
    spin_lock_bh include/linux/spinlock.h:343 [inline]
    __lock_sock+0x105/0x190 net/core/sock.c:2414
    lock_sock_nested+0x10f/0x140 net/core/sock.c:2938
    lock_sock include/net/sock.h:1516 [inline]
    mptcp_setsockopt+0x2f/0x1f0 net/mptcp/protocol.c:800
    __sys_setsockopt+0x152/0x240 net/socket.c:2130
    __do_sys_setsockopt net/socket.c:2146 [inline]
    __se_sys_setsockopt net/socket.c:2143 [inline]
    __x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
    do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    While the use-after-free can be resolved, there is another problem:
    sock->ops and sock->sk assignments are not atomic, i.e. we may get calls
    into mptcp functions with sock->sk already pointing at the subflow socket,
    or calls into tcp functions with a mptcp meta sk.

    Remove the fallback code and call the relevant functions for the (only)
    subflow in case the mptcp socket is connected to a tcp peer.

    Reported-by: Christoph Paasch
    Diagnosed-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Tested-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Florian Westphal
     

04 Feb, 2020

7 commits

  • Pull networking fixes from David Miller:

    1) Use after free in rxrpc_put_local(), from David Howells.

    2) Fix 64-bit division error in mlxsw, from Nathan Chancellor.

    3) Make sure we clear various bits of TCP state in response to
    tcp_disconnect(). From Eric Dumazet.

    4) Fix netlink attribute policy in cls_rsvp, from Eric Dumazet.

    5) txtimer must be deleted in stmmac suspend(), from Nicolin Chen.

    6) Fix TC queue mapping in bnxt_en driver, from Michael Chan.

    7) Various netdevsim fixes from Taehee Yoo (use of uninitialized data,
    snapshot panics, stack out of bounds, etc.)

    8) cls_tcindex changes hash table size after allocating the table, fix
    from Cong Wang.

    9) Fix regression in the enforcement of session ID uniqueness in l2tp.
    We only have to enforce uniqueness for IP based tunnels not UDP
    ones. From Ridge Kennedy.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
    gtp: use __GFP_NOWARN to avoid memalloc warning
    l2tp: Allow duplicate session creation with UDP
    r8152: Add MAC passthrough support to new device
    net_sched: fix an OOB access in cls_tcindex
    qed: Remove set but not used variable 'p_link'
    tc-testing: add missing 'nsPlugin' to basic.json
    tc-testing: fix eBPF tests failure on linux fresh clones
    net: hsr: fix possible NULL deref in hsr_handle_frame()
    netdevsim: remove unused sdev code
    netdevsim: use __GFP_NOWARN to avoid memalloc warning
    netdevsim: use IS_ERR instead of IS_ERR_OR_NULL for debugfs
    netdevsim: fix stack-out-of-bounds in nsim_dev_debugfs_init()
    netdevsim: fix panic in nsim_dev_take_snapshot_write()
    netdevsim: disable devlink reload when resources are being used
    netdevsim: fix using uninitialized resources
    bnxt_en: Fix TC queue mapping.
    bnxt_en: Fix logic that disables Bus Master during firmware reset.
    bnxt_en: Fix RDMA driver failure with SRIOV after firmware reset.
    bnxt_en: Refactor logic to re-enable SRIOV after firmware reset detected.
    net: stmmac: Delete txtimer in suspend()
    ...

    Linus Torvalds
     
  • In the past it was possible to create multiple L2TPv3 sessions with the
    same session id as long as the sessions belonged to different tunnels.
    The resulting sessions had issues when used with IP encapsulated tunnels,
    but worked fine with UDP encapsulated ones. Some applications began to
    rely on this behaviour to avoid having to negotiate unique session ids.

    Some time ago a change was made to require session ids to be unique across
    all tunnels, breaking the applications making use of this "feature".

    This change relaxes the duplicate session id check to allow duplicates
    if both of the colliding sessions belong to UDP encapsulated tunnels.

    Fixes: dbdbc73b4478 ("l2tp: fix duplicate session creation")
    Signed-off-by: Ridge Kennedy
    Acked-by: James Chapman
    Signed-off-by: David S. Miller

    Ridge Kennedy
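
    The relaxed rule reduces to a two-input predicate. A minimal sketch,
    assuming a hypothetical enum encap_type and function name; the real check
    lives in the l2tp session registration path and is not reproduced here:

```c
#include <stdbool.h>

/* Hypothetical encapsulation types, mirroring the IP/UDP distinction. */
enum encap_type {
    ENCAP_IP,
    ENCAP_UDP,
};

/*
 * Decide whether a new session may reuse a session id already used by
 * an existing session in a different tunnel. Duplicates are tolerated
 * only when both tunnels are UDP encapsulated.
 */
bool duplicate_id_allowed(enum encap_type existing, enum encap_type new_encap)
{
    return existing == ENCAP_UDP && new_encap == ENCAP_UDP;
}
```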
     
  • As Eric noticed, tcindex_alloc_perfect_hash() uses cp->hash
    to compute the size of the memory allocation, but cp->hash is
    set again after the allocation, which caused an out-of-bounds
    access.

    So we have to move all cp->hash initialization and computation
    before the memory allocation. Move cp->mask and cp->shift together
    as cp->hash may need them for computation too.

    Reported-and-tested-by: syzbot+35d4dea36c387813ed31@syzkaller.appspotmail.com
    Fixes: 331b72922c5f ("net: sched: RCU cls_tcindex")
    Cc: Eric Dumazet
    Cc: John Fastabend
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Cc: Jakub Kicinski
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
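
    A userspace sketch of the "finalize the size, then allocate" ordering.
    struct table, table_alloc() and table_init() are hypothetical, and the
    default of 16 slots is arbitrary; cp->mask and cp->shift are not modelled.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the classifier data. */
struct table {
    unsigned int hash;  /* number of slots */
    int *perfect;       /* slot array sized from hash */
};

/* Allocate the slot array; its size is whatever t->hash holds right now. */
int table_alloc(struct table *t)
{
    t->perfect = calloc(t->hash, sizeof(*t->perfect));
    return t->perfect ? 0 : -1;
}

/* Settle the size first, then allocate, so every index < hash is in bounds. */
int table_init(struct table *t, unsigned int requested_hash)
{
    t->hash = requested_hash ? requested_hash : 16; /* arbitrary default */
    return table_alloc(t);
}
```

    If t->hash were raised after table_alloc() returned, code indexing up to
    the new value would run past the end of the array.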
     
  • hsr_port_get_rcu() can return NULL, so we need to be careful.

    general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
    CPU: 1 PID: 10249 Comm: syz-executor.5 Not tainted 5.5.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
    RIP: 0010:hsr_addr_is_self+0x86/0x330 net/hsr/hsr_framereg.c:44
    Code: 04 00 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 6b ff 94 f9 4c 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 3c 02 00 0f 85 75 02 00 00 48 8b 43 30 49 39 c6 49 89 47 c0 0f
    RSP: 0018:ffffc90000da8a90 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff87e0cc33
    RDX: 0000000000000006 RSI: ffffffff87e035d5 RDI: 0000000000000000
    RBP: ffffc90000da8b20 R08: ffff88808e7de040 R09: ffffed1015d2707c
    R10: ffffed1015d2707b R11: ffff8880ae9383db R12: ffff8880a689bc5e
    R13: 1ffff920001b5153 R14: 0000000000000030 R15: ffffc90000da8af8
    FS: 00007fd7a42be700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000001b32338000 CR3: 00000000a928c000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    hsr_handle_frame+0x1c5/0x630 net/hsr/hsr_slave.c:31
    __netif_receive_skb_core+0xfbc/0x30b0 net/core/dev.c:5099
    __netif_receive_skb_one_core+0xa8/0x1a0 net/core/dev.c:5196
    __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5312
    process_backlog+0x206/0x750 net/core/dev.c:6144
    napi_poll net/core/dev.c:6582 [inline]
    net_rx_action+0x508/0x1120 net/core/dev.c:6650
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082

    Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • 'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
    Hence, IS_ERR(p) is unneeded.

    The semantic patch that generates this commit is as follows:

    // <smpl>
    @@
    expression ptr;
    constant error_code;
    @@
    -IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
    +PTR_ERR(ptr) == - error_code
    // </smpl>

    Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
    Signed-off-by: Masahiro Yamada
    Cc: Julia Lawall
    Acked-by: Stephen Boyd [drivers/clk/clk.c]
    Acked-by: Bartosz Golaszewski [GPIO]
    Acked-by: Wolfram Sang [drivers/i2c]
    Acked-by: Rafael J. Wysocki [acpi/scan.c]
    Acked-by: Rob Herring
    Cc: Eric Biggers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
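
    To see why the IS_ERR() test is redundant, the error-pointer helpers can
    be mirrored in userspace. The sketch assumes the usual encoding from
    include/linux/err.h (errno values 1..4095 mapped to the top of the address
    space); the function bodies are simplified re-implementations, not the
    kernel's exact definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the integer value carried by a pointer. */
long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the last MAX_ERRNO values of the address space. */
bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

    Any pointer for which PTR_ERR(p) equals a specific -E* constant already
    lies in the range IS_ERR() tests for, so checking both adds nothing.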
     
  • The most notable change is the DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    The conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Using kmemdup_nul() is more efficient when the length is known.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker

    Trond Myklebust
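
    The efficiency argument can be illustrated with a userspace analogue.
    memdup_nul() below is a hypothetical helper modelled on what
    kmemdup_nul() is described as doing; it uses malloc() rather than a
    kernel allocator and takes no allocation flags.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of kmemdup_nul(): copy exactly len
 * bytes and append a NUL terminator. No scan for an existing NUL is
 * needed because the caller already knows the length.
 */
char *memdup_nul(const char *s, size_t len)
{
    char *buf = malloc(len + 1);

    if (!buf)
        return NULL;
    memcpy(buf, s, len);
    buf[len] = '\0';
    return buf;
}
```

    A strndup()-style copy has to look for a terminator within the first len
    bytes before it can size the buffer, which is wasted work when the length
    is already known.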