12 Apr, 2020

1 commit


09 Apr, 2020

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The main items are:

    - support for asynchronous create and unlink (Jeff Layton).

    Creates and unlinks are satisfied locally, without waiting for a
    reply from the MDS, provided the client has been granted
    appropriate caps (new in v15.y.z ("Octopus") release). This can be
    a big help for metadata heavy workloads such as tar and rsync.
    Opt-in with the new nowsync mount option.

    - multiple blk-mq queues for rbd (Hannes Reinecke and myself).

    When the driver was converted to blk-mq, we settled on a single
    blk-mq queue because of a global lock in libceph and some other
    technical debt. These have since been addressed, so allocate a
    queue per CPU to enhance parallelism.

    - don't hold onto caps that aren't actually needed (Zheng Yan).

    This has been our long-standing behavior, but it causes issues with
    some active/standby applications (synchronous I/O, stalls if the
    standby goes down, etc).

    - .snap directory timestamps consistent with ceph-fuse (Luis
    Henriques)"

    * tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
    ceph: fix snapshot directory timestamps
    ceph: wait for async creating inode before requesting new max size
    ceph: don't skip updating wanted caps when cap is stale
    ceph: request new max size only when there is auth cap
    ceph: cleanup return error of try_get_cap_refs()
    ceph: return ceph_mdsc_do_request() errors from __get_parent()
    ceph: check all mds' caps after page writeback
    ceph: update i_requested_max_size only when sending cap msg to auth mds
    ceph: simplify calling of ceph_get_fmode()
    ceph: remove delay check logic from ceph_check_caps()
    ceph: consider inode's last read/write when calculating wanted caps
    ceph: always renew caps if mds_wanted is insufficient
    ceph: update dentry lease for async create
    ceph: attempt to do async create when possible
    ceph: cache layout in parent dir on first sync create
    ceph: add new MDS req field to hold delegated inode number
    ceph: decode interval_sets for delegated inos
    ceph: make ceph_fill_inode non-static
    ceph: perform asynchronous unlink if we have sufficient caps
    ceph: don't take refs to want mask unless we have all bits
    ...

    Linus Torvalds
     

08 Apr, 2020

4 commits

  • Now that the kernel specifies binutils 2.23 as the minimum version, we
    can remove ifdefs for AVX2 and ADX throughout.

    Signed-off-by: Jason A. Donenfeld
    Acked-by: Ingo Molnar
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Masahiro Yamada

    Jason A. Donenfeld
     
  • Doing this probing inside of the Makefiles means we have a maze of
    ifdefs inside the source code and child Makefiles that need to make
    proper decisions on this too. Instead, we do it at Kconfig time, like
    many other compiler and assembler options, which allows us to set up the
    dependencies normally for full compilation units. In the process, the
    ADX test changes to use %eax instead of %r10 so that it's valid in both
    32-bit and 64-bit mode.

    Signed-off-by: Jason A. Donenfeld
    Acked-by: Ingo Molnar
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Masahiro Yamada

    Jason A. Donenfeld
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - Fix a page leak in nfs_destroy_unlinked_subrequests()

    - Fix use-after-free issues in nfs_pageio_add_request()

    - Fix new mount code constant_table array definitions

    - finish_automount() requires us to hold 2 refs to the mount record

    Features:
    - Improve the accuracy of telldir/seekdir by using 64-bit cookies
    when possible.

    - Allow one RDMA active connection and several zombie connections to
    prevent blocking if the remote server is unresponsive.

    - Limit the size of the NFS access cache by default

    - Reduce the number of references to credentials that are taken by
    NFS

    - pNFS files and flexfiles drivers now support per-layout segment
    COMMIT lists.

    - Enable partial-file layout segments in the pNFS/flexfiles driver.

    - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type

    - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
    the DS using the layouterror mechanism.

    Bugfixes and cleanups:
    - SUNRPC: Fix krb5p regressions

    - Don't specify NFS version in "UDP not supported" error

    - nfsroot: set tcp as the default transport protocol

    - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()

    - alloc_nfs_open_context() must use the file cred when available

    - Fix locking when dereferencing the delegation cred

    - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails

    - Various clean ups of the NFS O_DIRECT commit code

    - Clean up RDMA connect/disconnect

    - Replace zero-length arrays with C99-style flexible arrays"

    * tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
    NFS: Clean up process of marking inode stale.
    SUNRPC: Don't start a timer on an already queued rpc task
    NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
    NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
    NFS: Beware when dereferencing the delegation cred
    NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
    NFS: finish_automount() requires us to hold 2 refs to the mount record
    NFS: Fix a few constant_table array definitions
    NFS: Try to join page groups before an O_DIRECT retransmission
    NFS: Refactor nfs_lock_and_join_requests()
    NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
    NFS: Clean up nfs_lock_and_join_requests()
    NFS: Remove the redundant function nfs_pgio_has_mirroring()
    NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
    NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
    NFS: Fix use-after-free issues in nfs_pageio_add_request()
    NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
    NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
    NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
    pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
    ...

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Slave bond and team devices should not be assigned ipv6 link local
    addresses, from Jarod Wilson.

    2) Fix clock sink config on some at803x PHY devices, from Oleksij
    Rempel.

    3) Uninitialized stack space transmitted in slcan frames, fix from
    Richard Palethorpe.

    4) Guard HW VLAN ops properly in stmmac driver, from Jose Abreu.

    5) "=" --> "|=" fix in aquantia driver, from Colin Ian King.

    6) Fix TCP fallback in mptcp, from Florian Westphal. (accessing a plain
    tcp_sk as if it were an mptcp socket).

    7) Fix cavium driver in some configurations wrt. PTP, from Yue Haibing.

    8) Make ipv6 and ipv4 consistent in the lower bound allowed for
    neighbour entry retrans_time, from Hangbin Liu.

    9) Don't use private workqueue in pegasus usb driver, from Petko
    Manolov.

    10) Fix integer overflow in mlxsw, from Colin Ian King.

    11) Missing refcnt init in cls_tcindex, from Cong Wang.

    12) One too many loop iterations when processing cmpri entries in ipv6
    rpl code, from Alexander Aring.

    13) Disable SG and TSO by default in r8169, from Heiner Kallweit.

    14) NULL deref in macsec, from Davide Caratti.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
    macsec: fix NULL dereference in macsec_upd_offload()
    skbuff.h: Improve the checksum related comments
    net: dsa: bcm_sf2: Ensure correct sub-node is parsed
    qed: remove redundant assignment to variable 'rc'
    wimax: remove some redundant assignments to variable result
    mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_VLAN_MANGLE
    mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_PRIORITY
    r8169: change back SG and TSO to be disabled by default
    net: dsa: bcm_sf2: Do not register slave MDIO bus with OF
    ipv6: rpl: fix loop iteration
    tun: Don't put_page() for all negative return values from XDP program
    net: dsa: mt7530: fix null pointer dereferencing in port5 setup
    mptcp: add some missing pr_fmt defines
    net: phy: micrel: kszphy_resume(): add delay after genphy_resume() before accessing PHY registers
    net_sched: fix a missing refcnt in tcindex_init()
    net: stmmac: dwmac1000: fix out-of-bounds mac address reg setting
    mlxsw: spectrum_trap: fix unintention integer overflow on left shift
    pegasus: Remove pegasus' own workqueue
    neigh: support smaller retrans_time settting
    net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry
    ...

    Linus Torvalds
     

07 Apr, 2020

1 commit

  • This patch fixes the loop iteration by not walking over the last
    iteration. The cmpri compression value does not apply to the last
    segment: as the code shows, the last iteration is overwritten by the
    cmpre value handling, which is for the last segment.

    I think this doesn't cause any buffer overflows, because we work with
    worst-case temporary buffer sizes, but in some cases it results in
    suboptimal compression settings.
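
    As an illustration only, the off-by-one can be modelled in a few lines
    of Python. This is a toy sketch, not the kernel code: the helper names
    (common_prefix_len, calc_cmpri, calc_cmpre), the include_last switch and
    the sample addresses are all assumptions made up for the example. It
    shows how walking over the last segment can lower cmpri even though that
    segment is covered by cmpre.

```python
from typing import Sequence


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by two addresses."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def calc_cmpri(daddr: bytes, segments: Sequence[bytes],
               include_last: bool = False) -> int:
    """Shared-prefix length usable for every segment except the last.

    include_last=True mimics a loop bound that also walks over the final
    segment (the behaviour the patch removes, as modelled here).
    """
    walked = segments if include_last else segments[:-1]
    # Assumption of this sketch: with nothing to walk, use the full length.
    return min((common_prefix_len(daddr, s) for s in walked),
               default=len(daddr))


def calc_cmpre(daddr: bytes, segments: Sequence[bytes]) -> int:
    """Shared-prefix length for the last segment only."""
    return common_prefix_len(daddr, segments[-1])


# Hypothetical addresses: two segments close to daddr, last one far away.
daddr = bytes.fromhex("20010db8" + "00" * 11 + "01")
segs = [
    bytes.fromhex("20010db8" + "00" * 11 + "02"),
    bytes.fromhex("20010db8" + "00" * 11 + "03"),
    bytes.fromhex("20010db8" + "ff" * 12),
]

print("cmpri, last segment excluded:", calc_cmpri(daddr, segs))
print("cmpri, last segment included:", calc_cmpri(daddr, segs, include_last=True))
print("cmpre:", calc_cmpre(daddr, segs))
```

    In this model the dissimilar last segment drags cmpri down only when it
    is included in the walk, which matches the "not best compression"
    symptom described above.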

    Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr")
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     

06 Apr, 2020

1 commit

  • Pull 9p updates from Dominique Martinet:
    "Not much new, but a few patches for this cycle:

    - Fix read with O_NONBLOCK to allow incomplete read and return
    immediately

    - Rest is just cleanup (indent, unused field in struct, extra
    semicolon)"

    * tag '9p-for-5.7' of git://github.com/martinetd/linux:
    net/9p: remove unused p9_req_t aux field
    9p: read only once on O_NONBLOCK
    9pnet: allow making incomplete read requests
    9p: Remove unneeded semicolon
    9p: Fix Kconfig indentation

    Linus Torvalds
     

05 Apr, 2020

3 commits

  • Move the test for whether a task is already queued to prevent
    corruption of the timer list in __rpc_sleep_on_priority_timeout().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull keyrings fixes from David Howells:
    "Here's a couple of patches that fix a circular dependency between
    holding key->sem and mm->mmap_sem when reading data from a key.

    One potential issue is that a filesystem looking to use a key inside,
    say, ->readpages() could deadlock if the key being read is the key
    that's required and the buffer the key is being read into is on a page
    that needs to be fetched.

    The case actually detected is a bit more involved - with a filesystem
    calling request_key() and locking the target keyring for write - which
    could be being read"

    * tag 'keys-fixes-20200329' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    KEYS: Avoid false positive ENOMEM error on key read
    KEYS: Don't write out to userspace while holding key semaphore

    Linus Torvalds
     
  • Pull nfsd updates from Chuck Lever:

    - Fix EXCHANGE_ID response when NFSD runs in a container

    - A battery of new static trace points

    - Socket transports now use bio_vec to send Replies

    - NFS/RDMA now supports filesystems with no .splice_read method

    - Favor memcpy() over DMA mapping for small RPC/RDMA Replies

    - Add pre-requisites for supporting multiple Write chunks

    - Numerous minor fixes and clean-ups

    [ Chuck is filling in for Bruce this time while he and his family settle
    into a new house ]

    * tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6: (39 commits)
    svcrdma: Fix leak of transport addresses
    SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
    SUNRPC/cache: don't allow invalid entries to be flushed
    nfsd: fsnotify on rmdir under nfsd/clients/
    nfsd4: kill warnings on testing stateids with mismatched clientids
    nfsd: remove read permission bit for ctl sysctl
    NFSD: Fix NFS server build errors
    sunrpc: Add tracing for cache events
    SUNRPC/cache: Allow garbage collection of invalid cache entries
    nfsd: export upcalls must not return ESTALE when mountd is down
    nfsd: Add tracepoints for update of the expkey and export cache entries
    nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
    nfsd: Add tracing to nfsd_set_fh_dentry()
    nfsd: Don't add locks to closed or closing open stateids
    SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
    SUNRPC: Refactor xs_sendpages()
    svcrdma: Avoid DMA mapping small RPC Replies
    svcrdma: Fix double sync of transport header buffer
    svcrdma: Refactor chunk list encoders
    SUNRPC: Add encoders for list item discriminators
    ...

    Linus Torvalds
     

04 Apr, 2020

3 commits

  • Some of the mptcp logs didn't print out the format string:

    [ 185.651493] DSS
    [ 185.651494] data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1
    [ 185.651494] data_ack=13792750332298763796
    [ 185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=0 skb=0000000063dc595d
    [ 185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 status=0
    [ 185.651495] MPTCP: msk ack_seq=9bbc894565aa2f9a subflow ack_seq=9bbc894565aa2f9a
    [ 185.651496] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=1 skb=0000000012e809e1

    This patch adds the missing pr_fmt defines, so that all mptcp logs carry
    the same "MPTCP" prefix, like this:

    [ 142.795829] MPTCP: DSS
    [ 142.795829] MPTCP: data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1
    [ 142.795829] MPTCP: data_ack=8089704603109242421
    [ 142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=0 skb=00000000d5f230df
    [ 142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 status=0
    [ 142.795831] MPTCP: msk ack_seq=66790290f1199d9b subflow ack_seq=66790290f1199d9b
    [ 142.795831] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=1 skb=00000000de5aca2e

    Signed-off-by: Geliang Tang
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • The initial refcnt of struct tcindex_data should be 1,
    it is clear that I forgot to set it to 1 in tcindex_init().
    This leads to a dec-after-zero warning.
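
    A toy model of the problem, for illustration only: ToyRefcount below is
    a hypothetical stand-in written for this example and is not the kernel's
    reference counting API. It shows why the counter has to start at 1: the
    first put on a counter that was left at 0 is already a decrement after
    zero.

```python
class ToyRefcount:
    """Toy stand-in for a reference counter (illustration only)."""

    def __init__(self, initial: int = 0) -> None:
        self.count = initial

    def get(self) -> None:
        """Take one more reference."""
        self.count += 1

    def put(self) -> bool:
        """Drop one reference; return True when the last one is gone."""
        if self.count <= 0:
            # The kernel would warn here; the toy model raises instead.
            raise RuntimeError("decrement after zero")
        self.count -= 1
        return self.count == 0


initialised = ToyRefcount(initial=1)   # what the init step should do
print("last reference dropped:", initialised.put())

forgotten = ToyRefcount()              # init step forgot to set the count
try:
    forgotten.put()
except RuntimeError as exc:
    print("error:", exc)
```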

    Reported-by: syzbot+8325e509a1bf83ec741d@syzkaller.appspotmail.com
    Fixes: 304e024216a8 ("net_sched: add a temporary refcnt for struct tcindex_data")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Cc: Paul E. McKenney
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Pull SPDX updates from Greg KH:
    "Here are three SPDX patches for 5.7-rc1.

    One fixes up the SPDX tag for a single driver, while the other two go
    through the tree and add SPDX tags for all of the .gitignore files as
    needed.

    Nothing too complex, but you will get a merge conflict with your
    current tree, that should be trivial to handle (one file modified by
    two things, one file deleted.)

    All three of these have been in linux-next for a while, with no
    reported issues other than the merge conflict"

    * tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
    ASoC: MT6660: make spdxcheck.py happy
    .gitignore: add SPDX License Identifier
    .gitignore: remove too obvious comments

    Linus Torvalds
     

03 Apr, 2020

3 commits

  • Currently, retrans_time is limited to values greater than HZ/2, i.e.
    setting retrans_time to less than 500ms does not work. This prevents
    users from achieving finer control over bonding ARP fast failover.

    Update the sanity check to HZ/100, which is 10ms, to give users finer
    control over retrans_time.
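
    The arithmetic behind the two bounds can be checked with a short Python
    sketch. The helper names (jiffies_to_ms, bound_ms) and the sample HZ
    values are assumptions chosen for illustration; integer division is
    used to mirror what HZ/2 and HZ/100 evaluate to in C.

```python
def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Convert a jiffies count to milliseconds for a given tick rate."""
    return jiffies * 1000 / hz


def bound_ms(hz: int, divisor: int) -> float:
    """Lower bound HZ/divisor (integer division, as in C) in milliseconds."""
    return jiffies_to_ms(hz // divisor, hz)


for hz in (100, 250, 1000):
    old = bound_ms(hz, 2)      # previous bound: HZ/2
    new = bound_ms(hz, 100)    # updated bound: HZ/100
    print(f"HZ={hz}: old bound {old} ms, new bound {new} ms")
```

    Note that with a tick rate that is not a multiple of 100 (HZ=250 in the
    loop above), integer division truncates HZ/100 to 2 jiffies, so the
    bound works out to 8ms rather than exactly 10ms.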

    v3: sync the behavior with IPv6 and update all the timer handler
    v2: use HZ instead of a hard-coded number

    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • struct sw_flow entries are protected by RCU, so use
    hlist_for_each_entry_rcu when traversing them.

    Signed-off-by: Tonghao Zhang
    Tested-by: Greg Rose
    Reviewed-by: Greg Rose
    Signed-off-by: David S. Miller

    Tonghao Zhang
     
  • Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a
    non-root user to bind a socket to an interface if it is not already
    bound. This is useful to allow an application to bind itself to a
    specific VRF for outgoing or incoming connections. Currently, an
    application wanting to manage connections through several VRFs needs to
    be privileged.

    Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for
    Wine (76e21053b5bf3 and c4062dfc425e9) specifically for use by
    non-root processes. However, they are restricted to sendmsg() and not
    usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to
    get the same privilege. As for TCP servers, outside the VRF use case,
    SO_BINDTODEVICE would only further restrict connections a server could
    accept.

    When an application is restricted to a VRF (with `ip vrf exec`), the
    socket is bound to an interface at creation and therefore, a
    non-privileged call to SO_BINDTODEVICE to escape the VRF fails.

    When an application binds a socket with SO_BINDTODEVICE and passes it
    to a non-privileged process through a Unix socket, an attempt to
    change the bound device also fails.

    Before:

    >>> import socket
    >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    PermissionError: [Errno 1] Operation not permitted

    After:

    >>> import socket
    >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    PermissionError: [Errno 1] Operation not permitted
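
    For scripts that need to cope with both outcomes, the calls above can be
    wrapped in a small helper. This is a minimal sketch: the function name
    try_bind_to_device is hypothetical, and "dummy0" is simply the interface
    name used in the examples above. It relies on socket.SO_BINDTODEVICE,
    which Python exposes on Linux.

```python
import socket


def try_bind_to_device(sock: socket.socket, ifname: str) -> bool:
    """Return True if SO_BINDTODEVICE succeeded, False on any OSError."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        ifname.encode())
    except OSError:
        # Covers "Operation not permitted" as well as a missing interface.
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # First attempt on an unbound socket, then an attempt to rebind.
        print("first bind:", try_bind_to_device(s, "dummy0"))
        print("second bind:", try_bind_to_device(s, "dummy0"))
```

    The result depends on the kernel version, the privileges of the caller
    and whether the interface exists, so no output is shown here.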

    Signed-off-by: Vincent Bernat
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Vincent Bernat
     

02 Apr, 2020

8 commits

  • Obtained with:

    $ make W=1 net/mptcp/token.o
    net/mptcp/token.c:53: warning: Function parameter or member 'req' not described in 'mptcp_token_new_request'
    net/mptcp/token.c:98: warning: Function parameter or member 'sk' not described in 'mptcp_token_new_connect'
    net/mptcp/token.c:133: warning: Function parameter or member 'conn' not described in 'mptcp_token_new_accept'
    net/mptcp/token.c:178: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy_request'
    net/mptcp/token.c:191: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy'

    Fixes: 79c0949e9a09 (mptcp: Add key generation and token tree)
    Fixes: 58b09919626b (mptcp: create msk early)
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Matthieu Baerts
     
  • mptcp_subflow_data_available() is commonly called via
    ssk->sk_data_ready(), in this case the mptcp socket lock
    cannot be acquired.

    Therefore, while we can safely discard subflow data that
    was already received up to msk->ack_seq, we cannot be sure
    that 'subflow->data_avail' will still be valid at the time
    userspace wants to read the data -- a previous read on a
    different subflow might have carried this data already.

    In that (unlikely) event, msk->ack_seq will have been updated
    and will be ahead of the subflow dsn.

    We can check for this condition and skip/resync to the expected
    sequence number.
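
    The check can be pictured with a small Python model. This is only a
    sketch under stated assumptions: the names (seq_before, bytes_to_skip,
    map_seq, map_len) are hypothetical, and a wrapping 64-bit sequence space
    is assumed. It is not the code from the patch.

```python
MASK64 = (1 << 64) - 1


def seq_before(a: int, b: int) -> bool:
    """True if a precedes b in a wrapping 64-bit sequence space."""
    return ((a - b) & MASK64) >> 63 == 1


def bytes_to_skip(map_seq: int, map_len: int, ack_seq: int) -> int:
    """Leading bytes of a subflow mapping that were already delivered.

    map_seq is the data sequence number at the start of the mapping,
    map_len its length, and ack_seq the next sequence number expected
    at the connection level.
    """
    if not seq_before(map_seq, ack_seq):
        return 0  # mapping starts at or after the expected sequence number
    stale = (ack_seq - map_seq) & MASK64
    return min(stale, map_len)


print(bytes_to_skip(map_seq=100, map_len=50, ack_seq=100))
print(bytes_to_skip(map_seq=100, map_len=50, ack_seq=120))
print(bytes_to_skip(map_seq=100, map_len=50, ack_seq=500))
```

    When the result equals map_len, the whole mapping is stale and can be
    dropped; a smaller non-zero result means only the head of the mapping
    has to be skipped before reading resumes at the expected sequence number.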

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This is needed at least until proper MPTCP-Level fin/reset
    signalling gets added:

    We wake parent when a subflow changes, but we should do this only
    when all subflows have closed, not just one.

    Schedule the mptcp worker and tell it to check eof state on all
    subflows.

    Only flag mptcp socket as closed and wake userspace processes blocking
    in poll if all subflows have closed.
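
    The rule in the last paragraph boils down to an "all subflows" test. A
    toy Python model, with a hypothetical Subflow class and function name
    invented for this illustration:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Subflow:
    """Toy subflow that only tracks whether it has reached EOF."""
    closed: bool = False


def parent_should_close(subflows: Iterable[Subflow]) -> bool:
    """Flag the parent as closed only once every subflow has closed."""
    return all(sf.closed for sf in subflows)


flows = [Subflow(closed=True), Subflow(closed=False)]
print(parent_should_close(flows))   # one subflow is still open

flows[1].closed = True
print(parent_should_close(flows))   # now every subflow has closed
```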

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Christoph Paasch reports the following crash:

    general protection fault [..]
    CPU: 0 PID: 2874 Comm: syz-executor072 Not tainted 5.6.0-rc5 #62
    RIP: 0010:__pv_queued_spin_lock_slowpath kernel/locking/qspinlock.c:471
    [..]
    queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
    do_raw_spin_lock include/linux/spinlock.h:181 [inline]
    spin_lock_bh include/linux/spinlock.h:343 [inline]
    __mptcp_flush_join_list+0x44/0xb0 net/mptcp/protocol.c:278
    mptcp_shutdown+0xb3/0x230 net/mptcp/protocol.c:1882
    [..]

    The problem is that the socket passed to mptcp_shutdown() isn't an mptcp
    socket, it's a plain tcp_sk. Thus, trying to access mptcp_sk specific
    members accesses garbage.

    The root cause is that accept() returns a fallback (tcp) socket, not an
    mptcp one. There is code in getpeername to detect this and override the
    socket's stream_ops. But this will only run when the accept() caller
    provided a sockaddr struct. "accept(fd, NULL, 0)" will therefore result in
    mptcp stream ops, but with sock->sk pointing at a tcp_sk.

    Update the existing fallback handling to detect this as well.

    Moreover, mptcp_shutdown did not have fallback handling, and
    mptcp_poll did it too late so add that there as well.

    Reported-by: Christoph Paasch
    Tested-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The variable err is being initialized with a value that is never
    read and it is being updated later with a new value. The initialization
    is redundant and can be removed.

    Addresses-Coverity: ("Unused value")
    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • Fixes: f41071407c85 ("net: dsa: implement auto-normalization of MTU for bridge hardware datapath")
    Signed-off-by: kbuild test robot
    Signed-off-by: David S. Miller

    kbuild test robot
     
  • Bonding slave and team port devices should not have link-local addresses
    automatically added to them, as it can interfere with openvswitch being
    able to properly add tc ingress.

    Basic reproducer, courtesy of Marcelo:

    $ ip link add name bond0 type bond
    $ ip link set dev ens2f0np0 master bond0
    $ ip link set dev ens2f1np2 master bond0
    $ ip link set dev bond0 up
    $ ip a s
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
    2: ens2f0np0: mtu 1500 qdisc
    mq master bond0 state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    5: ens2f1np2: mtu 1500 qdisc
    mq master bond0 state DOWN group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    11: bond0: mtu 1500 qdisc
    noqueue state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link
    valid_lft forever preferred_lft forever

    (above trimmed to relevant entries, obviously)

    $ sysctl net.ipv6.conf.ens2f0np0.addr_gen_mode=0
    net.ipv6.conf.ens2f0np0.addr_gen_mode = 0
    $ sysctl net.ipv6.conf.ens2f1np2.addr_gen_mode=0
    net.ipv6.conf.ens2f1np2.addr_gen_mode = 0

    $ ip a l ens2f0np0
    2: ens2f0np0: mtu 1500 qdisc
    mq master bond0 state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
    valid_lft forever preferred_lft forever
    $ ip a l ens2f1np2
    5: ens2f1np2: mtu 1500 qdisc
    mq master bond0 state DOWN group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
    valid_lft forever preferred_lft forever

    It looks like addrconf_sysctl_addr_gen_mode() bypasses the original "is
    this a slave interface?" check added by commit c2edacf80e15, and
    results in an address getting added, while with the proposed patch
    applied, no address gets added. This simply adds the same gating check
    to another code path, and thus should prevent the same devices from
    erroneously obtaining an ipv6 link-local address.

    Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode")
    Reported-by: Moshe Levi
    CC: Stephen Hemminger
    CC: Marcelo Ricardo Leitner
    CC: netdev@vger.kernel.org
    Signed-off-by: Jarod Wilson
    Signed-off-by: David S. Miller

    Jarod Wilson
     
  • Although we intentionally use an ordered workqueue for all tc
    filter works, the ordering is not guaranteed by RCU work,
    given that tcf_queue_work() is essentially a call_rcu().

    This problem is demonstrated by Thomas:

    CPU 0:
    tcf_queue_work()
    tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);

    -> Migration to CPU 1

    CPU 1:
    tcf_queue_work(&p->rwork, tcindex_destroy_work);

    so the 2nd work could be queued before the 1st one, which leads
    to a free-after-free.

    Enforcing this order in RCU work is hard, as it requires changing
    RCU code too. Fortunately, we can work around this problem in the tcindex
    filter by taking a temporary refcnt: we only refcnt it right before
    we begin to destroy it. This simplifies the code a lot, as a full
    refcnt requires many more changes in tcindex_set_parms().

    Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
    Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()")
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Reviewed-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Cong Wang
     

01 Apr, 2020

4 commits

  • Pull networking updates from David Miller:
    "Highlights:

    1) Fix the iwlwifi regression, from Johannes Berg.

    2) Support BSS coloring and 802.11 encapsulation offloading in
    hardware, from John Crispin.

    3) Fix some potential Spectre issues in qtnfmac, from Sergey
    Matyukevich.

    4) Add TTL decrement action to openvswitch, from Matteo Croce.

    5) Allow parallelization through flow_action setup by not taking the
    RTNL mutex, from Vlad Buslov.

    6) A lot of zero-length array to flexible-array conversions, from
    Gustavo A. R. Silva.

    7) Align XDP statistics names across several drivers for consistency,
    from Lorenzo Bianconi.

    8) Add various pieces of infrastructure for offloading conntrack, and
    make use of it in mlx5 driver, from Paul Blakey.

    9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

    10) Lots of parallelization improvements during configuration changes
    in mlxsw driver, from Ido Schimmel.

    11) Add support to devlink for generic packet traps, which report
    packets dropped during ACL processing. And use them in mlxsw
    driver. From Jiri Pirko.

    12) Support bcmgenet on ACPI, from Jeremy Linton.

    13) Make BPF compatible with RT, from Thomas Gleixner, Alexei
    Starovoitov, and yours truly.

    14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

    15) Fix sysfs permissions when network devices change namespaces, from
    Christian Brauner.

    16) Add a flags element to ethtool_ops so that drivers can more simply
    indicate which coalescing parameters they actually support, and
    therefore the generic layer can validate the user's ethtool
    request. Use this in all drivers, from Jakub Kicinski.

    17) Offload FIFO qdisc in mlxsw, from Petr Machata.

    18) Support UDP sockets in sockmap, from Lorenz Bauer.

    19) Fix stretch ACK bugs in several TCP congestion control modules,
    from Pengcheng Yang.

    20) Support virtual functions in octeontx2 driver, from Tomasz
    Duszynski.

    21) Add region operations for devlink and use it in ice driver to dump
    NVM contents, from Jacob Keller.

    22) Add support for hw offload of MACSEC, from Antoine Tenart.

    23) Add support for BPF programs that can be attached to LSM hooks,
    from KP Singh.

    24) Support for multiple paths, path managers, and counters in MPTCP.
    From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
    and others.

    25) More progress on adding the netlink interface to ethtool, from
    Michal Kubecek"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
    net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
    cxgb4/chcr: nic-tls stats in ethtool
    net: dsa: fix oops while probing Marvell DSA switches
    net/bpfilter: remove superfluous testing message
    net: macb: Fix handling of fixed-link node
    net: dsa: ksz: Select KSZ protocol tag
    netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
    net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
    net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
    net: stmmac: create dwmac-intel.c to contain all Intel platform
    net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
    net: dsa: bcm_sf2: Add support for matching VLAN TCI
    net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
    net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
    net: dsa: bcm_sf2: Disable learning for ASP port
    net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
    net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
    net: dsa: b53: Restore VLAN entries upon (re)configuration
    net: dsa: bcm_sf2: Fix overflow checks
    hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
    ...

    Linus Torvalds
     
  • In case memory resources for buf were allocated, release them before
    return.

    Addresses-Coverity-ID: 1492011 ("Resource leak")
    Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel")
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Fix an oops in dsa_port_phylink_mac_change() caused by a combination
    of a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA
    ports unless needed") and the net-dsa-improve-serdes-integration
    series of patches 65b7a2c8e369 ("Merge branch
    'net-dsa-improve-serdes-integration'").

    Unable to handle kernel NULL pointer dereference at virtual address 00000124
    pgd = c0004000
    [00000124] *pgd=00000000
    Internal error: Oops: 805 [#1] SMP ARM
    Modules linked in: tag_edsa spi_nor mtd xhci_plat_hcd mv88e6xxx(+) xhci_hcd armada_thermal marvell_cesa dsa_core ehci_orion libdes phy_armada38x_comphy at24 mcp3021 sfp evbug spi_orion sff mdio_i2c
    CPU: 1 PID: 214 Comm: irq/55-mv88e6xx Not tainted 5.6.0+ #470
    Hardware name: Marvell Armada 380/385 (Device Tree)
    PC is at phylink_mac_change+0x10/0x88
    LR is at mv88e6352_serdes_irq_status+0x74/0x94 [mv88e6xxx]

    Signed-off-by: Russell King
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Russell King
     
  • A testing message was brought by 13d0f7b814d9 ("net/bpfilter: fix dprintf
    usage for /dev/kmsg") but should've been deleted before patch submission.
    Although it doesn't cause any harm to the code or functionality itself, it's
    totally unpleasant to have it displayed on every loop iteration with no real
    use case. Thus remove it unconditionally.

    Fixes: 13d0f7b814d9 ("net/bpfilter: fix dprintf usage for /dev/kmsg")
    Signed-off-by: Bruno Meneguele
    Signed-off-by: David S. Miller

    Bruno Meneguele
     

31 Mar, 2020

11 commits

  • David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next:

    1) Add support to specify a stateful expression in set definitions,
    this allows users to specify e.g. counters per set elements.

    2) Flowtable software counter support.

    3) Flowtable hardware offload counter support, from wenxu.

    4) Parallelize flowtable hardware offload requests, from Paul Blakey.
    This includes a patch to add one work entry per offload command.

    5) Several patches to rework nf_queue refcount handling, from Florian
    Westphal.

    6) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
    information is missing and set up indirect flow block as TC_SETUP_FT,
    patch from wenxu.

    7) Stricter netlink attribute sanity check on filters, from Romain Bellan
    and Florent Fourcot.

    8) Annotations to make sparse happy, from Jules Irenge.

    9) Improve icmp errors in debugging information, from Haishuang Yan.

    10) Fix warning in IPVS icmp error debugging, from Haishuang Yan.

    11) Fix endianness issue in tcp extension header, from Sergey Marinkevich.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The previous patch allowed device drivers to publish their default
    binding between packet trap policers and packet trap groups. However,
    some users might not be content with this binding and would like to
    change it.

    In case user space passed a packet trap policer identifier when setting
    a packet trap group, invoke the appropriate device driver callback and
    pass the new policer identifier.

    v2:
    * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
    devlink_trap_group_set() and bail if not present
    * Add extack error message in case trap group was partially modified

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Acked-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
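
    The control flow described in this entry can be modelled in a few lines. The sketch below is a Python model, not the kernel code: `trap_group_set`, the dictionary-based group, the attribute keys and the error strings are all hypothetical names chosen for illustration. It only shows the two v2 points: bail out when no policer identifier was passed, and mention partial modification in the error when an earlier attribute was already applied.

```python
def trap_group_set(group, attrs, known_policers, driver_cb):
    """Apply a 'trap group set' request; return an error message or None.

    All names are illustrative; this is not the devlink implementation.
    """
    modified = False
    if "action" in attrs:
        group["action"] = attrs["action"]
        modified = True

    # Bail out early when user space did not pass a policer identifier.
    if "policer_id" not in attrs:
        return None

    policer_id = attrs["policer_id"]
    err = None
    if policer_id not in known_policers:
        err = "policer not found"
    elif not driver_cb(group["name"], policer_id):
        # The (hypothetical) driver callback refused the new binding.
        err = "driver rejected policer"

    if err is not None:
        if modified:
            err += "; trap group might have been partially modified"
        return err

    group["policer_id"] = policer_id
    return None
```

    In this model a request carrying only an action never reaches the driver callback, which mirrors the "bail if not present" check listed in the v2 notes.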
     
  • Packet trap groups are used to aggregate logically related packet traps.
    Currently, these groups allow user space to batch operations such as
    setting the trap action of all member traps.

    In order to prevent the CPU from being overwhelmed by too many trapped
    packets, it is desirable to bind a packet trap policer to these groups.
    For example, to limit all the packets that encountered an exception
    during routing to 10Kpps.

    Allow device drivers to bind default packet trap policers to packet trap
    groups when the latter are registered with devlink.

    The next patch will enable user space to change this default binding.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
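
    As a rough illustration of the relationship described in this entry, the Python model below registers a trap group together with a default policer binding and applies a batch action to all member traps. The class names, the group and trap names, and the rule that a default binding must name an already registered policer are assumptions of the sketch, not details taken from the patch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrapGroup:
    """A group of logically related traps (illustrative model)."""
    name: str
    policer_id: Optional[int] = None  # default binding chosen by the driver
    trap_actions: Dict[str, str] = field(default_factory=dict)

    def set_action(self, action: str) -> None:
        """Batch operation: apply one action to every member trap."""
        for trap in self.trap_actions:
            self.trap_actions[trap] = action


class Devlink:
    """Minimal registry standing in for devlink in this sketch."""

    def __init__(self) -> None:
        self.policers = set()
        self.groups = {}

    def register_policer(self, policer_id: int) -> None:
        self.policers.add(policer_id)

    def register_group(self, group: TrapGroup) -> None:
        # Assumption: a default binding must refer to a known policer.
        if group.policer_id is not None and group.policer_id not in self.policers:
            raise KeyError(f"unknown policer {group.policer_id}")
        self.groups[group.name] = group
```

    A driver in this model would register its policers first and then its groups, so that every default binding can be checked at registration time.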
     
  • Devices capable of offloading the kernel's datapath and performing
    functions such as bridging and routing must also be able to send (trap)
    specific packets to the kernel (i.e., the CPU) for processing.

    For example, a device acting as a multicast-aware bridge must be able to
    trap IGMP membership reports to the kernel for processing by the bridge
    module.

    In most cases, the underlying device is capable of handling packet rates
    that are several orders of magnitude higher compared to those that can
    be handled by the CPU.

    Therefore, in order to prevent the underlying device from overwhelming
    the CPU, devices usually include packet trap policers that are able to
    police the trapped packets to rates that can be handled by the CPU.

    This patch allows capable device drivers to register their supported
    packet trap policers with devlink. User space can then tune the
    parameters of these policers (currently, rate and burst size) and read
    from the device the number of packets that were dropped by the policer,
    if supported.

    Subsequent patches in the series will allow device drivers to create
    default binding between these policers and packet trap groups and allow
    user space to change the binding.

    v2:
    * Add 'strict_start_type' in devlink policy
    * Have device drivers provide max/min rate/burst size for each policer.
    Use them to check validity of user provided parameters

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
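
    The entry above names the two tunable policer parameters (rate and burst size), a drop counter, and driver-provided limits used to validate user input. One common way to realise such a policer is a token bucket; the Python sketch below assumes that technique, which the commit message itself does not specify. The class name, field names and default limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrapPolicer:
    """Toy token-bucket policer; all names and limits are illustrative."""
    rate: int                 # packets per second added to the bucket
    burst: int                # bucket capacity in packets
    min_rate: int = 1         # limits a driver might advertise (assumed)
    max_rate: int = 1_000_000
    min_burst: int = 1
    max_burst: int = 65_536
    dropped: int = 0          # packets dropped by the policer
    tokens: float = 0.0
    last: float = 0.0         # time of the previous refill, in seconds

    def __post_init__(self) -> None:
        self._validate(self.rate, self.burst)
        self.tokens = float(self.burst)  # start with a full bucket

    def _validate(self, rate: int, burst: int) -> None:
        if not self.min_rate <= rate <= self.max_rate:
            raise ValueError("rate out of range")
        if not self.min_burst <= burst <= self.max_burst:
            raise ValueError("burst size out of range")

    def set_params(self, rate: int, burst: int) -> None:
        """Check user-provided values against the limits, then apply them."""
        self._validate(rate, burst)
        self.rate, self.burst = rate, burst
        self.tokens = min(self.tokens, float(burst))

    def admit(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` may reach the CPU."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

    With `rate=10` and `burst=2`, two back-to-back packets are admitted and further packets are dropped and counted until the bucket refills.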
     
  • Avoid taking a reference on listen sockets by checking the socket type
    in the sk_assign and in the corresponding skb_steal_sock() code in the
    transport layer, and by ensuring that the prefetch free (sock_pfree)
    function uses the same logic to check whether the socket is refcounted.

    Suggested-by: Martin KaFai Lau
    Signed-off-by: Joe Stringer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz

    Joe Stringer
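
    The point of this entry is symmetry: the assign path and the free path must agree on whether a socket is reference counted. The Python model below illustrates that idea only. It assumes, for the sake of the sketch, that listen sockets are the one exempt case; the function names `assign` and `pfree` are stand-ins, not the kernel functions.

```python
from dataclasses import dataclass


@dataclass
class Sock:
    listening: bool = False
    refcnt: int = 1  # reference held by whoever owns the socket


def sk_is_refcounted(sk: Sock) -> bool:
    # Assumption of this sketch: only listen sockets skip refcounting.
    return not sk.listening


def assign(sk: Sock) -> Sock:
    """Attach a socket to a packet, taking a reference only if needed."""
    if sk_is_refcounted(sk):
        sk.refcnt += 1
    return sk


def pfree(sk: Sock) -> None:
    """Release path; uses the same check so the count stays balanced."""
    if sk_is_refcounted(sk):
        sk.refcnt -= 1
```

    Because both paths call the same predicate, a listen socket's count is never touched and an established socket's count returns to its starting value.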
     
  • Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
    the determination of whether the socket is reference counted in the case
    where it is prefetched by earlier logic such as early_demux.

    Signed-off-by: Joe Stringer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz

    Joe Stringer
     
  • Add support for TPROXY via a new bpf helper, bpf_sk_assign().

    This helper requires the BPF program to discover the socket via a call
    to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
    helper takes its own reference to the socket in addition to any existing
    reference that may or may not currently be obtained for the duration of
    BPF processing. For the destination socket to receive the traffic, the
    traffic must be routed towards that socket via local route. The
    simplest example route is below, but in practice you may want to route
    traffic more narrowly (e.g. by CIDR):

    $ ip route add local default dev lo

    This patch avoids trying to introduce an extra bit into the skb->sk, as
    that would require more invasive changes to all code interacting with
    the socket to ensure that the bit is handled correctly, such as all
    error-handling cases along the path from the helper in BPF through to
    the orphan path in the input. Instead, we opt to use the destructor
    variable to switch on the prefetch of the socket.

    Signed-off-by: Joe Stringer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz

    Joe Stringer
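
    The reference flow and the destructor trick described in this entry can be modelled as follows. This is a Python sketch of the bookkeeping only, not BPF or kernel code: `Skb`, `Sock`, `sk_lookup`, `sk_release` and `steal_sock` are simplified stand-ins, and the lookup table keyed by an address string is an assumption made for the example.

```python
class Sock:
    def __init__(self):
        self.refcnt = 1  # reference held by the socket table


class Skb:
    def __init__(self):
        self.sk = None
        self.destructor = None


def sock_pfree(skb):
    """Destructor that doubles as the 'socket was prefetched' marker."""
    skb.sk.refcnt -= 1


def sk_lookup(table, key):
    """Stand-in for a socket lookup: returns a referenced socket or None."""
    sk = table.get(key)
    if sk is not None:
        sk.refcnt += 1
    return sk


def sk_release(sk):
    sk.refcnt -= 1


def sk_assign(skb, sk):
    """The helper takes its own reference, independent of the lookup's."""
    sk.refcnt += 1
    skb.sk = sk
    skb.destructor = sock_pfree


def skb_sk_is_prefetched(skb):
    # No extra bit in skb.sk: the destructor identity carries the flag.
    return skb.destructor is sock_pfree


def steal_sock(skb):
    """Detach the socket and report whether it had been prefetched."""
    sk, prefetched = skb.sk, skb_sk_is_prefetched(skb)
    skb.sk = skb.destructor = None
    return sk, prefetched
```

    A program in this model looks up the socket, assigns it, then releases its lookup reference; the receive path later steals the socket and learns from the destructor that it was prefetched.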
     
  • Pull documentation updates from Jonathan Corbet:
    "This has been a busy cycle for documentation work.

    Highlights include:

    - Lots of RST conversion work by Mauro, Daniel Almeida, and others.
    Maybe someday we'll get to the end of this stuff...maybe...

    - Some organizational work to bring some order to the core-api
    manual.

    - Various new docs and additions to the existing documentation.

    - Typo fixes, warning fixes, ..."

    * tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits)
    Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT
    MAINTAINERS: adjust to filesystem doc ReST conversion
    docs: deprecated.rst: Add BUG()-family
    doc: zh_CN: add translation for virtiofs
    doc: zh_CN: index files in filesystems subdirectory
    docs: locking: Drop :c:func: throughout
    docs: locking: Add 'need' to hardirq section
    docs: conf.py: avoid thousands of duplicate label warning on Sphinx
    docs: prevent warnings due to autosectionlabel
    docs: fix reference to core-api/namespaces.rst
    docs: fix pointers to io-mapping.rst and io_ordering.rst files
    Documentation: Better document the softlockup_panic sysctl
    docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref
    docs: perf: imx-ddr.rst: get rid of a warning
    docs: filesystems: fuse.rst: supress a Sphinx warning
    docs: translations: it: avoid duplicate refs at programming-language.rst
    docs: driver.rst: supress two ReSt warnings
    docs: trace: events.rst: convert some new stuff to ReST format
    Documentation: Add io_ordering.rst to driver-api manual
    Documentation: Add io-mapping.rst to driver-api manual
    ...

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "Here are the io_uring changes for this merge window. Light on new
    features this time around (just splice + buffer selection), lots of
    cleanups, fixes, and improvements to existing support. In particular,
    this contains:

    - Cleanup fixed file update handling for stack fallback (Hillf)

    - Re-work of how pollable async IO is handled, we no longer require
    thread offload to handle that. Instead we rely on using poll to drive
    this, with task_work execution.

    - In conjunction with the above, allow expendable buffer selection,
    so that poll+recv (for example) no longer has to be a split
    operation.

    - Make sure we honor RLIMIT_FSIZE for buffered writes

    - Add support for splice (Pavel)

    - Linked work inheritance fixes and optimizations (Pavel)

    - Async work fixes and cleanups (Pavel)

    - Improve io-wq locking (Pavel)

    - Hashed link write improvements (Pavel)

    - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

    * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
    io_uring: cleanup io_alloc_async_ctx()
    io_uring: fix missing 'return' in comment
    io-wq: handle hashed writes in chains
    io-uring: drop 'free_pfile' in struct io_file_put
    io-uring: drop completion when removing file
    io_uring: Fix ->data corruption on re-enqueue
    io-wq: close cancel gap for hashed linked work
    io_uring: make spdxcheck.py happy
    io_uring: honor original task RLIMIT_FSIZE
    io-wq: hash dependent work
    io-wq: split hashing and enqueueing
    io-wq: don't resched if there is no work
    io-wq: remove duplicated cancel code
    io_uring: fix truncated async read/readv and write/writev retry
    io_uring: dual license io_uring.h uapi header
    io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
    io_uring: Fix unused function warnings
    io_uring: add end-of-bits marker and build time verify it
    io_uring: provide means of removing buffers
    io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
    ...

    Linus Torvalds