02 Dec, 2019

5 commits

  • Pull networking fixes from David Miller:

    1) Fix several scatter gather list issues in kTLS code, from Jakub
    Kicinski.

    2) macb driver device remove has to kill the hresp_err_tasklet. From
    Chuhong Yuan.

    3) Several memory leak and reference count bug fixes in tipc, from Tung
    Nguyen.

    4) Fix mlx5 build error w/o ipv6, from Yue Haibing.

    5) Fix jumbo frame and other regressions in r8169, from Heiner
    Kallweit.

    6) Undo some BUG_ON()'s and replace them with WARN_ON_ONCE and proper
    error propagation/handling. From Paolo Abeni.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (24 commits)
    openvswitch: remove another BUG_ON()
    openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info()
    net: phy: realtek: fix using paged operations with RTL8105e / RTL8208
    r8169: fix resume on cable plug-in
    r8169: fix jumbo configuration for RTL8168evl
    net: emulex: benet: indent a Kconfig depends continuation line
    selftests: forwarding: fix race between packet receive and tc check
    net: sched: fix `tc -s class show` no bstats on class with nolock subqueues
    net: ethernet: ti: ale: ensure vlan/mdb deleted when no members
    net/mlx5e: Fix build error without IPV6
    selftests: pmtu: use -oneline for ip route list cache
    tipc: fix duplicate SYN messages under link congestion
    tipc: fix wrong timeout input for tipc_wait_for_cond()
    tipc: fix wrong socket reference counter after tipc_sk_timeout() returns
    tipc: fix potential memory leak in __tipc_sendmsg()
    net: macb: add missed tasklet_kill
    selftests: bpf: correct perror strings
    selftests: bpf: test_sockmap: handle file creation failures gracefully
    net/tls: use sg_next() to walk sg entries
    net/tls: remove the dead inplace_crypto code
    ...

    Linus Torvalds
     
  • Pull y2038 cleanups from Arnd Bergmann:
    "y2038 syscall implementation cleanups

    This is a series of cleanups for the y2038 work, mostly intended for
    namespace cleaning: the kernel defines the traditional time_t, timeval
    and timespec types that often lead to y2038-unsafe code. Even though
    the unsafe usage is mostly gone from the kernel, having the types and
    associated functions around means that we can still grow new users,
    and that we may be missing conversions to safe types that actually
    matter.

    There are still a number of driver specific patches needed to get the
    last users of these types removed, those have been submitted to the
    respective maintainers"

    Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

    * tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
    y2038: alarm: fix half-second cut-off
    y2038: ipc: fix x32 ABI breakage
    y2038: fix typo in powerpc vdso "LOPART"
    y2038: allow disabling time32 system calls
    y2038: itimer: change implementation to timespec64
    y2038: move itimer reset into itimer.c
    y2038: use compat_{get,set}_itimer on alpha
    y2038: itimer: compat handling to itimer.c
    y2038: time: avoid timespec usage in settimeofday()
    y2038: timerfd: Use timespec64 internally
    y2038: elfcore: Use __kernel_old_timeval for process times
    y2038: make ns_to_compat_timeval use __kernel_old_timeval
    y2038: socket: use __kernel_old_timespec instead of timespec
    y2038: socket: remove timespec reference in timestamping
    y2038: syscalls: change remaining timeval to __kernel_old_timeval
    y2038: rusage: use __kernel_old_timeval
    y2038: uapi: change __kernel_time_t to __kernel_old_time_t
    y2038: stat: avoid 'time_t' in 'struct stat'
    y2038: ipc: remove __kernel_time_t reference from headers
    y2038: vdso: powerpc: avoid timespec references
    ...

    Linus Torvalds
     
  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     
  • If we can't build the flow del notification, we can simply delete
    the flow, no need to crash the kernel. Still keep a WARN_ON to
    preserve debuggability.

    Note: the BUG_ON() predates the Fixes tag, but this change
    can be applied only after the mentioned commit.

    v1 -> v2:
    - do not leak an skb on error

    Fixes: aed067783e50 ("openvswitch: Minimize ovs_flow_cmd_del critical section.")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • All the callers of ovs_flow_cmd_build_info() already deal with
    error return code correctly, so we can handle the error condition
    in a more graceful way. Still dump a warning to preserve
    debuggability.

    v1 -> v2:
    - clarify the commit message
    - clean the skb and report the error (DaveM)

    Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

01 Dec, 2019

2 commits

  • Pull Hyper-V updates from Sasha Levin:

    - support for new VMBus protocols (Andrea Parri)

    - hibernation support (Dexuan Cui)

    - latency testing framework (Branden Bonaby)

    - decoupling Hyper-V page size from guest page size (Himadri Pandya)

    * tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (22 commits)
    Drivers: hv: vmbus: Fix crash handler reset of Hyper-V synic
    drivers/hv: Replace binary semaphore with mutex
    drivers: iommu: hyperv: Make HYPERV_IOMMU only available on x86
    HID: hyperv: Add the support of hibernation
    hv_balloon: Add the support of hibernation
    x86/hyperv: Implement hv_is_hibernation_supported()
    Drivers: hv: balloon: Remove dependencies on guest page size
    Drivers: hv: vmbus: Remove dependencies on guest page size
    x86: hv: Add function to allocate zeroed page for Hyper-V
    Drivers: hv: util: Specify ring buffer size using Hyper-V page size
    Drivers: hv: Specify receive buffer size using Hyper-V page size
    tools: hv: add vmbus testing tool
    drivers: hv: vmbus: Introduce latency testing
    video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver
    video: hyperv: hyperv_fb: Obtain screen resolution from Hyper-V host
    hv_netvsc: Add the support of hibernation
    hv_sock: Add the support of hibernation
    video: hyperv_fb: Add the support of hibernation
    scsi: storvsc: Add the support of hibernation
    Drivers: hv: vmbus: Add module parameter to cap the VMBus version
    ...

    Linus Torvalds
     
  • When a classful qdisc's child qdisc has set the flag
    TCQ_F_CPUSTATS (pfifo_fast for example), the child qdisc's
    cpu_bstats should be passed to gnet_stats_copy_basic(),
    but many classful qdiscs didn't do that. As a result,
    `tc -s class show dev DEV` always returns 0 for bytes and
    packets in this case.

    Pass the child qdisc's cpu_bstats to gnet_stats_copy_basic()
    to fix this issue.

    The qstats also has this problem, but it has been fixed
    in 5dd431b6b9 ("net: sched: introduce and use qstats read...")
    and bstats still remains buggy.

    Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
    Signed-off-by: Dust Li
    Signed-off-by: Tony Lu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dust Li
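
    The bstats problem described above can be sketched with a small
    userspace model. This is not kernel code: BasicStats, ChildQdisc and
    copy_basic() are hypothetical names, and copy_basic() only mimics the
    idea that the stats copy helper sums per-CPU counters when they are
    supplied and otherwise falls back to the single counter.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicStats:
    bytes: int = 0
    packets: int = 0


@dataclass
class ChildQdisc:
    # Model of a child qdisc with per-CPU stats enabled: traffic is counted
    # in cpu_bstats, so the plain bstats counter stays at zero.
    cpu_bstats: List[BasicStats] = field(default_factory=list)
    bstats: BasicStats = field(default_factory=BasicStats)


def copy_basic(cpu_bstats: Optional[List[BasicStats]], bstats: BasicStats) -> BasicStats:
    """Hypothetical stand-in for the stats copy: prefer per-CPU counters."""
    if cpu_bstats:
        return BasicStats(
            bytes=sum(c.bytes for c in cpu_bstats),
            packets=sum(c.packets for c in cpu_bstats),
        )
    return BasicStats(bstats.bytes, bstats.packets)


child = ChildQdisc(cpu_bstats=[BasicStats(1500, 1), BasicStats(3000, 2)])

# Before the fix: the class dump passes no per-CPU counters for the child.
before_fix = copy_basic(None, child.bstats)
# After the fix: the child's per-CPU counters are passed along.
after_fix = copy_basic(child.cpu_bstats, child.bstats)

print(before_fix, after_fix)
```

    In this model the first call reports zero bytes and packets, which is
    the symptom seen with `tc -s class show`, while the second call reports
    the summed per-CPU values.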
     

29 Nov, 2019

10 commits

  • Scenario:
    1. A client socket initiates a SYN message to a listening socket.
    2. The send link is congested, so the SYN message is put in the
    send link and a wakeup message is put in the wakeup queue.
    3. The congestion situation is abated, and the wakeup message is
    pulled out of the wakeup queue. Function tipc_sk_push_backlog()
    is called to send out messages delayed by Nagle. However,
    the client socket is still in CONNECTING state. So, it sends
    the SYN message in the socket write queue to the listening socket
    again.
    4. The listening socket receives the first SYN message and creates
    first server socket. The client socket receives ACK- and establishes
    a connection to the first server socket. The client socket closes
    its connection with the first server socket.
    5. The listening socket receives the second SYN message and creates
    second server socket. The second server socket sends ACK- to the
    client socket, but it has been closed. It results in connection
    reset error when reading from the server socket in user space.

    Solution: return from function tipc_sk_push_backlog() immediately
    if there is pending SYN message in the socket write queue.

    Fixes: c0bceb97db9e ("tipc: add smart nagle feature")
    Signed-off-by: Tung Nguyen
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     
  • In function __tipc_shutdown(), the timeout value passed to
    tipc_wait_for_cond() is in milliseconds, not jiffies.

    This commit fixes it by converting that value from milliseconds
    to jiffies.

    Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Signed-off-by: Tung Nguyen
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
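
    To illustrate why the unit matters, here is a minimal userspace
    sketch of millisecond-to-jiffies conversion. It assumes HZ = 250 and
    only models the case where HZ divides 1000 evenly; timeout_ms and
    jiffies_to_msecs() as written here are illustrative, not taken from
    the tipc code.

```python
MSEC_PER_SEC = 1000
HZ = 250  # assumption: timer ticks per second (CONFIG_HZ on a real kernel)


def msecs_to_jiffies(msecs: int, hz: int = HZ) -> int:
    """Convert milliseconds to ticks, rounding up to a whole tick."""
    if MSEC_PER_SEC % hz != 0:
        raise ValueError("this sketch only handles HZ values that divide 1000")
    ms_per_tick = MSEC_PER_SEC // hz
    return (msecs + ms_per_tick - 1) // ms_per_tick


def jiffies_to_msecs(jiffies: int, hz: int = HZ) -> int:
    """Convert ticks back to milliseconds."""
    return jiffies * (MSEC_PER_SEC // hz)


timeout_ms = 10_000  # hypothetical timeout, in milliseconds

# Buggy pattern: the millisecond value is interpreted as a jiffies count.
buggy_wait_ms = jiffies_to_msecs(timeout_ms)
# Fixed pattern: convert to jiffies before waiting.
fixed_wait_ms = jiffies_to_msecs(msecs_to_jiffies(timeout_ms))

print(buggy_wait_ms, fixed_wait_ms)
```

    Under these assumptions the unconverted value waits four times longer
    than intended; the size of the error depends on HZ.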
     
  • When tipc_sk_timeout() is executed but user space is grabbing
    ownership, this function rearms itself and returns. However, the
    socket reference counter is not reduced. This causes potential
    unexpected behavior.

    This commit fixes it by calling sock_put() before tipc_sk_timeout()
    returns in the above-mentioned case.

    Fixes: afe8792fec69 ("tipc: refactor function tipc_sk_timeout()")
    Signed-off-by: Tung Nguyen
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
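
    The reference imbalance can be sketched with a toy counter. This is
    a simplified model, assuming that each arming of the timer takes one
    reference on the socket and that the timeout handler is expected to
    drop the reference belonging to the expiry it is handling. Sock,
    arm_timer() and the drop_ref_on_rearm flag are hypothetical names.

```python
class Sock:
    def __init__(self) -> None:
        self.refcnt = 1  # reference held by the socket's owner

    def hold(self) -> None:
        self.refcnt += 1

    def put(self) -> None:
        self.refcnt -= 1


def arm_timer(sk: Sock) -> None:
    # Assumption: arming the timer pins the socket with one reference.
    sk.hold()


def timeout(sk: Sock, owned_by_user: bool, drop_ref_on_rearm: bool) -> None:
    """Model of the timeout handler's early-return path."""
    if owned_by_user:
        arm_timer(sk)  # try again later
        if drop_ref_on_rearm:
            sk.put()  # the fix: release the reference of this expiry
        return
    # Normal path: handle the timeout, then release the timer's reference.
    sk.put()


def run(drop_ref_on_rearm: bool) -> int:
    sk = Sock()
    arm_timer(sk)
    for _ in range(3):  # user space holds the socket on three expiries
        timeout(sk, owned_by_user=True, drop_ref_on_rearm=drop_ref_on_rearm)
    timeout(sk, owned_by_user=False, drop_ref_on_rearm=drop_ref_on_rearm)
    return sk.refcnt


print(run(False), run(True))
```

    Without the extra put, every rearm leaves one reference behind, so
    the count never returns to the owner's single reference.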
     
  • When initiating a connection message to a server side, the connection
    message is cloned and added to the socket write queue. However, if the
    cloning fails, only the socket write queue is purged. This causes a
    memory leak because the original connection message is not freed.

    This commit fixes it by purging the list of connection messages when
    it cannot be cloned.

    Fixes: 6787927475e5 ("tipc: buffer overflow handling in listener socket")
    Reported-by: Hoang Le
    Signed-off-by: Tung Nguyen
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     
  • Partially sent record cleanup path increments an SG entry
    directly instead of using sg_next(). This should not be a
    problem today, as encrypted messages should be always
    allocated as arrays. But given this is a cleanup path, it would
    be easy to miss if this ever changed. Use sg_next(), and
    simplify the code.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
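
    The difference between incrementing an SG entry and calling sg_next()
    can be sketched with a toy model of chained scatterlists. Entry, the
    tuple-based position, and both walk functions are hypothetical; only
    the idea that a chain entry redirects the walk to another array is
    taken from how chained scatterlists work.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    data: Optional[str] = None              # label of a data entry
    chain: Optional[List["Entry"]] = None   # set on a chain entry: next array
    is_last: bool = False                   # end-of-list mark


Pos = Tuple[List[Entry], int]


def sg_next(pos: Pos) -> Optional[Pos]:
    """Model of sg_next(): stop at the end mark, follow chain entries."""
    arr, i = pos
    if arr[i].is_last:
        return None
    i += 1
    if arr[i].chain is not None:
        return (arr[i].chain, 0)
    return (arr, i)


def walk_with_sg_next(first: List[Entry]) -> List[Optional[str]]:
    out, pos = [], (first, 0)
    while pos is not None:
        arr, i = pos
        out.append(arr[i].data)
        pos = sg_next(pos)
    return out


def walk_by_increment(first: List[Entry]) -> List[Optional[str]]:
    # Treats the list as one flat array and never follows the chain entry.
    return [entry.data for entry in first]


second = [Entry("c"), Entry("d", is_last=True)]
first = [Entry("a"), Entry("b"), Entry(chain=second)]

print(walk_with_sg_next(first))
print(walk_by_increment(first))
```

    With a single flat array both walks agree, which matches the commit's
    remark that the direct increment is not a problem today.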
     
  • Looks like when BPF support was added by commit d3b18ad31f93
    ("tls: add bpf support to sk_msg handling") and
    commit d829e9c4112b ("tls: convert to generic sk_msg interface")
    it broke/removed the support for in-place crypto as added by
    commit 4e6d47206c32 ("tls: Add support for inplace records
    encryption").

    The inplace_crypto member of struct tls_rec is dead, inited
    to zero, and sometimes set to zero again. It used to be
    set to 1 when record was allocated, but the skmsg code doesn't
    seem to have been written with the idea of in-place crypto
    in mind.

    Since non-trivial effort is required to bring the feature back
    and we don't really have the HW to measure the benefit, just
    remove the leftover support for now to avoid confusing readers.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • TLS 1.3 started using the entry at the end of the SG array
    for chaining-in the single byte content type entry. This mostly
    works:

    [ E E E E E E . . ]
      ^           ^
       start       end

                     E < content type
                   /
    [ E E E E E E C . ]
      ^           ^
       start       end

    (Where E denotes a populated SG entry; C denotes a chaining entry.)

    If the array is full, however, the end will point to the start:

    [ E E E E E E E E ]
      ^
       start
       end

    And we end up overwriting the start:

        E < content type
       /
    [ C E E E E E E E ]
      ^
       start
       end

    The sg array is supposed to be a circular buffer with start and
    end markers pointing anywhere. In case where start > end
    (i.e. the circular buffer has "wrapped") there is an extra entry
    reserved at the end to chain the two halves together.

    [ E E E E E E . . l ]

    (Where l is the reserved entry for "looping" back to front.)

    As suggested by John, let's reserve another entry for chaining
    SG entries after the main circular buffer. Note that this entry
    has to be pointed to by the end entry so its position is not fixed.

    Examples of full messages:

    [ E E E E E E E E . l ]
      ^               ^
       start           end


    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
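
    The wrap-around problem and the extra reserved slot can be sketched
    with a toy ring. This is a userspace model under stated assumptions:
    MAX_FRAGS = 8 matches the diagrams above, chain_content_type() is a
    hypothetical helper, and the "fix" is modelled simply as giving the
    ring one more index than the maximum number of populated entries.

```python
from typing import List, Tuple

MAX_FRAGS = 8  # assumption: number of populated entries in the diagrams


def chain_content_type(ring_ids: int, start: int, used: int) -> Tuple[List[str], int]:
    """Populate `used` entries from `start`, then write a chain entry at `end`.

    Returns the slot contents and the index of `end`.
    """
    slots = ["."] * ring_ids
    for i in range(used):
        slots[(start + i) % ring_ids] = "E"
    end = (start + used) % ring_ids
    slots[end] = "C"  # chain in the one-byte content type entry
    return slots, end


# Ring with exactly MAX_FRAGS indexes: a full array makes end == start.
old_slots, old_end = chain_content_type(MAX_FRAGS, start=0, used=MAX_FRAGS)
# Ring with one extra index: end never lands on a populated entry.
new_slots, new_end = chain_content_type(MAX_FRAGS + 1, start=0, used=MAX_FRAGS)

print(" ".join(old_slots), old_end)
print(" ".join(new_slots), new_end)
```

    In the first case the chain entry replaces the first data entry; in
    the second all eight data entries survive and the chain entry sits in
    the spare slot, wherever the buffer happens to end.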
     
  • When tls_do_encryption() fails the SG lists are left with the
    SG_END and SG_CHAIN marks in place. One could hope that once
    encryption fails we will never see the record again, but that
    is in fact not true. Commit d3b18ad31f93 ("tls: add bpf support
    to sk_msg handling") added special handling to ENOMEM and ENOSPC
    errors which mean we may see the same record re-submitted.

    As suggested by John, free the record; the BPF code is already
    doing just that.

    Reported-by: syzbot+df0d4ec12332661dd1f9@syzkaller.appspotmail.com
    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • bpf_exec_tx_verdict() may free the record if tls_push_record()
    fails, or if the entire record got consumed by BPF. Re-check
    ctx->open_rec before touching the data.

    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Pull more io_uring updates from Jens Axboe:
    "As mentioned in the first pull request, there was a later batch as
    well. This contains fixes to the stuff that already went in, cleanups,
    and a few later additions. In particular, this contains:

    - Cleanups/fixes/unification of the submission and completion path
    (Pavel,me)

    - Linked timeouts improvements (Pavel,me)

    - Error path fixes (me)

    - Fix lookup window where cancellations wouldn't work (me)

    - Improve DRAIN support (Pavel)

    - Fix backlog flushing -EBUSY on submit (me)

    - Add support for connect(2) (me)

    - Fix for non-iter based fixed IO (Pavel)

    - creds inheritance for async workers (me)

    - Disable cmsg/ancillary data for sendmsg/recvmsg (me)

    - Shrink io_kiocb to 3 cachelines (me)

    - NUMA fix for io-wq (Jann)"

    * tag 'for-5.5/io_uring-post-20191128' of git://git.kernel.dk/linux-block: (42 commits)
    io_uring: make poll->wait dynamically allocated
    io-wq: shrink io_wq_work a bit
    io-wq: fix handling of NUMA node IDs
    io_uring: use kzalloc instead of kcalloc for single-element allocations
    io_uring: cleanup io_import_fixed()
    io_uring: inline struct sqe_submit
    io_uring: store timeout's sqe->off in proper place
    net: disallow ancillary data for __sys_{send,recv}msg_file()
    net: separate out the msghdr copy from ___sys_{send,recv}msg()
    io_uring: remove superfluous check for sqe->off in io_accept()
    io_uring: async workers should inherit the user creds
    io-wq: have io_wq_create() take a 'data' argument
    io_uring: fix dead-hung for non-iter fixed rw
    io_uring: add support for IORING_OP_CONNECT
    net: add __sys_connect_file() helper
    io_uring: only return -EBUSY for submit on non-flushed backlog
    io_uring: only !null ptr to io_issue_sqe()
    io_uring: simplify io_req_link_next()
    io_uring: pass only !null to io_req_find_next()
    io_uring: remove io_free_req_find_next()
    ...

    Linus Torvalds
     

28 Nov, 2019

4 commits

  • Pull networking fixes from David Miller:
    "This is mostly to fix the iwlwifi regression:

    1) Flush GRO state properly in iwlwifi driver, from Alexander Lobakin.

    2) Validate TIPC link name with proper length macro, from John
    Rutherford.

    3) Fix completion init and device query timeouts in ibmvnic, from
    Thomas Falcon.

    4) Fix SKB size calculation for netlink messages in psample, from
    Nikolay Aleksandrov.

    5) Similar kind of fix for OVS flow dumps, from Paolo Abeni.

    6) Handle queue allocation failure unwind properly in gve driver, we
    could try to release pages we didn't allocate. From Jeroen de
    Borst.

    7) Serialize TX queue SKB list accesses properly in mscc ocelot
    driver. From Yangbo Lu"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
    net: usb: aqc111: Use the correct style for SPDX License Identifier
    net: phy: Use the correct style for SPDX License Identifier
    net: wireless: intel: iwlwifi: fix GRO_NORMAL packet stalling
    net: mscc: ocelot: use skb queue instead of skbs list
    net: mscc: ocelot: avoid incorrect consuming in skbs list
    gve: Fix the queue page list allocated pages count
    net: inet_is_local_reserved_port() port arg should be unsigned short
    openvswitch: fix flow command message size
    net: phy: dp83869: Fix return paths to return proper values
    net: psample: fix skb_over_panic
    net: usbnet: Fix -Wcast-function-type
    net: hso: Fix -Wcast-function-type
    net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port)
    ibmvnic: Serialize device queries
    ibmvnic: Bound waits for device queries
    ibmvnic: Terminate waiting device threads after loss of service
    ibmvnic: Fix completion structure initialization
    net-sctp: replace some sock_net(sk) with just 'net'
    net: Fix a documentation bug wrt. ip_unprivileged_port_start
    tipc: fix link name length check

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.5-rc1

    There's a few minor cleanups and fixes in here, but the majority of
    the patches in here fall into two buckets:

    - debugfs api cleanups and fixes

    - driver core device link support for boot dependency issues

    The debugfs api cleanups are working to slowly refactor the debugfs
    apis so that it is even harder to use incorrectly. That work has been
    happening for the past few kernel releases and will continue over
    time, it's a long-term project/goal

    The driver core device link support missed 5.4 by just a bit, so it's
    been sitting and baking for many months now. It's from Saravana Kannan
    to help resolve the problems that DT-based systems have at boot time
    with dependency graphs and kernel modules. Turns out that no one has
    actually tried to build a generic arm64 kernel with loads of modules
    and have it "just work" for a variety of platforms (like a distro
    kernel). The big problem turned out to be a lack of dependency
    information between different areas of DT entries, and the work here
    resolves that problem and now allows devices to boot properly, and
    quicker than a monolithic kernel.

    All of these patches have been in linux-next for a long time with no
    reported issues"

    * tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (68 commits)
    tracing: Remove unnecessary DEBUG_FS dependency
    of: property: Add device link support for interrupt-parent, dmas and -gpio(s)
    debugfs: Fix !DEBUG_FS debugfs_create_automount
    of: property: Add device link support for "iommu-map"
    of: property: Fix the semantics of of_is_ancestor_of()
    i2c: of: Populate fwnode in of_i2c_get_board_info()
    drivers: base: Fix Kconfig indentation
    firmware_loader: Fix labels with comma for builtin firmware
    driver core: Allow device link operations inside sync_state()
    driver core: platform: Declare ret variable only once
    cpu-topology: declare parse_acpi_topology in
    crypto: hisilicon: no need to check return value of debugfs_create functions
    driver core: platform: use the correct callback type for bus_find_device
    firmware_class: make firmware caching configurable
    driver core: Clarify documentation for fwnode_operations.add_links()
    mailbox: tegra: Fix superfluous IRQ error message
    net: caif: Fix debugfs on 64-bit platforms
    mac80211: Use debugfs_create_xul() helper
    media: c8sectpfe: no need to check return value of debugfs_create functions
    of: property: Add device link support for iommus, mboxes and io-channels
    ...

    Linus Torvalds
     
  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char/misc and other driver patches for 5.5-rc1

    Loads of different things in here, this feels like the catch-all of
    driver subsystems these days. Full details are in the shortlog, but
    nothing major overall, just lots of driver updates and additions.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (198 commits)
    char: Fix Kconfig indentation, continued
    habanalabs: add more protection of device during reset
    habanalabs: flush EQ workers in hard reset
    habanalabs: make the reset code more consistent
    habanalabs: expose reset counters via existing INFO IOCTL
    habanalabs: make code more concise
    habanalabs: use defines for F/W files
    habanalabs: remove prints on successful device initialization
    habanalabs: remove unnecessary checks
    habanalabs: invalidate MMU cache only once
    habanalabs: skip VA block list update in reset flow
    habanalabs: optimize MMU unmap
    habanalabs: prevent read/write from/to the device during hard reset
    habanalabs: split MMU properties to PCI/DRAM
    habanalabs: re-factor MMU masks and documentation
    habanalabs: type specific MMU cache invalidation
    habanalabs: re-factor memory module code
    habanalabs: export uapi defines to user-space
    habanalabs: don't print error when queues are full
    habanalabs: increase max jobs number to 512
    ...

    Linus Torvalds
     
  • Pull rdma updates from Jason Gunthorpe:
    "Again another fairly quiet cycle with few notable core code changes
    and the usual variety of driver bug fixes and small improvements.

    - Various driver updates and bug fixes for siw, bnxt_re, hns, qedr,
    iw_cxgb4, vmw_pvrdma, mlx5

    - Improvements in SRPT from working with iWarp

    - SRIOV VF support for bnxt_re

    - Skeleton kernel-doc files for drivers/infiniband

    - User visible counters for events related to ODP

    - Common code for tracking of mmap lifetimes so that drivers can link
    HW object lifetime to a VMA

    - ODP bug fixes and rework

    - RDMA READ support for efa

    - Removal of the very old cxgb3 driver"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (168 commits)
    RDMA/hns: Delete unnecessary callback functions for cq
    RDMA/hns: Rename the functions used inside creating cq
    RDMA/hns: Redefine the member of hns_roce_cq struct
    RDMA/hns: Redefine interfaces used in creating cq
    RDMA/efa: Expose RDMA read related attributes
    RDMA/efa: Support remote read access in MR registration
    RDMA/efa: Store network attributes in device attributes
    IB/hfi1: remove redundant assignment to variable ret
    RDMA/bnxt_re: Fix missing le16_to_cpu
    RDMA/bnxt_re: Fix stat push into dma buffer on gen p5 devices
    RDMA/bnxt_re: Fix chip number validation Broadcom's Gen P5 series
    RDMA/bnxt_re: Fix Kconfig indentation
    IB/mlx5: Implement callbacks for getting VFs GUID attributes
    IB/ipoib: Add ndo operation for getting VFs GUID attributes
    IB/core: Add interfaces to get VF node and port GUIDs
    net/core: Add support for getting VF GUIDs
    RDMA/qedr: Fix null-pointer dereference when calling rdma_user_mmap_get_offset
    RDMA/cm: Use refcount_t type for refcount variable
    IB/mlx5: Support extended number of strides for Striding RQ
    IB/mlx4: Update HW GID table while adding vlan GID
    ...

    Linus Torvalds
     

27 Nov, 2019

9 commits

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Dynamic tick (nohz) updates, perhaps most notably changes to force
    the tick on when needed due to lengthy in-kernel execution on CPUs
    on which RCU is waiting.

    - Linux-kernel memory consistency model updates.

    - Replace rcu_swap_protected() with rcu_replace_pointer().

    - Torture-test updates.

    - Documentation updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
    security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
    bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
    fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
    drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
    drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
    x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
    rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
    rcu: Suppress levelspread uninitialized messages
    rcu: Fix uninitialized variable in nocb_gp_wait()
    rcu: Update descriptions for rcu_future_grace_period tracepoint
    rcu: Update descriptions for rcu_nocb_wake tracepoint
    rcu: Remove obsolete descriptions for rcu_barrier tracepoint
    rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
    workqueue: Convert for_each_wq to use built-in list check
    rcu: Several rcu_segcblist functions can be static
    rcu: Remove unused function hlist_bl_del_init_rcu()
    Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
    ...

    Linus Torvalds
     
  • When user-space sets the OVS_UFID_F_OMIT_* flags, and the relevant
    flow has no UFID, we can exceed the computed size, as
    ovs_nla_put_identifier() will always dump an OVS_FLOW_ATTR_KEY
    attribute.
    Take the above into account when computing the flow command message
    size.

    Fixes: 74ed7ab9264c ("openvswitch: Add support for unique flow IDs.")
    Reported-by: Qi Jun Ding
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • We need to calculate the skb size correctly otherwise we risk triggering
    skb_over_panic[1]. The issue is that data_len is added to the skb in a
    nl attribute, but we don't account for its header size (nlattr 4 bytes)
    and alignment. We account for it when calculating the total size in
    the > PSAMPLE_MAX_PACKET_SIZE comparison correctly, but not when
    allocating after that. The fix is simple - use nla_total_size() for
    data_len when allocating.

    To reproduce:
    $ tc qdisc add dev eth1 clsact
    $ tc filter add dev eth1 egress matchall action sample rate 1 group 1 trunc 129
    $ mausezahn eth1 -b bcast -a rand -c 1 -p 129
    < skb_over_panic BUG(), tail is 4 bytes past skb->end >
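
    The size arithmetic can be checked with a short userspace sketch of
    netlink attribute sizing, assuming the usual 4-byte attribute header
    and 4-byte alignment. The Python function names mirror the kernel
    helpers but are reimplemented here for illustration only.

```python
NLA_ALIGNTO = 4  # netlink attributes are padded to 4-byte boundaries
NLA_HDRLEN = 4   # size of the attribute header, as noted in the commit message


def nla_align(length: int) -> int:
    """Round up to the next multiple of NLA_ALIGNTO."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_total_size(payload: int) -> int:
    """Space one attribute needs in the message: header + payload + padding."""
    return nla_align(NLA_HDRLEN + payload)


data_len = 129  # the `trunc 129` value from the reproducer

reserved_before_fix = data_len                  # raw payload length only
reserved_after_fix = nla_total_size(data_len)   # header and padding included

print(reserved_before_fix, reserved_after_fix)
```

    For a 129-byte payload the attribute needs 136 bytes, which agrees
    with the put:136 value reported in the trace, so reserving only 129
    bytes for it under-counts by 7.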

    [1] Trace:
    [ 50.459526][ T3480] skbuff: skb_over_panic: text:(____ptrval____) len:196 put:136 head:(____ptrval____) data:(____ptrval____) tail:0xc4 end:0xc0 dev:
    [ 50.474339][ T3480] ------------[ cut here ]------------
    [ 50.481132][ T3480] kernel BUG at net/core/skbuff.c:108!
    [ 50.486059][ T3480] invalid opcode: 0000 [#1] PREEMPT SMP
    [ 50.489463][ T3480] CPU: 3 PID: 3480 Comm: mausezahn Not tainted 5.4.0-rc7 #108
    [ 50.492844][ T3480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
    [ 50.496551][ T3480] RIP: 0010:skb_panic+0x79/0x7b
    [ 50.498261][ T3480] Code: bc 00 00 00 41 57 4c 89 e6 48 c7 c7 90 29 9a 83 4c 8b 8b c0 00 00 00 50 8b 83 b8 00 00 00 50 ff b3 c8 00 00 00 e8 ae ef c0 fe 0b e8 2f df c8 fe 48 8b 55 08 44 89 f6 4c 89 e7 48 c7 c1 a0 22
    [ 50.504111][ T3480] RSP: 0018:ffffc90000447a10 EFLAGS: 00010282
    [ 50.505835][ T3480] RAX: 0000000000000087 RBX: ffff888039317d00 RCX: 0000000000000000
    [ 50.507900][ T3480] RDX: 0000000000000000 RSI: ffffffff812716e1 RDI: 00000000ffffffff
    [ 50.509820][ T3480] RBP: ffffc90000447a60 R08: 0000000000000001 R09: 0000000000000000
    [ 50.511735][ T3480] R10: ffffffff81d4f940 R11: 0000000000000000 R12: ffffffff834a22b0
    [ 50.513494][ T3480] R13: ffffffff82c10433 R14: 0000000000000088 R15: ffffffff838a8084
    [ 50.515222][ T3480] FS: 00007f3536462700(0000) GS:ffff88803eac0000(0000) knlGS:0000000000000000
    [ 50.517135][ T3480] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 50.518583][ T3480] CR2: 0000000000442008 CR3: 000000003b222000 CR4: 00000000000006e0
    [ 50.520723][ T3480] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 50.522709][ T3480] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 50.524450][ T3480] Call Trace:
    [ 50.525214][ T3480] skb_put.cold+0x1b/0x1b
    [ 50.526171][ T3480] psample_sample_packet+0x1d3/0x340
    [ 50.527307][ T3480] tcf_sample_act+0x178/0x250
    [ 50.528339][ T3480] tcf_action_exec+0xb1/0x190
    [ 50.529354][ T3480] mall_classify+0x67/0x90
    [ 50.530332][ T3480] tcf_classify+0x72/0x160
    [ 50.531286][ T3480] __dev_queue_xmit+0x3db/0xd50
    [ 50.532327][ T3480] dev_queue_xmit+0x18/0x20
    [ 50.533299][ T3480] packet_sendmsg+0xee7/0x2090
    [ 50.534331][ T3480] sock_sendmsg+0x54/0x70
    [ 50.535271][ T3480] __sys_sendto+0x148/0x1f0
    [ 50.536252][ T3480] ? tomoyo_file_ioctl+0x23/0x30
    [ 50.537334][ T3480] ? ksys_ioctl+0x5e/0xb0
    [ 50.540068][ T3480] __x64_sys_sendto+0x2a/0x30
    [ 50.542810][ T3480] do_syscall_64+0x73/0x1f0
    [ 50.545383][ T3480] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 50.548477][ T3480] RIP: 0033:0x7f35357d6fb3
    [ 50.551020][ T3480] Code: 48 8b 0d 18 90 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d f9 d3 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 eb f6 ff ff 48 89 04 24
    [ 50.558547][ T3480] RSP: 002b:00007ffe0c7212c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    [ 50.561870][ T3480] RAX: ffffffffffffffda RBX: 0000000001dac010 RCX: 00007f35357d6fb3
    [ 50.565142][ T3480] RDX: 0000000000000082 RSI: 0000000001dac2a2 RDI: 0000000000000003
    [ 50.568469][ T3480] RBP: 00007ffe0c7212f0 R08: 00007ffe0c7212d0 R09: 0000000000000014
    [ 50.571731][ T3480] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000082
    [ 50.574961][ T3480] R13: 0000000001dac2a2 R14: 0000000000000001 R15: 0000000000000003
    [ 50.578170][ T3480] Modules linked in: sch_ingress virtio_net
    [ 50.580976][ T3480] ---[ end trace 61a515626a595af6 ]---

    CC: Yotam Gigi
    CC: Jiri Pirko
    CC: Jamal Hadi Salim
    CC: Simon Horman
    CC: Roopa Prabhu
    Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Only io_uring uses (and added) these, and we want to disallow the
    use of sendmsg/recvmsg for anything but regular data transfers.
    Use the newly added prep helper to split the msghdr copy out from
    the core function, to check for msg_control and msg_controllen
    settings. If either is set, we return -EINVAL.
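
    A minimal user-space sketch of the check described above.
    demo_reject_cmsg() is a hypothetical name and this is not the kernel
    implementation; it only shows the rule "any ancillary data means
    -EINVAL" applied to a struct msghdr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Return 0 for a plain data transfer, -EINVAL if the caller supplied
 * a control buffer or a non-zero control length. */
int demo_reject_cmsg(const struct msghdr *msg)
{
	assert(msg != NULL);

	if (msg->msg_control != NULL || msg->msg_controllen != 0)
		return -EINVAL;
	return 0;
}
```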

    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is in preparation for enabling the io_uring helpers for sendmsg
    and recvmsg to first copy the header for validation before continuing
    with the operation.

    There should be no functional changes in this patch.

    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Note that the sysctl write accessor functions guarantee that:
    net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
    invariant is maintained, and as such the max() in selinux hooks is actually spurious.

    ie. even though
    if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
    per logic is the same as
    if ((snum < inet_prot_sock(sock_net(sk)) || snum < low) || snum > high) {
    it is actually functionally equivalent to:
    if (snum < low || snum > high) {
    which is equivalent to:
    if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
    even though the first clause is spurious.

    But we want to hold on to it in case we ever want to change what
    inet_port_requires_bind_service() means (for example by changing
    it from a, by default, [0..1024) range to some sort of set).
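
    The equivalences above can be checked with a small user-space sketch.
    The demo_* functions are hypothetical stand-ins for the three forms of
    the condition, with prot_sock playing the role of
    inet_prot_sock(sock_net(sk)). demo_forms_agree() brute-forces every
    port number under the assumption that prot_sock <= low, which is the
    invariant the sysctl accessors are said to maintain.

```c
#include <assert.h>
#include <stdbool.h>

static int demo_max(int a, int b)
{
	return a > b ? a : b;
}

/* Form used in the selinux hook: compare against max(prot_sock, low). */
bool demo_check_max(int snum, int prot_sock, int low, int high)
{
	return snum < demo_max(prot_sock, low) || snum > high;
}

/* Same logic with max() expanded into two comparisons. */
bool demo_check_split(int snum, int prot_sock, int low, int high)
{
	return snum < prot_sock || snum < low || snum > high;
}

/* Reduced form that is valid only while prot_sock <= low holds. */
bool demo_check_range(int snum, int low, int high)
{
	return snum < low || snum > high;
}

/* Returns true if all three forms agree for every 16-bit port number. */
bool demo_forms_agree(int prot_sock, int low, int high)
{
	assert(prot_sock <= low);

	for (int snum = 0; snum <= 65535; snum++) {
		bool a = demo_check_max(snum, prot_sock, low, high);
		bool b = demo_check_split(snum, prot_sock, low, high);
		bool c = demo_check_range(snum, low, high);

		if (a != b || b != c)
			return false;
	}
	return true;
}
```

    If the invariant is broken (for example prot_sock = 1024 and low = 100),
    the reduced form no longer matches the other two, which is why the
    first clause is only spurious as long as the sysctl guarantee holds.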

    Test: builds, git 'grep inet_prot_sock' finds no other references
    Cc: Eric Dumazet
    Signed-off-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
     
  • It already existed in part of the function, but move it
    to a higher level and use it consistently throughout.

    Safe since sk is never written to.

    Signed-off-by: Maciej Żenczykowski
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
     
  • In commit 4f07b80c9733 ("tipc: check msg->req data len in
    tipc_nl_compat_bearer_disable") the same patch code was copied into
    routines: tipc_nl_compat_bearer_disable(),
    tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats().
    The two link routine occurrences should have been modified to check
    the maximum link name length and not bearer name length.

    Fixes: 4f07b80c9733 ("tipc: check msg->req data len in tipc_nl_compat_bearer_disable")
    Signed-off-by: John Rutherford
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    John Rutherford
     

26 Nov, 2019

7 commits

  • Pull networking updates from David Miller:
    "Another merge window, another pull full of stuff:

    1) Support alternative names for network devices, from Jiri Pirko.

    2) Introduce per-netns netdev notifiers, also from Jiri Pirko.

    3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
    Larsen.

    4) Allow compiling out the TLS TOE code, from Jakub Kicinski.

    5) Add several new tracepoints to the kTLS code, also from Jakub.

    6) Support set channels ethtool callback in ena driver, from Sameeh
    Jubran.

    7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
    SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.

    8) Add XDP support to mvneta driver, from Lorenzo Bianconi.

    9) Lots of netfilter hw offload fixes, cleanups and enhancements,
    from Pablo Neira Ayuso.

    10) PTP support for aquantia chips, from Egor Pomozov.

    11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
    Josh Hunt.

    12) Add smart nagle to tipc, from Jon Maloy.

    13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
    Duvvuru.

    14) Add a flow mask cache to OVS, from Tonghao Zhang.

    15) Add XDP support to ice driver, from Maciej Fijalkowski.

    16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.

    17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.

    18) Support it in stmmac driver too, from Jose Abreu.

    19) Support TIPC encryption and auth, from Tuong Lien.

    20) Introduce BPF trampolines, from Alexei Starovoitov.

    21) Make page_pool API more numa friendly, from Saeed Mahameed.

    22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.

    23) Add UDP segmentation offload to cxgb4, from Rahul Lakkireddy"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
    libbpf: Fix usage of u32 in userspace code
    mm: Implement no-MMU variant of vmalloc_user_node_flags
    slip: Fix use-after-free Read in slip_open
    net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
    macvlan: schedule bc_work even if error
    enetc: add support Credit Based Shaper(CBS) for hardware offload
    net: phy: add helpers phy_(un)lock_mdio_bus
    mdio_bus: don't use managed reset-controller
    ax88179_178a: add ethtool_op_get_ts_info()
    mlxsw: spectrum_router: Fix use of uninitialized adjacency index
    mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
    bpf: Simplify __bpf_arch_text_poke poke type handling
    bpf: Introduce BPF_TRACE_x helper for the tracing tests
    bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
    bpf, testing: Add various tail call test cases
    bpf, x86: Emit patchable direct jump as tail call
    bpf: Constant map key tracking for prog array pokes
    bpf: Add poke dependency tracking for prog array maps
    bpf: Add initial poke descriptor table for jit images
    bpf: Move owner type, jited info into array auxiliary data
    ...

    Linus Torvalds
     
  • Pull crypto updates from Herbert Xu:
    "API:
    - Add library interfaces of certain crypto algorithms for WireGuard
    - Remove the obsolete ablkcipher and blkcipher interfaces
    - Move add_early_randomness() out of rng_mutex

    Algorithms:
    - Add blake2b shash algorithm
    - Add blake2s shash algorithm
    - Add curve25519 kpp algorithm
    - Implement 4 way interleave in arm64/gcm-ce
    - Implement ciphertext stealing in powerpc/spe-xts
    - Add Eric Biggers's scalar accelerated ChaCha code for ARM
    - Add accelerated 32r2 code from Zinc for MIPS
    - Add OpenSSL/CRYPTOGAMS poly1305 implementation for ARM and MIPS

    Drivers:
    - Fix entropy reading failures in ks-sa
    - Add support for sam9x60 in atmel
    - Add crypto accelerator for amlogic GXL
    - Add sun8i-ce Crypto Engine
    - Add sun8i-ss cryptographic offloader
    - Add a host of algorithms to inside-secure
    - Add NPCM RNG driver
    - Add HiSilicon HPRE accelerator
    - Add HiSilicon TRNG driver"

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (285 commits)
    crypto: vmx - Avoid weird build failures
    crypto: lib/chacha20poly1305 - use chacha20_crypt()
    crypto: x86/chacha - only unregister algorithms if registered
    crypto: chacha_generic - remove unnecessary setkey() functions
    crypto: amlogic - enable working on big endian kernel
    crypto: sun8i-ce - enable working on big endian
    crypto: mips/chacha - select CRYPTO_SKCIPHER, not CRYPTO_BLKCIPHER
    hwrng: ks-sa - Enable COMPILE_TEST
    crypto: essiv - remove redundant null pointer check before kfree
    crypto: atmel-aes - Change data type for "lastc" buffer
    crypto: atmel-tdes - Set the IV after {en,de}crypt
    crypto: sun4i-ss - fix big endian issues
    crypto: sun4i-ss - hide the Invalid keylen message
    crypto: sun4i-ss - use crypto_ahash_digestsize
    crypto: sun4i-ss - remove dependency on not 64BIT
    crypto: sun4i-ss - Fix 64-bit size_t warnings on sun4i-ss-hash.c
    MAINTAINERS: Add maintainer for HiSilicon SEC V2 driver
    crypto: hisilicon - add DebugFS for HiSilicon SEC
    Documentation: add DebugFS doc for HiSilicon SEC
    crypto: hisilicon - add SRIOV for HiSilicon SEC
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "There are several notable changes here:

    - Single thread migrating itself has been optimized so that it
    doesn't need threadgroup rwsem anymore.

    - Freezer optimization to avoid unnecessary frozen state changes.

    - cgroup ID unification so that cgroup fs ino is the only unique ID
    used for the cgroup and can be used to directly look up live
    cgroups through filehandle interface on 64bit ino archs. On 32bit
    archs, cgroup fs ino is still the only ID in use but it is only
    unique when combined with gen.

    - selftest and other changes"

    * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
    writeback: fix -Wformat compilation warnings
    docs: cgroup: mm: Fix spelling of "list"
    cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
    cgroup: use cgrp->kn->id as the cgroup ID
    kernfs: use 64bit inos if ino_t is 64bit
    kernfs: implement custom exportfs ops and fid type
    kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
    kernfs: convert kernfs_node->id from union kernfs_node_id to u64
    kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
    kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
    netprio: use css ID instead of cgroup ID
    writeback: use ino_t for inodes in tracepoints
    kernfs: fix ino wrap-around detection
    kselftests: cgroup: Avoid the reuse of fd after it is deallocated
    cgroup: freezer: don't change task and cgroups status unnecessarily
    cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
    cgroup: remove cgroup_enable_task_cg_lists() optimization
    cgroup: pids: use atomic64_t for pids->limit
    selftests: cgroup: Run test_core under interfering stress
    selftests: cgroup: Add task migration tests
    ...

    Linus Torvalds
     
  • This is identical to __sys_connect(), except it takes a struct file
    instead of an fd, and it also allows passing in extra file->f_flags
    flags. The latter is done to support masking in O_NONBLOCK without
    manipulating the original file flags.

    No functional changes in this patch.

    Cc: netdev@vger.kernel.org
    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Merge in networking bug fixes for merge window.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull io_uring updates from Jens Axboe:
    "A lot of stuff has been going on this cycle, with improving the
    support for networked IO (and hence unbounded request completion
    times) being one of the major themes. There's been a set of fixes done
    this week, I'll send those out as well once we're certain we're fully
    happy with them.

    This contains:

    - Unification of the "normal" submit path and the SQPOLL path (Pavel)

    - Support for sparse (and bigger) file sets, and updating of those
    file sets without needing to unregister/register again.

    - Independently sized CQ ring, instead of just making it always 2x
    the SQ ring size. This makes it more flexible for networked
    applications.

    - Support for overflowed CQ ring, never dropping events but providing
    backpressure on submits.

    - Add support for absolute timeouts, not just relative ones.

    - Support for generic cancellations. This divorces io_uring from
    workqueues as well, which additionally gets us one step closer to
    generic async system call support.

    - With cancellations, we can support grabbing the process file table
    as well, just like we do mm context. This allows support for system
    calls that create file descriptors, like accept4() support that's
    built on top of that.

    - Support for io_uring tracing (Dmitrii)

    - Support for linked timeouts. These abort an operation if it isn't
    completed by the time noted in the linked timeout.

    - Speedup tracking of poll requests

    - Various cleanups making the code easier to follow (Jackie, Pavel,
    Bob, YueHaibing, me)

    - Update MAINTAINERS with new io_uring list"

    * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
    io_uring: make POLL_ADD/POLL_REMOVE scale better
    io-wq: remove now redundant struct io_wq_nulls_list
    io_uring: Fix getting file for non-fd opcodes
    io_uring: introduce req_need_defer()
    io_uring: clean up io_uring_cancel_files()
    io-wq: ensure free/busy list browsing see all items
    io-wq: ensure we have a stable view of ->cur_work for cancellations
    io_wq: add get/put_work handlers to io_wq_create()
    io_uring: check for validity of ->rings in teardown
    io_uring: fix potential deadlock in io_poll_wake()
    io_uring: use correct "is IO worker" helper
    io_uring: fix -ENOENT issue with linked timer with short timeout
    io_uring: don't do flush cancel under inflight_lock
    io_uring: flag SQPOLL busy condition to userspace
    io_uring: make ASYNC_CANCEL work with poll and timeout
    io_uring: provide fallback request for OOM situations
    io_uring: convert accept4() -ERESTARTSYS into -EINTR
    io_uring: fix error clear of ->file_table in io_sqe_files_register()
    io_uring: separate the io_free_req and io_free_req_find_next interface
    io_uring: keep io_put_req only responsible for release and put req
    ...

    Linus Torvalds
     
  • In commit 3975b097e577 ("convert stream-like files -> stream_open, even
    if they use noop_llseek") Kirill used a coccinelle script to change
    "nonseekable_open()" to "stream_open()", which changed the trivial cases
    of stream-like file descriptors to the new model with FMODE_STREAM.

    However, the two big cases - sockets and pipes - don't actually have
    that trivial pattern at all, and were thus never converted to
    FMODE_STREAM even though it makes lots of sense to do so.

    That's particularly true when looking forward to the next change:
    getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
    decide whether f_pos updates are needed or not. And if they are, we'll
    always do them atomically.

    This came up because KCSAN (correctly) noted that the non-locked f_pos
    updates are data races: they are clearly benign for the case where we
    don't care, but it would be good to just not have that issue exist at
    all.

    Note that the reason we used FMODE_ATOMIC_POS originally is that only
    doing it for the minimal required case is "safer" in that it's possible
    that the f_pos locking can cause unnecessary serialization across the
    whole write() call. And in the worst case, that kind of serialization
    can cause deadlock issues: think writers that need readers to empty the
    state using the same file descriptor.

    [ Note that the locking is per-file descriptor - because it protects
    "f_pos", which is obviously per-file descriptor - so it only affects
    cases where you literally use the same file descriptor to both read
    and write.

    So a regular pipe that has separate reading and writing file
    descriptors doesn't really have this situation even though it's the
    obvious case of "reader empties what a big writer concurrently fills".

    But we want to mark pipes as being stream-like anyway, because we
    don't want the unnecessary overhead of locking, and because a named
    pipe can be (ab-)used by reading and writing to the same file
    descriptor. ]

    There are likely a lot of other cases that might want FMODE_STREAM, and
    looking for ".llseek = no_llseek" users and other cases that don't have
    an lseek file operation at all and making them use "stream_open()" might
    be a good idea. But pipes and sockets are likely to be the two main
    cases.
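
    The stream-like nature of pipes is visible from user space: a pipe has
    no meaningful file position, so lseek() on it fails with ESPIPE. A
    minimal sketch follows; demo_pipe_lseek_errno() is a hypothetical
    helper for illustration and is not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a pipe, try to seek on its read end, and return the resulting
 * errno (0 if the seek unexpectedly succeeded, -1 if pipe() failed). */
int demo_pipe_lseek_errno(void)
{
	int fds[2];
	int err = 0;

	if (pipe(fds) != 0)
		return -1;
	assert(fds[0] >= 0 && fds[1] >= 0);

	errno = 0;
	if (lseek(fds[0], 0, SEEK_CUR) == (off_t)-1)
		err = errno;

	close(fds[0]);
	close(fds[1]);
	return err;
}
```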

    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Nov, 2019

3 commits

  • Danit Goldberg says:

    ====================
    This series extends RTNETLINK to provide IB port and node GUIDs, which
    were configured for Infiniband VFs.

    The functionality to set VF GUIDs already existed for a long time, and
    here we are adding the missing "get" so that netlink will be symmetric and
    various cloud orchestration tools will be able to manage such VFs more
    naturally.

    The iproute2 was extended too to present those GUIDs.

    - ip link show

    For example:
    - ip link set ib4 vf 0 node_guid 22:44:33:00:33:11:00:33
    - ip link set ib4 vf 0 port_guid 10:21:33:12:00:11:22:10
    - ip link show ib4
    ib4: mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
    link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    vf 0 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff,
    spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off
    ====================

    Based on the mlx5-next branch from
    git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
    dependencies

    * branch 'ib-guids': (35 commits)
    IB/mlx5: Implement callbacks for getting VFs GUID attributes
    IB/ipoib: Add ndo operation for getting VFs GUID attributes
    IB/core: Add interfaces to get VF node and port GUIDs
    net/core: Add support for getting VF GUIDs

    net/mlx5: Add new chain for netfilter flow table offload
    net/mlx5: Refactor creating fast path prio chains
    net/mlx5: Accumulate levels for chains prio namespaces
    net/mlx5: Define fdb tc levels per prio
    net/mlx5: Rename FDB_* tc related defines to FDB_TC_* defines
    net/mlx5: Simplify fdb chain and prio eswitch defines
    IB/mlx5: Load profile according to RoCE enablement state
    IB/mlx5: Rename profile and init methods
    net/mlx5: Handle "enable_roce" devlink param
    net/mlx5: Document flow_steering_mode devlink param
    devlink: Add new "enable_roce" generic device param
    net/mlx5: fix spelling mistake "metdata" -> "metadata"
    net/mlx5: fix kvfree of uninitialized pointer spec
    IB/mlx5: Introduce and use mlx5_core_is_vf()
    net/mlx5: E-switch, Enable metadata on own vport
    net/mlx5: Refactor ingress acl configuration
    ...

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-11-24

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 27 non-merge commits during the last 4 day(s) which contain
    a total of 50 files changed, 2031 insertions(+), 548 deletions(-).

    The main changes are:

    1) Optimize bpf_tail_call() from retpoline-ed indirect jump to direct jump,
    from Daniel.

    2) Support global variables in libbpf, from Andrii.

    3) Cleanup selftests with BPF_TRACE_x() macro, from Martin.

    4) Fix devmap hash, from Toke.

    5) Fix register bounds after 32-bit conditional jumps, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2019-11-24

    Here's one last bluetooth-next pull request for the 5.5 kernel:

    - Fix BDADDR_PROPERTY & INVALID_BDADDR quirk handling
    - Added support for BCM4334B0 and BCM4335A0 controllers
    - A few other smaller fixes related to locking and memory leaks
    ====================

    Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

    Jakub Kicinski