14 Jun, 2020

1 commit

  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
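
    As a rough illustration of what that substitution does to styles (a)
    through (g), here is a minimal Python sketch. It is not part of the
    original commit: the names HELP_RE and convert are hypothetical, and
    the pattern is narrowed from [[:space:]] to spaces and tabs.

```python
import re

# Approximation of the sed expression s/^[[:space:]]*---help---/\thelp/.
# Assumption: only spaces and tabs appear in the leading indentation.
HELP_RE = re.compile(r"^[ \t]*---help---")


def convert(line: str) -> str:
    """Rewrite any indentation + '---help---' to one tab + 'help'."""
    return HELP_RE.sub("\thelp", line)


# The seven indentation styles (a) through (g) listed in the commit message.
variants = [
    "    ---help---",      # a) 4 spaces
    "       ---help---",   # b) 7 spaces
    "        ---help---",  # c) 8 spaces
    " \t---help---",       # d) 1 space + 1 tab
    "\t---help---",        # e) 1 tab
    "\t ---help---",       # f) 1 tab + 1 space
    "\t  ---help---",      # g) 1 tab + 2 spaces
]

converted = [convert(v) for v in variants]
```

    Every entry in converted ends up as one tab followed by 'help', and
    lines that do not start with the old keyword pass through unchanged.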

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

11 Jun, 2020

2 commits

  • Pull sysctl fixes from Al Viro:
    "Fixups to regressions in sysctl series"

    * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sysctl: reject gigantic reads/write to sysctl files
    cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
    trace: fix an incorrect __user annotation on stack_trace_sysctl
    random: fix an incorrect __user annotation on proc_do_entropy
    net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
    net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl

    Linus Torvalds
     
  • Pull READ/WRITE_ONCE rework from Will Deacon:
    "This is the READ_ONCE rework I've been working on for a while, which
    bumps the minimum GCC version and improves code-gen on arm64 when
    stack protector is enabled"

    [ Side note: I'm _really_ tempted to raise the minimum gcc version to
    4.9, so that we can just say that we require _Generic() support.

    That would allow us to more cleanly handle a lot of the cases where we
    depend on very complex macros with 'sizeof' or __builtin_choose_expr()
    with __builtin_types_compatible_p() etc.

    This branch has a workaround for sparse not handling _Generic(),
    either, but that was already fixed in the sparse development branch,
    so it's really just gcc-4.9 that we'd require. - Linus ]

    * 'rwonce/rework' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux:
    compiler_types.h: Use unoptimized __unqual_scalar_typeof for sparse
    compiler_types.h: Optimize __unqual_scalar_typeof compilation time
    compiler.h: Enforce that READ_ONCE_NOCHECK() access size is sizeof(long)
    compiler-types.h: Include naked type in __pick_integer_type() match
    READ_ONCE: Fix comment describing 2x32-bit atomicity
    gcov: Remove old GCC 3.4 support
    arm64: barrier: Use '__unqual_scalar_typeof' for acquire/release macros
    locking/barriers: Use '__unqual_scalar_typeof' for load-acquire macros
    READ_ONCE: Drop pointer qualifiers when reading from scalar types
    READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses
    READ_ONCE: Simplify implementations of {READ,WRITE}_ONCE()
    arm64: csum: Disable KASAN for do_csum()
    fault_inject: Don't rely on "return value" from WRITE_ONCE()
    net: tls: Avoid assigning 'const' pointer to non-const pointer
    netfilter: Avoid assigning 'const' pointer to non-const pointer
    compiler/gcc: Raise minimum GCC version for kernel builds to 4.8

    Linus Torvalds
     

10 Jun, 2020

2 commits

  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
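
    The effect of the rule on individual call sites can be approximated
    with a text-level sketch in Python. This is only an illustration:
    Coccinelle matches on parsed C, whereas the regular expression below
    is a crude stand-in, and RENAMES, CALL_RE and rewrite are hypothetical
    names.

```python
import re

# Mapping copied from the semantic patch: rwsem call -> mmap locking API.
RENAMES = {
    "init_rwsem": "mmap_init_lock",
    "down_write": "mmap_write_lock",
    "down_write_killable": "mmap_write_lock_killable",
    "down_write_trylock": "mmap_write_trylock",
    "up_write": "mmap_write_unlock",
    "downgrade_write": "mmap_write_downgrade",
    "down_read": "mmap_read_lock",
    "down_read_killable": "mmap_read_lock_killable",
    "down_read_trylock": "mmap_read_trylock",
    "up_read": "mmap_read_unlock",
}

# Match name(&<expr>->mmap_sem) where <expr> contains no parentheses.
CALL_RE = re.compile(
    r"\b(" + "|".join(RENAMES) + r")\(\s*&\s*([^()]+?)->mmap_sem\s*\)"
)


def rewrite(line: str) -> str:
    """Replace one rwsem call on mm->mmap_sem with the mmap locking API."""
    return CALL_RE.sub(
        lambda m: f"{RENAMES[m.group(1)]}({m.group(2).strip()})", line
    )


before = "if (down_write_killable(&mm->mmap_sem))"
after = rewrite(before)
```

    Calls on other rwsems, such as down_read(&inode->i_rwsem), are left
    alone because the pattern requires the ->mmap_sem member.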

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The highlights are:

    - OSD/MDS latency and caps cache metrics infrastructure for the
    filesystem (Xiubo Li). Currently available through debugfs and will
    be periodically sent to the MDS in the future.

    - support for replica reads (balanced and localized reads) for rbd
    and the filesystem (myself). The default remains to always read
    from primary, users can opt-in with the new crush_location and
    read_from_replica options. Note that reading from replica is safe
    for general use only since Octopus.

    - support for RADOS allocation hint flags (myself). Currently used by
    rbd to propagate the compressible/incompressible hint given with
    the new compression_hint map option and ready for passing on more
    advanced hints, e.g. based on fadvise() from the filesystem.

    - support for efficient cross-quota-realm renames (Luis Henriques)

    - assorted cap handling improvements and cleanups, particularly
    untangling some of the locking (Jeff Layton)"

    * tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits)
    rbd: compression_hint option
    libceph: support for alloc hint flags
    libceph: read_from_replica option
    libceph: support for balanced and localized reads
    libceph: crush_location infrastructure
    libceph: decode CRUSH device/bucket types and names
    libceph: add non-asserting rbtree insertion helper
    ceph: skip checking caps when session reconnecting and releasing reqs
    ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
    ceph: don't return -ESTALE if there's still an open file
    libceph, rbd: replace zero-length array with flexible-array
    ceph: allow rename operation under different quota realms
    ceph: normalize 'delta' parameter usage in check_quota_exceeded
    ceph: ceph_kick_flushing_caps needs the s_mutex
    ceph: request expedited service on session's last cap flush
    ceph: convert mdsc->cap_dirty to a per-session list
    ceph: reset i_requested_max_size if file write is not wanted
    ceph: throw a warning if we destroy session with mutex still locked
    ceph: fix potential race in ceph_check_caps
    ceph: document what protects i_dirty_item and i_flushing_item
    ...

    Linus Torvalds
     

08 Jun, 2020

2 commits

  • cpumask_parse_user works on __user pointers, so this is wrong now.

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Reported-by: build test robot
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Pull networking fixes from David Miller:

    - Fix the build with certain Kconfig combinations for the Chelsio
    inline TLS device, from Rohit Maheshwar and Vinay Kumar Yadavi.

    - Fix leak in genetlink, from Cong Wang.

    - Fix out of bounds packet header accesses in seg6, from Ahmed
    Abdelsalam.

    - Two XDP fixes in the ENA driver, from Sameeh Jubran

    - Use rwsem in device rename instead of a seqcount because this code
    can sleep, from Ahmed S. Darwish.

    - Fix WoL regressions in r8169, from Heiner Kallweit.

    - Fix qed crashes in kdump mode, from Alok Prasad.

    - Fix the callbacks used for certain thermal zones in mlxsw, from Vadim
    Pasternak.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
    net: dsa: lantiq_gswip: fix and improve the unsupported interface error
    mlxsw: core: Use different get_trend() callbacks for different thermal zones
    net: dp83869: Reset return variable if PHY strap is read
    rhashtable: Drop raw RCU deref in nested_table_free
    cxgb4: Use kfree() instead kvfree() where appropriate
    net: qed: fixes crash while running driver in kdump kernel
    vsock/vmci: make vmci_vsock_transport_cb() static
    net: ethtool: Fix comment mentioning typo in IS_ENABLED()
    net: phy: mscc: fix Serdes configuration in vsc8584_config_init
    net: mscc: Fix OF_MDIO config check
    net: marvell: Fix OF_MDIO config check
    net: dp83867: Fix OF_MDIO config check
    net: dp83869: Fix OF_MDIO config check
    net: ethernet: mvneta: fix MVNETA_SKB_HEADROOM alignment
    ethtool: linkinfo: remove an unnecessary NULL check
    net/xdp: use shift instead of 64 bit division
    crypto/chtls:Fix compile error when CONFIG_IPV6 is disabled
    inet_connection_sock: clear inet_num out of destroy helper
    yam: fix possible memory leak in yam_init_driver
    lan743x: Use correct MAC_CR configuration for 1 GBit speed
    ...

    Linus Torvalds
     

07 Jun, 2020

1 commit

  • Pull Kbuild updates from Masahiro Yamada:

    - fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32

    - ensure all objects are rebuilt when the compiler is upgraded

    - exclude system headers from dependency tracking and fixdep processing

    - fix potential bit-size mismatch between the kernel and BPF user-mode
    helper

    - add the new syntax 'userprogs' to build user-space programs for the
    target architecture (the same arch as the kernel)

    - compile user-space sample code under samples/ for the target arch
    instead of the host arch

    - make headers_install fail if a CONFIG option is leaked to user-space

    - sanitize the output format of scripts/checkstack.pl

    - handle ARM 'push' instruction in scripts/checkstack.pl

    - error out before modpost if a module name conflict is found

    - error out when multiple directories are passed to M= because this
    feature has been broken for a long time

    - add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info

    - a lot of cleanups of modpost

    - dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
    second pass of modpost

    - do not run the second pass of modpost if nothing in modules is
    updated

    - install modules.builtin(.modinfo) by 'make install' as well as by
    'make modules_install' because it is useful even when
    CONFIG_MODULES=n

    - add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
    to allow users to use alternatives such as pigz, pbzip2, etc.

    * tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (96 commits)
    kbuild: add variables for compression tools
    Makefile: install modules.builtin even if CONFIG_MODULES=n
    mksysmap: Fix the mismatch of '.L' symbols in System.map
    kbuild: doc: rename LDFLAGS to KBUILD_LDFLAGS
    modpost: change elf_info->size to size_t
    modpost: remove is_vmlinux() helper
    modpost: strip .o from modname before calling new_module()
    modpost: set have_vmlinux in new_module()
    modpost: remove mod->skip struct member
    modpost: add mod->is_vmlinux struct member
    modpost: remove is_vmlinux() call in check_for_{gpl_usage,unused}()
    modpost: remove mod->is_dot_o struct member
    modpost: move -d option in scripts/Makefile.modpost
    modpost: remove -s option
    modpost: remove get_next_text() and make {grab,release_}file static
    modpost: use read_text_file() and get_line() for reading text files
    modpost: avoid false-positive file open error
    modpost: fix potential mmap'ed file overrun in get_src_version()
    modpost: add read_text_file() and get_line() helpers
    modpost: do not call get_modinfo() for vmlinux(.o)
    ...

    Linus Torvalds
     

06 Jun, 2020

4 commits

  • Pull AFS updates from David Howells:
    "There's some core VFS changes which affect a couple of filesystems:

    - Make the inode hash table RCU safe and provide some RCU-safe
    accessor functions. The search can then be done without taking the
    inode_hash_lock. Care must be taken because the object may be being
    deleted and no wait is made.

    - Allow iunique() to avoid taking the inode_hash_lock.

    - Allow AFS's callback processing to avoid taking the inode_hash_lock
    when using the inode table to find an inode to notify.

    - Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
    I've plugged this issue with try-lock in ext4 lazy time update.
    This solution is much better."

    Then there's a set of changes to make a number of improvements to the
    AFS driver:

    - Improve callback (ie. third party change notification) processing
    by:

    (a) Relying more on the fact we're doing this under RCU and by
    using fewer locks. This makes use of the RCU-based inode
    searching outlined above.

    (b) Moving to keeping volumes in a tree indexed by volume ID
    rather than a flat list.

    (c) Making the server and volume records logically part of the
    cell. This means that a server record now points directly at
    the cell and the tree of volumes is there. This removes an N:M
    mapping table, simplifying things.

    - Improve keeping NAT or firewall channels open for the server
    callbacks to reach the client by actively polling the fileserver on
    a timed basis, instead of only doing it when we have an operation
    to process.

    - Improve detection of delayed or lost callbacks by including the
    parent directory in the list of file IDs to be queried when doing a
    bulk status fetch from lookup. We can then check to see if our copy
    of the directory has changed under us without us getting notified.

    - Determine aliasing of cells (such as a cell that is pointed to by a
    DNS alias). This allows us to avoid having ambiguity due to
    apparently different cells using the same volume and file servers.

    - Improve the fileserver rotation to do more probing when it detects
    that all of the addresses to a server are listed as non-responsive.
    It's possible that an address that previously stopped responding
    has become responsive again.

    Beyond that, lay some foundations for making some calls asynchronous:

    - Turn the fileserver cursor struct into a general operation struct
    and hang the parameters off of that rather than keeping them in
    local variables and hang results off of that rather than the call
    struct.

    - Implement some general operation handling code and simplify the
    callers of operations that affect a volume or a volume component
    (such as a file). Most of the operation is now done by core code.

    - Operations are supplied with a table of operations to issue
    different variants of RPCs and to manage the completion, where all
    the required data is held in the operation object, thereby allowing
    these to be called from a workqueue.

    - Put the standard "if (begin), while(select), call op, end" sequence
    into a canned function that just emulates the current behaviour for
    now.

    There are also some fixes interspersed:

    - Don't let the EACCES from ICMP6 mapping reach the user as such,
    since it's confusing as to whether it's a filesystem error. Convert
    it to EHOSTUNREACH.

    - Don't use the epoch value acquired through probing a server. If we
    have two servers with the same UUID but in different cells, it's
    hard to draw conclusions from them having different epoch values.

    - Don't interpret the argument to the CB.ProbeUuid RPC as a
    fileserver UUID and look up a fileserver from it.

    - Deal with servers in different cells having the same UUIDs. In the
    event that a CB.InitCallBackState3 RPC is received, we have to
    break the callback promises for every server record matching that
    UUID.

    - Don't let afs_statfs return values that go below 0.

    - Don't use running fileserver probe state to make server selection
    and address selection decisions on. Only make decisions on final
    state as the running state is cleared at the start of probing"

    Acked-by: Al Viro (fs/inode.c part)

    * tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
    afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
    afs: Show more a bit more server state in /proc/net/afs/servers
    afs: Don't use probe running state to make decisions outside probe code
    afs: Fix afs_statfs() to not let the values go below zero
    afs: Fix the by-UUID server tree to allow servers with the same UUID
    afs: Reorganise volume and server trees to be rooted on the cell
    afs: Add a tracepoint to track the lifetime of the afs_volume struct
    afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
    afs: Detect cell aliases 2 - Cells with no root volumes
    afs: Detect cell aliases 1 - Cells with root volumes
    afs: Implement client support for the YFSVL.GetCellName RPC op
    afs: Retain more of the VLDB record for alias detection
    afs: Fix handling of CB.ProbeUuid cache manager op
    afs: Don't get epoch from a server because it may be ambiguous
    afs: Build an abstraction around an "operation" concept
    afs: Rename struct afs_fs_cursor to afs_operation
    afs: Remove the error argument from afs_protocol_error()
    afs: Set error flag rather than return error from file status decode
    afs: Make callback processing more efficient.
    afs: Show more information in /proc/net/afs/servers
    ...

    Linus Torvalds
     
  • Pull rdma updates from Jason Gunthorpe:
    "A more active cycle than most of the recent past, with a few large,
    long discussed works this time.

    The RNBD block driver has been posted for nearly two years now, and is
    flowing through RDMA due to it also introducing a new ULP.

    The removal of FMR has been a recurring discussion theme for a long
    time.

    And the usual smattering of features and bug fixes.

    Summary:

    - Various small driver bug fixes in rxe, mlx5, hfi1, and efa

    - Continuing driver cleanups in bnxt_re, hns

    - Big cleanup of mlx5 QP creation flows

    - More consistent use of src port and flow label when LAG is used and
    a mlx5 implementation

    - Additional set of cleanups for IB CM

    - 'RNBD' network block driver and target. This is a network block
    RDMA device specific to ionos's cloud environment. It brings strong
    multipath and resiliency capabilities.

    - Accelerated IPoIB for HFI1

    - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
    async fds

    - Support for exchanging the new IBTA defined ECE data during RDMA CM
    exchanges

    - Removal of the very old and insecure FMR interface from all ULPs
    and drivers. FRWR should have been preferred for at least a decade now"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
    RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
    RDMA/mlx5: Return ECE DC support
    RDMA/mlx5: Don't rely on FW to set zeros in ECE response
    RDMA/mlx5: Return an error if copy_to_user fails
    IB/hfi1: Use free_netdev() in hfi1_netdev_free()
    RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
    RDMA/core: Move and rename trace_cm_id_create()
    IB/hfi1: Fix hfi1_netdev_rx_init() error handling
    RDMA: Remove 'max_map_per_fmr'
    RDMA: Remove 'max_fmr'
    RDMA/core: Remove FMR device ops
    RDMA/rdmavt: Remove FMR memory registration
    RDMA/mthca: Remove FMR support for memory registration
    RDMA/mlx4: Remove FMR support for memory registration
    RDMA/i40iw: Remove FMR leftovers
    RDMA/bnxt_re: Remove FMR leftovers
    RDMA/mlx5: Remove FMR leftovers
    RDMA/core: Remove FMR pool API
    RDMA/rds: Remove FMR support for memory registration
    RDMA/srp: Remove support for FMR memory registration
    ...

    Linus Torvalds
     
  • Fix the following gcc-9.3 warning when building with 'make W=1':
    net/vmw_vsock/vmci_transport.c:2058:6: warning: no previous prototype
    for ‘vmci_vsock_transport_cb’ [-Wmissing-prototypes]
    2058 | void vmci_vsock_transport_cb(bool is_host)
    | ^~~~~~~~~~~~~~~~~~~~~~~

    Fixes: b1bba80a4376 ("vsock/vmci: register vmci_transport only when VMCI guest/host are active")
    Reported-by: kernel test robot
    Signed-off-by: Stefano Garzarella
    Signed-off-by: David S. Miller

    Stefano Garzarella
     
  • This code generates a Smatch warning:

    net/ethtool/linkinfo.c:143 ethnl_set_linkinfo()
    warn: variable dereferenced before check 'info' (see line 119)

    Fortunately, the "info" pointer is never NULL so the check can be
    removed.

    Signed-off-by: Dan Carpenter
    Reviewed-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Dan Carpenter
     

05 Jun, 2020

7 commits

  • 64bit division is kind of expensive, and shift should do the job here.
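
    The commit message does not name the divisor, so the following Python
    sketch assumes a power-of-two page size (PAGE_SHIFT = 12 is an
    illustrative value, and both function names are hypothetical). Under
    that assumption a right shift gives the same result as the division
    for non-negative sizes.

```python
PAGE_SHIFT = 12                # assumed for illustration (4 KiB pages)
PAGE_SIZE = 1 << PAGE_SHIFT    # 4096


def pages_by_division(size: int) -> int:
    """Number of whole pages, computed with an integer division."""
    return size // PAGE_SIZE


def pages_by_shift(size: int) -> int:
    """Same quantity, computed with a right shift."""
    return size >> PAGE_SHIFT


# Sample sizes, including values above 32 bits where the cost matters most.
sizes = [0, 1, 4095, 4096, 4097, (1 << 32) + 123, (1 << 40) - 1]
results = [(pages_by_division(s), pages_by_shift(s)) for s in sizes]
```

    The equivalence only holds because the divisor is a power of two; a
    general divisor would still need a real division.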

    Signed-off-by: Pavel Machek (CIP)
    Signed-off-by: David S. Miller

    Pavel Machek
     
  • Clearing the 'inet_num' field is necessary and safe if and
    only if the socket is not bound. The MPTCP protocol calls
    the destroy helper on bound sockets, as tcp_v{4,6}_syn_recv_sock
    completed successfully.

    Move the clearing of such field out of the common code, otherwise
    the MPTCP MP_JOIN error path will find the wrong 'inet_num' value
    on socket disposal, __inet_put_port() will acquire the wrong lock
    and bind_node removal could race with other modifiers possibly
    corrupting the bind hash table.

    Reported-and-tested-by: Christoph Paasch
    Fixes: 729cd6436f35 ("mptcp: cope better with MP_JOIN failure")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Sequence counters write paths are critical sections that must never be
    preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

    Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
    netdev name retrieval.") handled a deadlock, observed with
    CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
    infinitely spinning: it got scheduled after the seqcount write side
    blocked inside its own critical section.

    To fix that deadlock, among other issues, the commit added a
    cond_resched() inside the read side section. While this will get the
    non-preemptible kernel eventually unstuck, the seqcount reader is fully
    exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

    The fix is also still broken: if the seqcount reader belongs to a
    real-time scheduling policy, it can spin forever and the kernel will
    livelock.

    Disabling preemption over the seqcount write side critical section will
    not work: inside it are a number of GFP_KERNEL allocations and mutex
    locking through the drivers/base/ :: device_rename() call chain.

    From all the above, replace the seqcount with a rwsem.
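
    The spinning described above can be illustrated with a toy,
    single-threaded Python model of a sequence counter. This is a sketch
    under simplifying assumptions, not the kernel implementation: SeqCount
    and reader_attempts are hypothetical names, and a blocked writer is
    modelled simply by not calling write_end().

```python
class SeqCount:
    """Toy model of a sequence counter (illustrative only)."""

    def __init__(self) -> None:
        self.sequence = 0

    def write_begin(self) -> None:
        self.sequence += 1  # odd value: a write is in progress

    def write_end(self) -> None:
        self.sequence += 1  # even value: data is stable again

    def read_begin(self) -> int:
        return self.sequence

    def read_retry(self, start: int) -> bool:
        # Retry if a writer was active at the start or has run since.
        return (start & 1) == 1 or self.sequence != start


def reader_attempts(seq: SeqCount, max_spins: int) -> int:
    """Count failed read attempts, giving up after max_spins."""
    spins = 0
    while spins < max_spins:
        start = seq.read_begin()
        if not seq.read_retry(start):
            break
        spins += 1
    return spins


seq = SeqCount()
seq.write_begin()                     # writer enters its critical section
stuck = reader_attempts(seq, 1000)    # writer "blocked": reader only spins
seq.write_end()                       # writer finally completes
done = reader_attempts(seq, 1000)     # reader now succeeds immediately
```

    Here stuck reaches the 1000-spin cap while done is 0. The reader makes
    no progress for as long as the writer sits inside its section, which
    is why a sleeping lock such as a rwsem is the better fit when the
    write side can block.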

    Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
    Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
    Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
    Cc:
    Reported-by: kbuild test robot [ v1 missing up_read() on error exit ]
    Reported-by: Dan Carpenter [ v1 missing up_read() on error exit ]
    Signed-off-by: Ahmed S. Darwish
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: David S. Miller

    Ahmed S. Darwish
     
  • The seg6_validate_srh() is used to validate SRH for three cases:

    case1: SRH of data-plane SRv6 packets to be processed by the Linux kernel.
    case2: SRH of the netlink message received from user-space (iproute2)
    case3: SRH injected into packets through setsockopt

    In case1, the SRH can be encoded in the Reduced way (i.e., first SID is
    carried in DA only and not represented as SID in the SRH) and the
    seg6_validate_srh() now handles this case correctly.

    In case2 and case3, the SRH shouldn’t be encoded in the Reduced way
    otherwise we lose the first segment (i.e., the first hop).

    The current implementation of seg6_validate_srh() allows the SRH of case2
    and case3 to be encoded in the Reduced way. This leads to a
    slab-out-of-bounds problem.

    This patch verifies the SRH in case1, case2 and case3, allowing case1 to
    be reduced while preventing the SRH of case2 and case3 from being reduced.

    Reported-by: syzbot+e8c028b62439eac42073@syzkaller.appspotmail.com
    Reported-by: YueHaibing
    Fixes: 0cb7498f234e ("seg6: fix SRH processing to comply with RFC8754")
    Signed-off-by: Ahmed Abdelsalam
    Signed-off-by: David S. Miller

    Ahmed Abdelsalam
     
  • syzbot found the following crash:

    general protection fault, probably for non-canonical address 0xdffffc0000000019: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
    CPU: 1 PID: 7060 Comm: syz-executor394 Not tainted 5.7.0-rc6-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__tipc_sendstream+0xbde/0x11f0 net/tipc/socket.c:1591
    Code: 00 00 00 00 48 39 5c 24 28 48 0f 44 d8 e8 fa 3e db f9 48 b8 00 00 00 00 00 fc ff df 48 8d bb c8 00 00 00 48 89 fa 48 c1 ea 03 3c 02 00 0f 85 e2 04 00 00 48 8b 9b c8 00 00 00 48 b8 00 00 00
    RSP: 0018:ffffc90003ef7818 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8797fd9d
    RDX: 0000000000000019 RSI: ffffffff8797fde6 RDI: 00000000000000c8
    RBP: ffff888099848040 R08: ffff88809a5f6440 R09: fffffbfff1860b4c
    R10: ffffffff8c305a5f R11: fffffbfff1860b4b R12: ffff88809984857e
    R13: 0000000000000000 R14: ffff888086aa4000 R15: 0000000000000000
    FS: 00000000009b4880(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000140 CR3: 00000000a7fdf000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    tipc_sendstream+0x4c/0x70 net/tipc/socket.c:1533
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xcf/0x120 net/socket.c:672
    ____sys_sendmsg+0x32f/0x810 net/socket.c:2352
    ___sys_sendmsg+0x100/0x170 net/socket.c:2406
    __sys_sendmmsg+0x195/0x480 net/socket.c:2496
    __do_sys_sendmmsg net/socket.c:2525 [inline]
    __se_sys_sendmmsg net/socket.c:2522 [inline]
    __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2522
    do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
    RIP: 0033:0x440199
    ...

    This bug was bisected to commit 0a3e060f340d ("tipc: add test for Nagle
    algorithm effectiveness"). However, that commit is not the cause. The
    trouble comes from the base code: when sending a message with zero data
    length, we would unexpectedly end up with an empty 'txq' queue after
    'tipc_msg_append()' in Nagle mode.

    A similar crash can be generated even without the bisected patch but at
    the link layer when it accesses the empty queue.

    We solve the issues by building at least one buffer to go with socket's
    header and an optional data section that may be empty like what we had
    with the 'tipc_msg_build()'.

    Note: the previous commit 4c21daae3dbc ("tipc: Fix NULL pointer
    dereference in __tipc_sendstream()") is obsoleted by this one, since the
    'txq' will never be empty and the check of 'skb != NULL' is unnecessary,
    but it is safe anyway.

    Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
    Fixes: c0bceb97db9e ("tipc: add smart nagle feature")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • There are two kinds of memory leaks in genl_family_rcv_msg_dumpit():

    1. Before we call ops->start(), whenever an error happens, we forget
    to free the memory allocated in genl_family_rcv_msg_dumpit().

    2. When ops->start() fails, the 'info' has been already installed on
    the per socket control block, so we should not free it here. More
    importantly, nlk->cb_running is still false at this point, so
    netlink_sock_destruct() cannot free it either.

    The first kind of memory leaks is easier to resolve, but the second
    one requires some deeper thoughts.

    After reviewing how netfilter handles this, the most elegant solution
    I find is just to use a similar way to allocate the memory, that is,
    moving memory allocations from caller into ops->start(). With this,
    we can solve both kinds of memory leaks: for 1), no memory allocation
    happens before ops->start(); for 2), ops->start() handles its own
    failures and 'info' is installed to the socket control block only
    on success. The only ugliness here is we have to pass all local
    variables on stack via a struct, but this is not hard to understand.

    Alternatively, we can introduce a ops->free() to solve this too,
    but it is overkill as only genetlink has this problem so far.

    Fixes: 1927f41a22a0 ("net: genetlink: introduce dump info struct to be available during dumpit op")
    Reported-by: syzbot+21f04f481f449c8db840@syzkaller.appspotmail.com
    Cc: "Jason A. Donenfeld"
    Cc: Florian Westphal
    Cc: Pablo Neira Ayuso
    Cc: Jiri Pirko
    Cc: YueHaibing
    Cc: Shaochun Chen
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly once

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

04 Jun, 2020

3 commits

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow dumping the cgroup id and filtering by it in the inet_diag
    code, from Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fall back to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers: it is netdev_tx_t, but many drivers were using 'int'. From
    Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     
  • Pull splice updates from Al Viro:
    "Christoph's assorted splice cleanups"

    * 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: rename pipe_buf ->steal to ->try_steal
    fs: make the pipe_buf_operations ->confirm operation optional
    fs: make the pipe_buf_operations ->steal operation optional
    trace: remove tracing_pipe_buf_ops
    pipe: merge anon_pipe_buf*_ops
    fs: simplify do_splice_from
    fs: simplify do_splice_to

    Linus Torvalds
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Also allowing attachment to subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset
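
    The pidfd form of setns() described in the pull message above can be
    sketched from userspace roughly as follows. This is a minimal sketch, not
    code from the series: the helper name enter_namespaces_of() is
    hypothetical, the fallback syscall number 434 is an assumption for old
    headers, and it assumes a kernel that accepts pidfds in setns() plus
    sufficient privileges over the target namespaces.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>        /* setns(), CLONE_NEW* */
#include <sys/syscall.h>  /* SYS_pidfd_open, if the headers know it */
#include <sys/types.h>
#include <unistd.h>       /* syscall(), close() */

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: number used on most architectures */
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of `pid`
 * selected by `nstypes` (an OR of CLONE_NEW* flags) in one setns() call.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enter_namespaces_of(pid_t pid, int nstypes)
{
    /* The pull message states that 0 is not valid together with a pidfd. */
    if (nstypes == 0) {
        errno = EINVAL;
        return -1;
    }

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    /* Either every requested namespace is attached, or none is. */
    int ret = setns(pidfd, nstypes);

    int saved_errno = errno;  /* close() must not clobber the setns() error */
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

    A caller would use it as
    enter_namespaces_of(target, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET),
    replacing one open+setns pair per namespace file.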

    Linus Torvalds
     

03 Jun, 2020

9 commits

  • Pull audit updates from Paul Moore:
    "Summary of the significant patches:

    - Record information about binds/unbinds to the audit multicast
    socket. This helps identify which processes have/had access to the
    information in the audit stream.

    - Clean up and add some additional information to the netfilter
    configuration events collected by audit.

    - Fix some of the audit error handling code so we don't leak network
    namespace references"

    * tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: add subj creds to NETFILTER_CFG record to
    audit: Replace zero-length array with flexible-array
    audit: make symbol 'audit_nfcfgs' static
    netfilter: add audit table unregister actions
    audit: tidy and extend netfilter_cfg x_tables
    audit: log audit netlink multicast bind and unbind
    audit: fix a net reference leak in audit_list_rules_send()
    audit: fix a net reference leak in audit_send_reply()

    Linus Torvalds
     
  • Now that FMR support is gone, this attribute can be deleted from all
    places.

    Link: https://lore.kernel.org/r/12-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Bernard Metzler
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Use FRWR method for memory registration by default and remove the ancient
    and unsafe FMR method.

    Link: https://lore.kernel.org/r/3-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
    Signed-off-by: Max Gurtovoy
    Signed-off-by: Jason Gunthorpe

    Max Gurtovoy
     
  • This reverts commit 441870ee4240cf67b5d3ab8e16216a9ff42eb5d6.

    Like the previous patch in this series, we revert the above commit that
    causes similar issues with the 'aead' object.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • This reverts commit de058420767df21e2b6b0f3bb36d1616fb962032.

    There is no actual tipc_node refcnt leak as stated in the above commit.
    The refcnt is held carefully for the case of an asynchronous decryption
    (i.e. -EINPROGRESS/-EBUSY and skb = NULL is returned), so that the node
    object cannot be freed in the meantime. The counter will be re-balanced
    when the operation's callback arrives with the decrypted buffer if any.
    In other cases, e.g. a synchronous crypto the counter will be decreased
    immediately when it is done.

    With that commit, a kernel panic occurs when no node is found (i.e.
    n = NULL) in 'tipc_rcv()', or when the node object is released
    prematurely.

    This commit solves the issues by reverting the said commit, but keeping
    the one valid case in which 'skb_linearize()' fails.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Tested-by: Hoang Le
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • Add a bpf_csum_level() helper which BPF programs can use in combination
    with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
    flag to the latter to avoid falling back to CHECKSUM_NONE.

    The bpf_csum_level() helper allows adjusting CHECKSUM_UNNECESSARY skb->csum_levels
    via BPF_CSUM_LEVEL_{INC,DEC}, which call __skb_{incr,decr}_checksum_unnecessary()
    on the skb. The helper also accepts BPF_CSUM_LEVEL_RESET, which sets the skb's
    csum to CHECKSUM_NONE, and BPF_CSUM_LEVEL_QUERY, which just returns the
    current level. Without this helper, there is no other way to adjust
    skb->csum_level. I did not add an extra dummy flags argument, as there is plenty
    of free bit space in the level argument itself if ever needed in the future.
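
    The four operations described above can be illustrated with a rough
    userspace model. This is not the kernel code: the struct, the enum names,
    the level cap of 3, the behaviour of a decrement at level 0, and the
    -EACCES result for a query on a packet that is not CHECKSUM_UNNECESSARY
    are all assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { CSUM_NONE, CSUM_UNNECESSARY };                     /* models skb->ip_summed */
enum { LEVEL_QUERY, LEVEL_INC, LEVEL_DEC, LEVEL_RESET };  /* models BPF_CSUM_LEVEL_* */
#define MAX_CSUM_LEVEL 3                                  /* assumed cap */

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct pkt_meta {
    int ip_summed;
    uint8_t csum_level;
};

long model_csum_level(struct pkt_meta *m, int op)
{
    assert(m != NULL);

    switch (op) {
    case LEVEL_INC:
        if (m->ip_summed == CSUM_UNNECESSARY) {
            if (m->csum_level < MAX_CSUM_LEVEL)
                m->csum_level++;
        } else {
            /* First validated checksum: start at level 0. */
            m->ip_summed = CSUM_UNNECESSARY;
            m->csum_level = 0;
        }
        return 0;
    case LEVEL_DEC:
        if (m->ip_summed == CSUM_UNNECESSARY) {
            if (m->csum_level == 0)
                m->ip_summed = CSUM_NONE;  /* nothing validated any more */
            else
                m->csum_level--;
        }
        return 0;
    case LEVEL_RESET:
        m->ip_summed = CSUM_NONE;
        m->csum_level = 0;
        return 0;
    case LEVEL_QUERY:
        return m->ip_summed == CSUM_UNNECESSARY ? m->csum_level : -EACCES;
    default:
        return -EINVAL;
    }
}
```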

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Alan Maguire
    Acked-by: Lorenz Bauer
    Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Lorenz recently reported:

    In our TC classifier cls_redirect [0], we use the following sequence of
    helper calls to decapsulate a GUE (basically IP + UDP + custom header)
    encapsulated packet:

    bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

    It seems like some checksums of the inner headers are not validated in
    this case. For example, a TCP SYN packet with invalid TCP checksum is
    still accepted by the network stack and elicits a SYN ACK. [...]

    That is, we receive the following packet from the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

    ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
    On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

    Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
    into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
    turned into a no-op due to CHECKSUM_UNNECESSARY.

    The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
    it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
    not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
    skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
    there is no way to adjust the skb->csum_level. NICs that have checksum offload
    disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

    Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
    add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
    to opt out. Opting out is useful for the case where we don't remove/add
    full protocol headers, or for the case where a user wants to adjust the csum
    level manually, e.g. through the bpf_csum_level() helper that is added in a
    subsequent patch.
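
    The safe default and the opt-out described above amount to the following
    decision, shown here as a small userspace model rather than the kernel
    change itself. The struct, the enum names and the flag value are
    hypothetical stand-ins; only the behaviour (reset CHECKSUM_UNNECESSARY to
    CHECKSUM_NONE unless the caller opts out) is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CSUM_NONE, CSUM_UNNECESSARY, CSUM_COMPLETE };  /* models skb->ip_summed */

/* Stand-in for BPF_F_ADJ_ROOM_NO_CSUM_RESET; the value is not the real one. */
#define ADJ_ROOM_NO_CSUM_RESET (1u << 0)

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct pkt_meta {
    int ip_summed;
    uint8_t csum_level;
};

/* Checksum-state part of a room adjustment, as described in the text. */
void adjust_room_csum(struct pkt_meta *m, unsigned int flags)
{
    assert(m != NULL);

    if (flags & ADJ_ROOM_NO_CSUM_RESET)
        return;  /* caller takes responsibility for the csum level */

    if (m->ip_summed == CSUM_UNNECESSARY) {
        /* Headers may have been removed: force the stack to re-validate. */
        m->ip_summed = CSUM_NONE;
        m->csum_level = 0;
    }
}
```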

    The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
    bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
    as well but doesn't change layers, only transitions between v4 and v6 and vice
    versa, therefore no adaptation is required there.

    [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

    Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper")
    Reported-by: Lorenz Bauer
    Reported-by: Alan Maguire
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Alan Maguire
    Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
    Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Michael Kelley [hyperv]
    Acked-by: Gao Xiang [erofs]
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Wei Liu
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
    Switch all callers to map_kernel_range, which is symmetric to the unmap side
    (as well as the _noflush versions).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-17-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

02 Jun, 2020

8 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-06-01

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 55 non-merge commits during the last 1 day(s) which contain
    a total of 91 files changed, 4986 insertions(+), 463 deletions(-).

    The main changes are:

    1) Add rx_queue_mapping to bpf_sock from Amritha.

    2) Add BPF ring buffer, from Andrii.

    3) Attach and run programs through devmap, from David.

    4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc.

    5) Link-based flow_dissector, from Jakub.

    6) Use tracing helpers for lsm programs, from Jiri.

    7) Several sk_msg fixes and extensions, from John.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Extends support to IPv6 for Inline TLS server.

    Signed-off-by: Vinay Kumar Yadav

    v1->v2:
    - cc'd tcp folks.

    v2->v3:
    - changed EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()

    Signed-off-by: David S. Miller

    Vinay Kumar Yadav
     
  • Socket option IPV6_ADDRFORM supports UDP/UDPLITE and TCP at present.
    Previously the checking logic looked like:
        if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
                do_some_check;
        else if (sk->sk_protocol != IPPROTO_TCP)
                break;

    After commit b6f6118901d1 ("ipv6: restrict IPV6_ADDRFORM operation"), TCP
    was blocked as the logic changed to:
        if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
                do_some_check;
        else if (sk->sk_protocol == IPPROTO_TCP)
                do_some_check;
                break;
        else
                break;

    Then after commit 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
    UDP/UDPLITE were blocked as the logic changed to:
        if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
                do_some_check;
        if (sk->sk_protocol == IPPROTO_TCP)
                do_some_check;

        if (sk->sk_protocol != IPPROTO_TCP)
                break;

    Fix it by using Eric's code and simply removing the break in the TCP check,
    which looks like:
        if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
                do_some_check;
        else if (sk->sk_protocol == IPPROTO_TCP)
                do_some_check;
        else
                break;
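
    The four variants above can be compared side by side with a small model.
    This is a sketch under stated assumptions, not the kernel code: the enum
    and function names are hypothetical, do_some_check is assumed to pass, and
    each function returns true when control reaches the code after the
    protocol check and false when the pseudocode hits "break".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the sk->sk_protocol values in the message. */
enum proto { PROTO_UDP, PROTO_UDPLITE, PROTO_TCP, PROTO_OTHER };

bool is_udp(enum proto p)
{
    return p == PROTO_UDP || p == PROTO_UDPLITE;
}

/* Original logic: UDP/UDPLITE and TCP pass, everything else breaks. */
bool gate_original(enum proto p)
{
    if (is_udp(p))
        return true;
    else if (p != PROTO_TCP)
        return false;
    return true;
}

/* After b6f6118901d1: the stray break in the TCP branch blocks TCP. */
bool gate_after_b6f6118901d1(enum proto p)
{
    if (is_udp(p))
        return true;
    else
        return false;  /* TCP: check, then break; others: break */
}

/* After 82c9ae440857: the final "!= TCP" test blocks UDP/UDPLITE. */
bool gate_after_82c9ae440857(enum proto p)
{
    return p == PROTO_TCP;
}

/* The fix: same accept set as the original logic. */
bool gate_fixed(enum proto p)
{
    if (is_udp(p))
        return true;
    else if (p == PROTO_TCP)
        return true;
    else
        return false;
}

/* The fixed gate should agree with the original one for every protocol. */
void check_fix_restores_original(void)
{
    for (int p = PROTO_UDP; p <= PROTO_OTHER; p++)
        assert(gate_fixed((enum proto)p) == gate_original((enum proto)p));
}
```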

    Fixes: 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
    tipc_sendstream() may send a zero-length packet. In that case
    tipc_msg_append() does not allocate an skb, skb_peek_tail() returns NULL,
    and msg_set_ack_required() triggers a NULL pointer dereference.

    Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
    Fixes: 0a3e060f340d ("tipc: add test for Nagle algorithm effectiveness")
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • Move functions to manage BPF programs attached to netns that are not
    specific to flow dissector to a dedicated module named
    bpf/net_namespace.c.

    The set of functions will grow with the addition of bpf_link support for
    netns attached programs. This patch prepares ground by creating a place
    for it.

    This is a code move with no functional changes intended.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com

    Jakub Sitnicki
     
  • In order to:

    (1) attach more than one BPF program type to netns, or
    (2) support attaching BPF programs to netns with bpf_link, or
    (3) support multi-prog attach points for netns

    we will need to keep more state per netns than a single pointer like we
    have now for BPF flow dissector program.

    Prepare for the above by extracting netns_bpf that is part of struct net,
    for storing all state related to BPF programs attached to netns.

    Turn flow dissector callbacks for querying/attaching/detaching a program
    into generic ones that operate on netns_bpf. Next patch will move the
    generic callbacks into their own module.

    This is similar to how it is organized for cgroup with cgroup_bpf.
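
    The idea of keeping one program slot per attach type in a per-netns
    container can be illustrated with a simplified userspace model. All names
    here are hypothetical, locking and reference counting are omitted, and
    the attach policy (refuse to overwrite an occupied slot with -EEXIST) is
    a choice made for this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attach types; the text only names the flow dissector. */
enum netns_attach_type { NETNS_FLOW_DISSECTOR, MAX_NETNS_ATTACH_TYPE };

struct prog { const char *name; };  /* stand-in for a BPF program */

/* Models the per-netns container: one slot per attach type. */
struct netns_bpf_model {
    struct prog *progs[MAX_NETNS_ATTACH_TYPE];
};

static int valid_type(int type)
{
    return type >= 0 && type < MAX_NETNS_ATTACH_TYPE;
}

/* Generic callbacks: none of them knows about the flow dissector. */
struct prog *netns_query(const struct netns_bpf_model *nb, int type)
{
    assert(nb != NULL);
    return valid_type(type) ? nb->progs[type] : NULL;
}

int netns_attach(struct netns_bpf_model *nb, int type, struct prog *p)
{
    assert(nb != NULL);
    if (!valid_type(type) || p == NULL)
        return -EINVAL;
    if (nb->progs[type] != NULL)
        return -EEXIST;  /* sketch policy, see lead-in */
    nb->progs[type] = p;
    return 0;
}

int netns_detach(struct netns_bpf_model *nb, int type)
{
    assert(nb != NULL);
    if (!valid_type(type))
        return -EINVAL;
    if (nb->progs[type] == NULL)
        return -ENOENT;
    nb->progs[type] = NULL;
    return 0;
}
```

    Adding another program type in this model only means adding an enum
    entry, which is the kind of growth the commit message anticipates.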

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Cc: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Split out the part of attach callback that happens with attach/detach lock
    acquired. This structures the prog attach callback in a way that opens up
    doors for moving the locking out of flow_dissector and into generic
    callbacks for attaching/detaching progs to netns in subsequent patches.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com

    Jakub Sitnicki
     
    Extend the supported sockopts in bpf_setsockopt with
    SO_BINDTODEVICE. We call sock_bindtoindex with parameter
    lock_sk = false in this context because we already own
    the socket.

    Signed-off-by: Ferenc Fejes
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu

    Ferenc Fejes