22 Dec, 2020

1 commit

  • Pull 9p update from Dominique Martinet:

    - fix long-standing limitation on open-unlink-fop pattern

    - add refcount to p9_fid (fixes the above and will allow for more
    cleanups and simplifications in the future)

    * tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
    9p: Remove unnecessary IS_ERR() check
    9p: Uninitialized variable in v9fs_writeback_fid()
    9p: Fix writeback fid incorrectly being attached to dentry
    9p: apply review requests for fid refcounting
    9p: add refcount to p9_fid struct
    fs/9p: search open fids first
    fs/9p: track open fids
    fs/9p: fix create-unlink-getattr idiom

    Linus Torvalds
     

18 Dec, 2020

8 commits

  • Pull networking fixes from Jakub Kicinski:
    "Current release - always broken:

    - net/smc: fix access to parent of an ib device

    - devlink: use _BITUL() macro instead of BIT() in the UAPI header

    - handful of mptcp fixes

    Previous release - regressions:

    - intel: AF_XDP: clear the status bits for the next_to_use descriptor

    - dpaa2-eth: fix the size of the mapped SGT buffer

    Previous release - always broken:

    - mptcp: fix security context on server socket

    - ethtool: fix string set id check

    - ethtool: fix error paths in ethnl_set_channels()

    - lan743x: fix rx_napi_poll/interrupt ping-pong

    - qca: ar9331: fix sleeping function called from invalid context bug"

    * tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
    net/sched: sch_taprio: reset child qdiscs before freeing them
    nfp: move indirect block cleanup to flower app stop callback
    octeontx2-af: Fix undetected unmap PF error check
    net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments"
    qlcnic: Fix error code in probe
    mptcp: fix pending data accounting
    mptcp: push pending frames when subflow has free space
    mptcp: properly annotate nested lock
    mptcp: fix security context on server socket
    net/mlx5: Fix compilation warning for 32-bit platform
    mptcp: clear use_ack and use_map when dropping other suboptions
    devlink: use _BITUL() macro instead of BIT() in the UAPI header
    net: korina: fix return value
    net/smc: fix access to parent of an ib device
    ethtool: fix error paths in ethnl_set_channels()
    nfc: s3fwrn5: Remove unused NCI prop commands
    nfc: s3fwrn5: Remove the delay for NFC sleep
    phy: fix kdoc warning
    tipc: do sanity check payload of a netlink message
    use __netdev_notify_peers in hyperv
    ...

    Linus Torvalds
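
For the "devlink: use _BITUL() macro instead of BIT() in the UAPI header" item above: BIT() is a kernel-internal helper, whereas _BITUL() comes from the exported <linux/const.h>, so userspace can compile against the header. A minimal sketch of the idea follows. SKETCH_BITUL and the flag names are hypothetical stand-ins, not the devlink definitions.

```c
/*
 * Local stand-in with the same shape as _BITUL(); a real UAPI header
 * would include <linux/const.h> and use _BITUL() directly.
 */
#define SKETCH_BITUL(x) (1UL << (x))

/* Hypothetical flag bits, defined without any kernel-internal macro. */
#define SKETCH_FLAG_A SKETCH_BITUL(0)
#define SKETCH_FLAG_B SKETCH_BITUL(1)

/* Returns 1 when flag `f` is set in `flags`, 0 otherwise. */
static int sketch_has_flag(unsigned long flags, unsigned long f)
{
    return (flags & f) != 0;
}
```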
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Features:

    - NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
    support

    - A series of patches to improve readdir performance, particularly
    with large directories

    - Basic support for using NFS/RDMA with the pNFS files and flexfiles
    drivers

    - Micro-optimisations for RDMA

    - RDMA tracing improvements

    Bugfixes:

    - Fix a long standing bug with xs_read_xdr_buf() when receiving
    partial pages (Dan Aloni)

    - Various fixes for getxattr and listxattr, when used over non-TCP
    transports

    - Fixes for containerised NFS from Sargun Dhillon

    - switch nfsiod to be an UNBOUND workqueue (Neil Brown)

    - READDIR should not ask for security label information if there is
    no LSM policy (Olga Kornievskaia)

    - Avoid using interval-based rebinding with TCP in lockd (Calum
    Mackay)

    - A series of RPC and NFS layer fixes to support the NFSv4.2
    READ_PLUS code

    - A couple of fixes for pnfs/flexfiles read failover

    Cleanups:

    - Various cleanups for the SUNRPC xdr code in conjunction with the
    READ_PLUS fixes"

    * tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
    NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
    pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
    NFSv4/pnfs: Add tracing for the deviceid cache
    fs/lockd: convert comma to semicolon
    NFSv4.2: fix error return on memory allocation failure
    NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
    NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
    NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
    NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
    NFSv4.2: decode_read_plus_hole() needs to check the extent offset
    NFSv4.2: decode_read_plus_data() must skip padding after data segment
    NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
    SUNRPC: When expanding the buffer, we may need grow the sparse pages
    SUNRPC: Cleanup - constify a number of xdr_buf helpers
    SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
    SUNRPC: _copy_to/from_pages() now check for zero length
    SUNRPC: Cleanup xdr_shrink_bufhead()
    SUNRPC: Fix xdr_expand_hole()
    SUNRPC: Fixes for xdr_align_data()
    SUNRPC: _shift_data_left/right_pages should check the shift length
    ...

    Linus Torvalds
     
  • Pull ceph updates from Ilya Dryomov:
    "The big ticket item here is support for msgr2 on-wire protocol, which
    adds the option of full in-transit encryption using AES-GCM algorithm
    (myself).

    On top of that we have a series to avoid intermittent errors during
    recovery with recover_session=clean and some MDS request encoding work
    from Jeff, a cap handling fix and assorted observability improvements
    from Luis and Xiubo and a good number of cleanups.

    Luis also ran into a corner case with quotas which sadly means that we
    are back to denying cross-quota-realm renames"

    * tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
    libceph: drop ceph_auth_{create,update}_authorizer()
    libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
    libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
    libceph: introduce connection modes and ms_mode option
    libceph, rbd: ignore addr->type while comparing in some cases
    libceph, ceph: get and handle cluster maps with addrvecs
    libceph: factor out finish_auth()
    libceph: drop ac->ops->name field
    libceph: amend cephx init_protocol() and build_request()
    libceph, ceph: incorporate nautilus cephx changes
    libceph: safer en/decoding of cephx requests and replies
    libceph: more insight into ticket expiry and invalidation
    libceph: move msgr1 protocol specific fields to its own struct
    libceph: move msgr1 protocol implementation to its own file
    libceph: separate msgr1 protocol implementation
    libceph: export remaining protocol independent infrastructure
    libceph: export zero_page
    libceph: rename and export con->flags bits
    libceph: rename and export con->state states
    libceph: make con->state an int
    ...

    Linus Torvalds
     
  • syzkaller shows that packets can still be dequeued while taprio_destroy()
    is running. Let sch_taprio use the reset() function to cancel the advance
    timer and drop all skbs from the child qdiscs.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
    Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
    Signed-off-by: Davide Caratti
    Acked-by: Vinicius Costa Gomes
    Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti
     
  • When sendmsg() needs to wait for memory, the pending data
    is not updated. That causes a drift in forward memory allocation,
    leading to stall and/or warnings at socket close time.

    This change addresses the above issue by moving the pending data
    counter update inside the sendmsg() main loop.

    Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks")
    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • When multiple subflows are active, we can receive a
    window update on a subflow with no write space available.
    MPTCP will try to push frames on such a subflow and will
    fail. Pending frames will be pushed only after receiving
    a window update on a subflow with some wspace available.

    Overall the above could lead to suboptimal aggregate
    bandwidth usage.

    Instead, we should try to push pending frames as soon as
    the subflow reaches both conditions mentioned above.

    We can finally enable self-tests with asymmetric links,
    as the above makes them finally pass.

    Fixes: 6f8a612a33e4 ("mptcp: keep track of advertised windows right edge")
    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • MPTCP closes the subflows while holding the msk-level lock.
    While acquiring the subflow socket lock we need to use the
    correct nested annotation, or we can hit a lockdep splat
    at runtime.

    Reported-and-tested-by: Geliang Tang
    Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • Currently MPTCP is not propagating the security context
    from the ingress request socket to newly created msk
    at clone time.

    Address the issue by invoking the missing security helper.

    Fixes: cf7da0d66cc1 ("mptcp: Create SUBFLOW socket for incoming connections")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     

17 Dec, 2020

9 commits

  • This patch clears use_ack and use_map when dropping other suboptions to
    fix the following syzkaller BUG:

    [ 15.223006] BUG: unable to handle page fault for address: 0000000000223b10
    [ 15.223700] #PF: supervisor read access in kernel mode
    [ 15.224209] #PF: error_code(0x0000) - not-present page
    [ 15.224724] PGD b8d5067 P4D b8d5067 PUD c0a5067 PMD 0
    [ 15.225237] Oops: 0000 [#1] SMP
    [ 15.225556] CPU: 0 PID: 7747 Comm: syz-executor Not tainted 5.10.0-rc6+ #24
    [ 15.226281] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    [ 15.227292] RIP: 0010:skb_release_data+0x89/0x1e0
    [ 15.227816] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
    [ 15.229669] RSP: 0018:ffffc900019c7c08 EFLAGS: 00010293
    [ 15.230188] RAX: ffff88800daad900 RBX: 0000000000223b08 RCX: 0000000000000006
    [ 15.230895] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6dc700
    [ 15.231593] RBP: ffff88807f71a4c0 R08: 0000000000000001 R09: 0000000000000001
    [ 15.232299] R10: ffffc900019c7c18 R11: 0000000000000000 R12: ffff88807f71a4f0
    [ 15.233007] R13: 0000000000000000 R14: ffff88807f6dc700 R15: 0000000000000002
    [ 15.233714] FS: 00007f65d9b5f700(0000) GS:ffff88807c400000(0000) knlGS:0000000000000000
    [ 15.234509] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 15.235081] CR2: 0000000000223b10 CR3: 000000000b883000 CR4: 00000000000006f0
    [ 15.235788] Call Trace:
    [ 15.236042] skb_release_all+0x28/0x30
    [ 15.236419] __kfree_skb+0x11/0x20
    [ 15.236768] tcp_data_queue+0x270/0x1240
    [ 15.237161] ? tcp_urg+0x50/0x2a0
    [ 15.237496] tcp_rcv_established+0x39a/0x890
    [ 15.237997] ? mark_held_locks+0x49/0x70
    [ 15.238467] tcp_v4_do_rcv+0xb9/0x270
    [ 15.238915] __release_sock+0x8a/0x160
    [ 15.239365] release_sock+0x32/0xd0
    [ 15.239793] __inet_stream_connect+0x1d2/0x400
    [ 15.240313] ? do_wait_intr_irq+0x80/0x80
    [ 15.240791] inet_stream_connect+0x36/0x50
    [ 15.241275] mptcp_stream_connect+0x69/0x1b0
    [ 15.241787] __sys_connect+0x122/0x140
    [ 15.242236] ? syscall_enter_from_user_mode+0x17/0x50
    [ 15.242836] ? lockdep_hardirqs_on_prepare+0xd4/0x170
    [ 15.243436] __x64_sys_connect+0x1a/0x20
    [ 15.243924] do_syscall_64+0x33/0x40
    [ 15.244313] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 15.244821] RIP: 0033:0x7f65d946e469
    [ 15.245183] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
    [ 15.247019] RSP: 002b:00007f65d9b5eda8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    [ 15.247770] RAX: ffffffffffffffda RBX: 000000000049bf00 RCX: 00007f65d946e469
    [ 15.248471] RDX: 0000000000000010 RSI: 00000000200000c0 RDI: 0000000000000005
    [ 15.249205] RBP: 000000000049bf00 R08: 0000000000000000 R09: 0000000000000000
    [ 15.249908] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049bf0c
    [ 15.250603] R13: 00007fffe8a25cef R14: 00007f65d9b3f000 R15: 0000000000000003
    [ 15.251312] Modules linked in:
    [ 15.251626] CR2: 0000000000223b10
    [ 15.251965] BUG: kernel NULL pointer dereference, address: 0000000000000048
    [ 15.252005] ---[ end trace f5c51fe19123c773 ]---
    [ 15.252822] #PF: supervisor read access in kernel mode
    [ 15.252823] #PF: error_code(0x0000) - not-present page
    [ 15.252825] PGD c6c6067 P4D c6c6067 PUD c0d8067
    [ 15.253294] RIP: 0010:skb_release_data+0x89/0x1e0
    [ 15.253910] PMD 0
    [ 15.253914] Oops: 0000 [#2] SMP
    [ 15.253917] CPU: 1 PID: 7746 Comm: syz-executor Tainted: G D 5.10.0-rc6+ #24
    [ 15.253920] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    [ 15.254435] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
    [ 15.254899] RIP: 0010:skb_release_data+0x89/0x1e0
    [ 15.254902] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
    [ 15.254905] RSP: 0018:ffffc900019bfc08 EFLAGS: 00010293
    [ 15.255376] RSP: 0018:ffffc900019c7c08 EFLAGS: 00010293
    [ 15.255580]
    [ 15.255583] RAX: ffff888004a7ac80 RBX: 0000000000000040 RCX: 0000000000000000
    [ 15.255912]
    [ 15.256724] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6ddd00
    [ 15.257620] RAX: ffff88800daad900 RBX: 0000000000223b08 RCX: 0000000000000006
    [ 15.259817] RBP: ffff88800e9006c0 R08: 0000000000000000 R09: 0000000000000000
    [ 15.259818] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800e9006f0
    [ 15.259820] R13: 0000000000000000 R14: ffff88807f6ddd00 R15: 0000000000000002
    [ 15.259822] FS: 00007fae4a60a700(0000) GS:ffff88807c500000(0000) knlGS:0000000000000000
    [ 15.259826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 15.260296] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6dc700
    [ 15.262514] CR2: 0000000000000048 CR3: 000000000b89c000 CR4: 00000000000006e0
    [ 15.262515] Call Trace:
    [ 15.262519] skb_release_all+0x28/0x30
    [ 15.262523] __kfree_skb+0x11/0x20
    [ 15.263054] RBP: ffff88807f71a4c0 R08: 0000000000000001 R09: 0000000000000001
    [ 15.263680] tcp_data_queue+0x270/0x1240
    [ 15.263843] R10: ffffc900019c7c18 R11: 0000000000000000 R12: ffff88807f71a4f0
    [ 15.264693] ? tcp_urg+0x50/0x2a0
    [ 15.264856] R13: 0000000000000000 R14: ffff88807f6dc700 R15: 0000000000000002
    [ 15.265720] tcp_rcv_established+0x39a/0x890
    [ 15.266438] FS: 00007f65d9b5f700(0000) GS:ffff88807c400000(0000) knlGS:0000000000000000
    [ 15.267283] ? __schedule+0x3fa/0x880
    [ 15.267287] tcp_v4_do_rcv+0xb9/0x270
    [ 15.267290] __release_sock+0x8a/0x160
    [ 15.268049] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 15.268788] release_sock+0x32/0xd0
    [ 15.268791] __inet_stream_connect+0x1d2/0x400
    [ 15.268795] ? do_wait_intr_irq+0x80/0x80
    [ 15.269593] CR2: 0000000000223b10 CR3: 000000000b883000 CR4: 00000000000006f0
    [ 15.270246] inet_stream_connect+0x36/0x50
    [ 15.270250] mptcp_stream_connect+0x69/0x1b0
    [ 15.270253] __sys_connect+0x122/0x140
    [ 15.271097] Kernel panic - not syncing: Fatal exception
    [ 15.271820] ? syscall_enter_from_user_mode+0x17/0x50
    [ 15.283542] ? lockdep_hardirqs_on_prepare+0xd4/0x170
    [ 15.284275] __x64_sys_connect+0x1a/0x20
    [ 15.284853] do_syscall_64+0x33/0x40
    [ 15.285369] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 15.286105] RIP: 0033:0x7fae49f19469
    [ 15.286638] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
    [ 15.289295] RSP: 002b:00007fae4a609da8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    [ 15.290375] RAX: ffffffffffffffda RBX: 000000000049bf00 RCX: 00007fae49f19469
    [ 15.291403] RDX: 0000000000000010 RSI: 00000000200000c0 RDI: 0000000000000005
    [ 15.292437] RBP: 000000000049bf00 R08: 0000000000000000 R09: 0000000000000000
    [ 15.293456] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049bf0c
    [ 15.294473] R13: 00007fff0004b6bf R14: 00007fae4a5ea000 R15: 0000000000000003
    [ 15.295492] Modules linked in:
    [ 15.295944] CR2: 0000000000000048
    [ 15.296567] Kernel Offset: disabled
    [ 15.296941] ---[ end Kernel panic - not syncing: Fatal exception ]---

    Reported-by: Christoph Paasch
    Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet")
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Link: https://lore.kernel.org/r/ccca4e8f01457a1b495c5d612ed16c5f7a585706.1608010058.git.geliangtang@gmail.com
    Signed-off-by: Jakub Kicinski

    Geliang Tang
     
  • Pull rdma updates from Jason Gunthorpe:
    "A smaller set of patches, nothing stands out as being particularly
    major this cycle. The biggest item would be the new HIP09 HW support
    from HNS, otherwise it was pretty quiet for new work here:

    - Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw,
    cxgb4, mlx4 and mlx5

    - Bug fixes and polishing for the new rts ULP

    - Cleanup of uverbs checking for allowed driver operations

    - Use sysfs_emit all over the place

    - Lots of bug fixes and clarity improvements for hns

    - hip09 support for hns

    - NDR and 50/100Gb signaling rates

    - Remove dma_virt_ops and go back to using the IB DMA wrappers

    - mlx5 optimizations for contiguous DMA regions"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (147 commits)
    RDMA/cma: Don't overwrite sgid_attr after device is released
    RDMA/mlx5: Fix MR cache memory leak
    RDMA/rxe: Use acquire/release for memory ordering
    RDMA/hns: Simplify AEQE process for different types of queue
    RDMA/hns: Fix inaccurate prints
    RDMA/hns: Fix incorrect symbol types
    RDMA/hns: Clear redundant variable initialization
    RDMA/hns: Fix coding style issues
    RDMA/hns: Remove unnecessary access right set during INIT2INIT
    RDMA/hns: WARN_ON if get a reserved sl from users
    RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
    RDMA/hns: Do shift on traffic class when using RoCEv2
    RDMA/hns: Normalization the judgment of some features
    RDMA/hns: Limit the length of data copied between kernel and userspace
    RDMA/mlx4: Remove bogus dev_base_lock usage
    RDMA/uverbs: Fix incorrect variable type
    RDMA/core: Do not indicate device ready when device enablement fails
    RDMA/core: Clean up cq pool mechanism
    RDMA/core: Update kernel documentation for ib_create_named_qp()
    MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address
    ...

    Linus Torvalds
     
  • The parent of an ib device is used to retrieve the PCI device
    attributes. It turns out that there are possible cases when an ib device
    has no parent set in the device structure, which may lead to page
    faults when trying to access this memory.
    Fix that by checking the parent pointer and consolidate the pci device
    specific processing in a new function.

    Fixes: a3db10efcc4c ("net/smc: Add support for obtaining SMCR device list")
    Reported-by: syzbot+600fef7c414ee7e2d71b@syzkaller.appspotmail.com
    Signed-off-by: Karsten Graul
    Link: https://lore.kernel.org/r/20201215091058.49354-2-kgraul@linux.ibm.com
    Signed-off-by: Jakub Kicinski

    Karsten Graul
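
The parent-pointer check described in the entry above can be sketched roughly as follows. This is an illustration under stated assumptions, not the net/smc code: the sketch_* types and the helper name are hypothetical, and the parent is modelled as a plain struct carrying two ID fields.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the parent (PCI) device and the ib device. */
struct sketch_parent {
    unsigned short vendor;
    unsigned short device;
};

struct sketch_ibdev {
    const struct sketch_parent *parent;   /* may legitimately be NULL */
};

struct sketch_pci_info {
    unsigned short vendor;
    unsigned short device;
    int valid;                            /* 1 only when a parent was present */
};

/*
 * All parent-specific processing lives in one helper, and the helper
 * checks the parent pointer before dereferencing it.
 */
static void sketch_fill_pci_info(const struct sketch_ibdev *dev,
                                 struct sketch_pci_info *out)
{
    out->vendor = 0;
    out->device = 0;
    out->valid = 0;

    if (!dev->parent)
        return;                           /* no parent: report nothing */

    out->vendor = dev->parent->vendor;
    out->device = dev->parent->device;
    out->valid = 1;
}
```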
     
  • Fix two error paths in ethnl_set_channels() to avoid lock-up caused
    by unreleased RTNL.

    Fixes: e19c591eafad ("ethtool: set device channel counts with CHANNELS_SET request")
    Reported-by: LiLiang
    Signed-off-by: Ivan Vecera
    Reviewed-by: Michal Kubecek
    Link: https://lore.kernel.org/r/20201215090810.801777-1-ivecera@redhat.com
    Signed-off-by: Jakub Kicinski

    Ivan Vecera
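
The class of bug fixed in the entry above (an error path that returns while a lock is still held) is usually avoided by routing every failure through a single unlock label. A minimal sketch, assuming a plain flag as a stand-in for RTNL; the sketch_* names and the validation rules are hypothetical and do not mirror ethnl_set_channels().

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a global lock such as RTNL; a flag keeps the sketch testable. */
static bool sketch_lock_held;

static void sketch_lock(void)   { sketch_lock_held = true; }
static void sketch_unlock(void) { sketch_lock_held = false; }

/*
 * Hypothetical "set" handler. Every failure after sketch_lock() jumps to
 * out_unlock instead of returning directly, so the lock is always released.
 */
static int sketch_set_channels(int requested, int max)
{
    int ret;

    sketch_lock();

    if (requested <= 0) {
        ret = -EINVAL;
        goto out_unlock;
    }
    if (requested > max) {
        ret = -EINVAL;
        goto out_unlock;
    }
    ret = 0;    /* a real handler would apply the new setting here */

out_unlock:
    sketch_unlock();
    return ret;
}
```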
     
  • When we initialize nlmsghdr with no payload inside tipc_nl_compat_dumpit()
    the parsing function returns -EINVAL. We fix it by making the parsing call
    conditional.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Link: https://lore.kernel.org/r/20201215033151.76139-1-hoang.h.le@dektech.com.au
    Signed-off-by: Jakub Kicinski

    Hoang Le
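
"Making the parsing call conditional", as described in the entry above, can be sketched like this. The parser and handler below are hypothetical stand-ins, not the TIPC or netlink functions; the only assumption carried over from the entry is that the parser returns -EINVAL for an empty payload.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical strict parser: it rejects an empty buffer. */
static int sketch_parse_attrs(const unsigned char *payload, size_t len)
{
    if (payload == NULL || len == 0)
        return -EINVAL;
    return 0;   /* a real parser would walk the attributes here */
}

/* Dump handler: only parse when the request actually carries a payload. */
static int sketch_dumpit(const unsigned char *payload, size_t len)
{
    if (len > 0) {
        int err = sketch_parse_attrs(payload, len);

        if (err)
            return err;
    }
    return 0;   /* an empty request proceeds without parsed attributes */
}
```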
     
  • Pull io_uring updates from Jens Axboe:
    "Fairly light set of changes this time around, and mostly some bits
    that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a
    few features. In particular:

    - Cleanups around iovec import (David Laight, Pavel)

    - Add timeout support for io_uring_enter(2), which enables us to
    clean up liburing and avoid a timeout sqe submission in the
    completion path.

    The big win here is that it allows setups that split SQ and CQ
    handling into separate threads to avoid locking, as the CQ side
    will no longer submit when timeouts are needed when waiting for
    events (Hao Xu)

    - Add support for socket shutdown, and renameat/unlinkat.

    - SQPOLL cleanups and improvements (Xiaoguang Wang)

    - Allow SQPOLL setups for CAP_SYS_NICE, and enable regular
    (non-fixed) files to be used.

    - Cancelation improvements (Pavel)

    - Fixed file reference improvements (Pavel)

    - IOPOLL related race fixes (Pavel)

    - Lots of other little fixes and cleanups (mostly Pavel)"

    * tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits)
    io_uring: fix io_cqring_events()'s noflush
    io_uring: fix racy IOPOLL flush overflow
    io_uring: fix racy IOPOLL completions
    io_uring: always let io_iopoll_complete() complete polled io
    io_uring: add timeout update
    io_uring: restructure io_timeout_cancel()
    io_uring: fix files cancellation
    io_uring: use bottom half safe lock for fixed file data
    io_uring: fix miscounting ios_left
    io_uring: change submit file state invariant
    io_uring: check kthread stopped flag when sq thread is unparked
    io_uring: share fixed_file_refs b/w multiple rsrcs
    io_uring: replace inflight_wait with tctx->wait
    io_uring: don't take fs for recvmsg/sendmsg
    io_uring: only wake up sq thread while current task is in io worker context
    io_uring: don't acquire uring_lock twice
    io_uring: initialize 'timeout' properly in io_sq_thread()
    io_uring: refactor io_sq_thread() handling
    io_uring: always batch cancel in *cancel_files()
    io_uring: pass files into kill timeouts/poll
    ...

    Linus Torvalds
     
  • There are some use cases for netdev_notify_peers in contexts
    where the rtnl lock is already held. Introduce a lockless version
    of the netdev_notify_peers call to save the extra code to call
        call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
        call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
    After that, convert netdev_notify_peers to call the new helper.

    Suggested-by: Nathan Lynch
    Signed-off-by: Lijun Pan
    Signed-off-by: Jakub Kicinski

    Lijun Pan
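
The split described in the entry above (a helper that assumes the lock is held, plus a wrapper that takes the lock and calls it) can be sketched as follows. All sketch_* names are hypothetical; a depth counter stands in for the rtnl lock and an event counter stands in for the two notifier calls.

```c
/* Stand-ins: lock depth instead of rtnl, a counter instead of notifier calls. */
static int sketch_lock_depth;
static int sketch_events_sent;

/* Caller must already hold the lock; this helper never takes it. */
static void sketch_notify_peers_locked(void)
{
    sketch_events_sent += 2;    /* two notifications, as in the entry above */
}

/* Convenience wrapper for callers that do not hold the lock yet. */
static void sketch_notify_peers(void)
{
    sketch_lock_depth++;        /* "take" the lock */
    sketch_notify_peers_locked();
    sketch_lock_depth--;        /* "release" the lock */
}
```

A caller that already holds the lock uses sketch_notify_peers_locked() directly; calling the wrapper there would try to take the lock a second time.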
     
  • Syzbot reported a shift of a u32 by more than 31 in strset_parse_request()
    which is undefined behavior. This is caused by range check of string set id
    using variable ret (which is always 0 at this point) instead of id (string
    set id from request).

    Fixes: 71921690f974 ("ethtool: provide string sets with STRSET_GET request")
    Reported-by: syzbot+96523fb438937cd01220@syzkaller.appspotmail.com
    Signed-off-by: Michal Kubecek
    Link: https://lore.kernel.org/r/b54ed5c5fd972a59afea3e1badfb36d86df68799.1607952208.git.mkubecek@suse.cz
    Signed-off-by: Jakub Kicinski

    Michal Kubecek
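
The range check described in the entry above can be illustrated with a small sketch: the value that must be validated before the shift is the id taken from the request, not an unrelated status variable. The function name and SKETCH_STRSET_COUNT are hypothetical and do not reflect the kernel's actual limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of known string sets; not the kernel's value. */
#define SKETCH_STRSET_COUNT 16u

/*
 * Set the bit for string set `id` in *mask.
 * Returns false and leaves *mask untouched when id is out of range,
 * so the shift below is never by 32 or more.
 */
static bool sketch_request_strset(uint32_t *mask, uint32_t id)
{
    if (id >= SKETCH_STRSET_COUNT)  /* validate the id itself */
        return false;

    *mask |= (uint32_t)1 << id;
    return true;
}
```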
     
  • Pull selinux updates from Paul Moore:
    "While we have a small number of SELinux patches for v5.11, there are a
    few changes worth highlighting:

    - Change the LSM network hooks to pass flowi_common structs instead
    of the parent flowi struct as the LSMs do not currently need the
    full flowi struct and they do not have enough information to use it
    safely (missing information on the address family).

    This patch was discussed both with Herbert Xu (representing team
    netdev) and James Morris (representing team
    LSMs-other-than-SELinux).

    - Fix how we handle errors in inode_doinit_with_dentry() so that we
    attempt to properly label the inode on following lookups instead of
    continuing to treat it as unlabeled.

    - Tweak the kernel logic around allowx, auditallowx, and dontauditx
    SELinux policy statements such that the auditx/dontauditx are
    effective even without the allowx statement.

    Everything passes our test suite"

    * tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
    selinux: Fix fall-through warnings for Clang
    selinux: drop super_block backpointer from superblock_security_struct
    selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
    selinux: allow dontauditx and auditallowx rules to take effect without allowx
    selinux: fix error initialization in inode_doinit_with_dentry()

    Linus Torvalds
     

16 Dec, 2020

3 commits

  • Pull nfsd updates from Chuck Lever:
    "Several substantial changes this time around:

    - Previously, exporting an NFS mount via NFSD was considered to be an
    unsupported feature. With v5.11, the community has attempted to
    make re-exporting a first-class feature of NFSD.

    This would enable the Linux in-kernel NFS server to be used as an
    intermediate cache for a remotely-located primary NFS server, for
    example, even with other NFS server implementations, like a NetApp
    filer, as the primary.

    - A short series of patches brings support for multiple RPC/RDMA data
    chunks per RPC transaction to the Linux NFS server's RPC/RDMA
    transport implementation.

    This is a part of the RPC/RDMA spec that the other premier
    NFS/RDMA implementation (Solaris) has had for a very long time, and
    completes the implementation of RPC/RDMA version 1 in the Linux
    kernel's NFS server.

    - Long ago, NFSv4 support was introduced to NFSD using a series of C
    macros that hid dprintk's and goto's. Over time, the kernel's XDR
    implementation has been greatly improved, but these C macros have
    remained and become fallow. A series of patches in this pull
    request completely replaces those macros with the use of current
    kernel XDR infrastructure. Benefits include:

    - More robust input sanitization in NFSD's NFSv4 XDR decoders.

    - Make it easier to use common kernel library functions that use
    XDR stream APIs (for example, GSS-API).

    - Align the structure of the source code with the RFCs so it is
    easier to learn, verify, and maintain our XDR implementation.

    - Removal of more than a hundred hidden dprintk() call sites.

    - Removal of some explicit manipulation of pages to help make the
    eventual transition to xdr->bvec smoother.

    - On top of several related fixes in 5.10-rc, there are a few more
    fixes to get the Linux NFSD implementation of NFSv4.2 inter-server
    copy up to speed.

    And as usual, there is a pinch of seasoning in the form of a
    collection of unrelated minor bug fixes and clean-ups.

    Many thanks to all who contributed this time around!"

    * tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6: (131 commits)
    nfsd: Record NFSv4 pre/post-op attributes as non-atomic
    nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
    nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
    exportfs: Add a function to return the raw output from fh_to_dentry()
    nfsd: close cached files prior to a REMOVE or RENAME that would replace target
    nfsd: allow filesystems to opt out of subtree checking
    nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
    Revert "nfsd4: support change_attr_type attribute"
    nfsd4: don't query change attribute in v2/v3 case
    nfsd: minor nfsd4_change_attribute cleanup
    nfsd: simplify nfsd4_change_info
    nfsd: only call inode_query_iversion in the I_VERSION case
    nfs_common: need lock during iterate through the list
    NFSD: Fix 5 seconds delay when doing inter server copy
    NFSD: Fix sparse warning in nfs4proc.c
    SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
    sunrpc: clean-up cache downcall
    nfsd: Fix message level for normal termination
    NFSD: Remove macros that are no longer used
    NFSD: Replace READ* macros in nfsd4_decode_compound()
    ...

    Linus Torvalds
     
  • NFSoRDMA Client updates for Linux 5.11

    Cleanups and improvements:
    - Remove use of raw kernel memory addresses in tracepoints
    - Replace dprintk() call sites in ERR_CHUNK path
    - Trace unmap sync calls
    - Optimize MR DMA-unmapping

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

    - rtw88: major bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     

15 Dec, 2020

19 commits

  • proc_fs was used, in af_packet, without a surrounding #ifdef,
    although there is no hard dependency on proc_fs.
    That caused the initialization of the af_packet module to fail
    when CONFIG_PROC_FS=n.

    Specifically, proc_create_net() was used in af_packet.c,
    and when it fails, packet_net_init() returns -ENOMEM.
    It always fails when the kernel is compiled without proc_fs,
    because proc_create_net(), for example, then always returns NULL.

    The calling order that starts in af_packet.c is as follows:
    packet_init()
    register_pernet_subsys()
    register_pernet_operations()
    __register_pernet_operations()
    ops_init()
    ops->init() (packet_net_ops.init=packet_net_init())
    proc_create_net()

    It worked in the past because register_pernet_subsys()'s return value
    was not checked before commit 36096f2f4fa0 ("packet: Fix error path in
    packet_init").
    The call always returned an error, but since that error was ignored,
    everything kept working even when CONFIG_PROC_FS=n.

    The fix here is simply to add the necessary #ifdef.

    This also fixes a similar error in tls_proc.c, that was found by Jakub
    Kicinski.
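
The failure mode described above can be sketched in userspace C. This is a simplified model, not the kernel code: the `model_*` names are hypothetical, and the compile-time `#ifdef CONFIG_PROC_FS` guard is represented by a runtime flag so that both configurations can be exercised in one build.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for proc_create_net(): without procfs the
 * call always yields NULL, as the commit message describes. */
void *model_proc_create_net(bool have_procfs)
{
    static int entry; /* dummy object standing in for a proc entry */
    return have_procfs ? (void *)&entry : NULL;
}

/* Before the fix: a NULL proc entry always fails per-net init. */
int model_net_init_unguarded(bool have_procfs)
{
    if (!model_proc_create_net(have_procfs))
        return -ENOMEM;
    return 0;
}

/* After the fix: the proc entry is created and checked only when
 * procfs is configured (an #ifdef in the kernel, a flag here). */
int model_net_init_guarded(bool have_procfs)
{
    if (have_procfs && !model_proc_create_net(have_procfs))
        return -ENOMEM;
    return 0;
}
```

In this model, only the unguarded variant reports -ENOMEM when procfs is absent; with procfs present both variants succeed.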

    Fixes: d26b698dd3cd ("net/tls: add skeleton of MIB statistics")
    Fixes: 36096f2f4fa0 ("packet: Fix error path in packet_init")
    Signed-off-by: Yonatan Linik
    Signed-off-by: Jakub Kicinski

    Yonatan Linik
     
  • The vsock flags field can be set in the connect path (user space app)
    and the (listen) receive path (kernel space logic).

    When the vsock transport is assigned, the remote CID is used to
    distinguish between types of connection.

    Use the vsock flags value (in addition to the CID) from the remote
    address to decide which vsock transport to assign. For the sibling VMs
    use case, all the vsock packets need to be forwarded to the host, so
    always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag
    is set. For the other use cases, the vsock transport assignment logic is
    not changed.
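
A minimal sketch of the decision described above, assuming only two transports. The `model_*` names are hypothetical, the constant values are the ones I believe the vsock UAPI header uses, and the real assignment logic also handles cases (such as the local transport) that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define VMADDR_CID_HOST     2u    /* well-known host CID (assumed value) */
#define VMADDR_FLAG_TO_HOST 0x01u /* flag added by this series (assumed value) */

enum model_transport { MODEL_G2H, MODEL_H2G };

/* Pick guest->host when the remote is the host or hypervisor, or when the
 * remote address carries the TO_HOST flag; otherwise pick host->guest. */
enum model_transport model_assign_transport(uint32_t remote_cid,
                                            uint8_t remote_flags)
{
    /* Bitwise test of the flag, not a comparison against the flag value. */
    if (remote_cid <= VMADDR_CID_HOST || (remote_flags & VMADDR_FLAG_TO_HOST))
        return MODEL_G2H;
    return MODEL_H2G;
}
```

With the flag clear, a remote CID above VMADDR_CID_HOST still maps to the host->guest transport, which matches the statement that the other use cases are unchanged.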

    Changelog

    v3 -> v4

    * Update the "remote_flags" local variable type to reflect the change of
    the "svm_flags" field to be 1 byte in size.

    v2 -> v3

    * Update bitwise check logic to not compare result to the flag value.

    v1 -> v2

    * Use bitwise operator to check the vsock flag.
    * Use the updated "VMADDR_FLAG_TO_HOST" flag naming.
    * Merge the checks for the g2h transport assignment in one "if" block.

    Signed-off-by: Andra Paraschiv
    Reviewed-by: Stefano Garzarella
    Signed-off-by: Jakub Kicinski

    Andra Paraschiv
     
  • The vsock flags can be set during the connect() setup logic, when
    initializing the vsock address data structure variable. Then the vsock
    transport is assigned, also considering this flags field.

    The vsock transport is also assigned on the (listen) receive path. The
    flags field needs to be set considering the use case.

    Set the value of the vsock flags of the remote address to the one
    targeted for packets forwarding to the host, if the following conditions
    are met:

    * The source CID of the packet is higher than VMADDR_CID_HOST.
    * The destination CID of the packet is higher than VMADDR_CID_HOST.
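
The two conditions above can be expressed as a small helper. This is an illustrative sketch: `model_rx_remote_flags` is a hypothetical name, and the constant values are assumed to match the vsock UAPI header.

```c
#include <assert.h>
#include <stdint.h>

#define VMADDR_CID_HOST     2u    /* assumed value */
#define VMADDR_FLAG_TO_HOST 0x01u /* assumed value */

/* Return the remote-address flags to use on the receive path: the
 * TO_HOST bit is added only when both CIDs are above VMADDR_CID_HOST. */
uint8_t model_rx_remote_flags(uint32_t src_cid, uint32_t dst_cid,
                              uint8_t flags)
{
    if (src_cid > VMADDR_CID_HOST && dst_cid > VMADDR_CID_HOST)
        flags |= VMADDR_FLAG_TO_HOST;
    return flags;
}
```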

    Changelog

    v3 -> v4

    * No changes.

    v2 -> v3

    * No changes.

    v1 -> v2

    * Set the vsock flag on the receive path in the vsock transport
    assignment logic.
    * Use bitwise operator for the vsock flag setup.
    * Use the updated "VMADDR_FLAG_TO_HOST" flag naming.

    Signed-off-by: Andra Paraschiv
    Reviewed-by: Stefano Garzarella
    Signed-off-by: Jakub Kicinski

    Andra Paraschiv
     
  • Check if the provided flags value from the vsock address data structure
    includes the supported flags in the corresponding kernel version.

    The first byte of the "svm_zero" field is used as "svm_flags", so add
    the flags check instead.
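
One way to sketch such a check is to reject any bit outside the supported set. The names below are hypothetical and the supported mask is assumed to contain only the TO_HOST flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_SUPPORTED_FLAGS 0x01u /* assumed: only VMADDR_FLAG_TO_HOST */

/* Accept the flags byte only if no unsupported bit is set. */
int model_check_flags(uint8_t svm_flags)
{
    if (svm_flags & ~MODEL_SUPPORTED_FLAGS)
        return -EINVAL;
    return 0;
}
```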

    Changelog

    v3 -> v4

    * New patch in v4.

    Signed-off-by: Andra Paraschiv
    Reviewed-by: Stefano Garzarella
    Signed-off-by: Jakub Kicinski

    Andra Paraschiv
     
  • With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
    logically done when HW_CSUM offload is off.

    Fixes: 2342a8512a1e ("net: Add TLS TX offload features")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
  • There are cases where a fastopen SYN may trigger either an ICMP_TOOBIG
    message in the case of IPv6 or a fragmentation request in the case of
    IPv4. This results in the socket stalling for a second or more as it does
    not respond to the message by retransmitting the SYN frame.

    Normally a SYN frame should not be able to trigger an ICMP_TOOBIG or
    ICMP_FRAG_NEEDED; however, in the case of fastopen we can have a frame that
    makes use of the entire MSS. In the case of fastopen it does, and an
    additional complication is that the retransmit queue doesn't contain the
    original frames. As a result when tcp_simple_retransmit is called and
    walks the list of frames in the queue it may not mark the frames as lost
    because both the SYN and the data packet each individually are smaller than
    the MSS size after the adjustment. This results in the socket being stalled
    until the retransmit timer kicks in and forces the SYN frame out again
    without the data attached.

    In order to resolve this we can reduce the MSS the packets are compared
    to in tcp_simple_retransmit to -1 for cases where we are still in the
    TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
    the packets related to the fastopen SYN as lost.
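
The effect of comparing against -1 can be illustrated with a toy model. `model_mark_lost` and its parameters are hypothetical; the sketch only counts which segment lengths exceed the comparison value and does not model the real retransmit queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the frames that would be marked lost: those whose length
 * exceeds the MSS used for the comparison. */
size_t model_mark_lost(const int *seg_len, size_t n, int cur_mss,
                       bool fastopen_syn_sent)
{
    /* For a fastopen socket still in SYN_SENT, compare against -1 so
     * that every frame, even a zero-length SYN, counts as too large. */
    int mss = fastopen_syn_sent ? -1 : cur_mss;
    size_t lost = 0;

    for (size_t i = 0; i < n; i++) {
        if (seg_len[i] > mss)
            lost++;
    }
    return lost;
}
```

With a zero-length SYN and a data frame that each fit within the adjusted MSS, the ordinary comparison marks nothing, while the -1 comparison marks both.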

    Signed-off-by: Alexander Duyck
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain
    Signed-off-by: Jakub Kicinski

    Alexander Duyck
     
  • syzbot reproduces BUG_ON in skb_checksum_help():
    tun creates a (bogus) skb with a huge partial-checksummed area and
    a small IP packet inside. ip_rcv then trims the skb based on the size
    of the inner IP packet, after which the csum offset points beyond the
    end of the trimmed skb. checksum_tg(), called via a netfilter hook,
    then triggers the BUG_ON:

    offset = skb_checksum_start_offset(skb);
    BUG_ON(offset >= skb_headlen(skb));

    To work around the problem, this patch forces pskb_trim_rcsum_slow()
    to return -EINVAL in the described scenario. This allows its callers
    to drop such packets.
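
A rough model of the kind of check this implies is shown below. The function name, its parameters, and the exact bound (including room for a 16-bit checksum field) are assumptions for illustration, not a copy of the patch.

```c
#include <assert.h>
#include <errno.h>

/* Reject a trim that would leave the partial-checksum field outside
 * the linear data that remains after trimming to new_len. */
int model_trim_check(unsigned int new_len, unsigned int headlen,
                     unsigned int csum_start_off, unsigned int csum_field_off)
{
    /* Linear bytes still available once the skb is trimmed. */
    unsigned int hdlen = new_len > headlen ? headlen : new_len;
    /* One past the last byte of the 2-byte checksum field. */
    unsigned int end = csum_start_off + csum_field_off + 2;

    if (end > hdlen)
        return -EINVAL;
    return 0;
}
```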

    Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
    Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
    Signed-off-by: Vasily Averin
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
    Signed-off-by: Jakub Kicinski

    Vasily Averin
     
  • Pull scheduler updates from Thomas Gleixner:

    - migrate_disable/enable() support which originates from the RT tree
    and is now a prerequisite for the new preemptible kmap_local() API
    which aims to replace kmap_atomic().

    - A fair amount of topology and NUMA related improvements

    - Improvements for the frequency invariant calculations

    - Enhanced robustness for the global CPU priority tracking and decision
    making

    - The usual small fixes and enhancements all over the place

    * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    sched/fair: Trivial correction of the newidle_balance() comment
    sched/fair: Clear SMT siblings after determining the core is not idle
    sched: Fix kernel-doc markup
    x86: Print ratio freq_max/freq_base used in frequency invariance calculations
    x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
    x86, sched: Calculate frequency invariance for AMD systems
    irq_work: Optimize irq_work_single()
    smp: Cleanup smp_call_function*()
    irq_work: Cleanup
    sched: Limit the amount of NUMA imbalance that can exist at fork time
    sched/numa: Allow a floating imbalance between NUMA nodes
    sched: Avoid unnecessary calculation of load imbalance at clone time
    sched/numa: Rename nr_running and break out the magic number
    sched: Make migrate_disable/enable() independent of RT
    sched/topology: Condition EAS enablement on FIE support
    arm64: Rebuild sched domains on invariance status changes
    sched/topology,schedutil: Wrap sched domains rebuild
    sched/uclamp: Allow to reset a task uclamp constraint value
    sched/core: Fix typos in comments
    Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
    ...

    Linus Torvalds
     
  • I got a warning report:

    br_sysfs_addbr: can't create group bridge4/bridge
    ------------[ cut here ]------------
    sysfs group 'bridge' not found for kobject 'bridge4'
    WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline]
    WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270
    Modules linked in: iptable_nat
    ...
    Call Trace:
    br_dev_delete+0x112/0x190 net/bridge/br_if.c:384
    br_dev_newlink net/bridge/br_netlink.c:1381 [inline]
    br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362
    __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441
    rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500
    rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562
    netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494
    netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
    netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330
    netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919
    sock_sendmsg_nosec net/socket.c:651 [inline]
    sock_sendmsg+0x139/0x170 net/socket.c:671
    ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353
    ___sys_sendmsg+0xf8/0x170 net/socket.c:2407
    __sys_sendmsg+0xd3/0x190 net/socket.c:2440
    do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    In br_device_event(), if the bridge sysfs group fails to be added,
    br_device_event() should return an error. This prevents the warning
    when removing a bridge sysfs group that does not exist.

    Fixes: bb900b27a2f4 ("bridge: allow creating bridge devices with netlink")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Tested-by: Nikolay Aleksandrov
    Acked-by: Nikolay Aleksandrov
    Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com
    Signed-off-by: Jakub Kicinski

    Wang Hai
     
  • Currently the xmit path of the MPTCP protocol creates
    smaller-than-max-size skbs, which is suboptimal for performance.

    There are a few things to improve:
    - when coalescing to an existing skb, we must clear the PUSH flag
    - tcp_build_frag() expects the available space as an argument.
    When coalescing is enabled, MPTCP has already subtracted the
    to-be-coalesced skb len. We must increment said argument
    accordingly.

    Before:
    ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
    [...]
    131072 16384 16384 30.00 24414.86

    After:
    ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
    [...]
    131072 16384 16384 30.05 28357.69

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • There is no need to unconditionally acquire the join list
    lock, we can simply splice the join list into the subflow
    list and traverse only the latter.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • Parse the MPTCP FASTCLOSE subtype.

    If provided key matches the local one, schedule the work queue to close
    (with tcp reset) all subflows.

    The MPTCP socket moves to closed state immediately.

    Reviewed-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     
  • Because TCP-level resets only affect the subflow, there is an MPTCP
    option to indicate that the MPTCP-level connection should be closed
    immediately without an MPTCP-level FIN exchange.

    This is the 'MPTCP fast close option'. It can be carried on ack
    segments or TCP resets. In the latter case, MPTCP options must also
    be parsed for reset packets so that MPTCP can act accordingly.

    Next patch will add receive side fastclose support in MPTCP.

    Acked-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     
  • When processing options from the TCP reset path, it is possible that
    tcp_done(ssk) drops the last reference on the mptcp socket, which
    results in a use-after-free.

    Reviewed-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     
  • Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct
    mptcp_options_received.

    Signed-off-by: Geliang Tang
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Geliang Tang
     
  • When the PM netlink flushes the addresses, invoke the remove address
    function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses
    and the subflows. Since this function should not be invoked under lock,
    move __flush_addrs out of the pernet->lock.

    Acked-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Geliang Tang
     
  • It has been observed that the kernel sockets created for the subflows
    (except the first one) are not in the same cgroup as their parents.
    That's because the additional subflows are created by kernel workers.

    This is a problem because eBPF programs attached to the parent's
    cgroup won't be executed for the children. The same applies to any
    other cgroup feature linked to a sk.

    This patch fixes this behaviour.

    As the subflow sockets are created by the kernel, we can't use
    'mem_cgroup_sk_alloc' because of the current context being the one of the
    kworker. This is why we have to do low level memcg manipulation, if
    required.

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Acked-by: Matthieu Baerts
    Signed-off-by: Nicolas Rybowski
    Signed-off-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Nicolas Rybowski
     
  • Currently, the exception actions are not processed correctly as the wrong
    dataset is passed. This change fixes this, including the misleading
    comment.

    In addition, a check was added to make sure we work on an IPv4 packet,
    and not just assume if it's not IPv6 it's IPv4.

    This was all tested using OVS with patch,
    https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639,
    applied and sending packets with a TTL of 1 (and 0), both with IPv4
    and IPv6.

    Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format")
    Signed-off-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
    Signed-off-by: Jakub Kicinski

    Eelco Chaudron
     
  • Pull misc fixes from Christian Brauner:
    "This contains several fixes which felt worth being combined into a
    single branch:

    - Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

    - Kirill's work to unify lifecycle management for all namespaces. The
    lifetime counters are used identically for all namespace types.
    Namespaces may of course have additional unrelated counters and
    these are not altered. This work allows us to unify the type of the
    counters and reduces maintenance cost by moving the counter into one
    place and indicating that basic lifetime management is identical
    for all namespaces.

    - Peilin's fix adding three byte padding to Dmitry's
    PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

    - Two small patches to convert from the /* fall through */ comment
    annotation to the fallthrough keyword annotation which I had taken
    into my branch and into -next before df561f6688fe ("treewide: Use
    fallthrough pseudo-keyword") made it upstream which fixed this
    tree-wide.

    Since I didn't want to invalidate all testing for other commits I
    didn't rebase and kept them"
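
The idea of a lifetime counter shared by all namespace types can be sketched as an embedded common struct with one set of get/put helpers. Everything named `model_*` is hypothetical, and a plain int stands in for the kernel's reference counter type in this single-threaded illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Common part embedded in every namespace type. */
struct model_ns_common {
    int count; /* lifetime counter; plain int in this model */
};

/* Two namespace types sharing the same lifetime counter layout. */
struct model_uts_ns { struct model_ns_common ns; const char *nodename; };
struct model_pid_ns { struct model_ns_common ns; int level; };

/* A freshly created namespace starts with one reference. */
void model_ns_init(struct model_ns_common *ns)
{
    ns->count = 1;
}

/* Take a reference; only valid while the namespace is still alive. */
void model_ns_get(struct model_ns_common *ns)
{
    assert(ns->count > 0);
    ns->count++;
}

/* Drop a reference; returns true when the caller should free. */
bool model_ns_put(struct model_ns_common *ns)
{
    assert(ns->count > 0);
    return --ns->count == 0;
}
```

Because the helpers take the common struct, the same get/put code serves every namespace type, while any additional per-type counters stay outside it.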

    * tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    nsproxy: use put_nsproxy() in switch_task_namespaces()
    sys: Convert to the new fallthrough notation
    signal: Convert to the new fallthrough notation
    time: Use generic ns_common::count
    cgroup: Use generic ns_common::count
    mnt: Use generic ns_common::count
    user: Use generic ns_common::count
    pid: Use generic ns_common::count
    ipc: Use generic ns_common::count
    uts: Use generic ns_common::count
    net: Use generic ns_common::count
    ns: Add a common refcount into ns_common
    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

    Linus Torvalds