14 Feb, 2020

1 commit

  • The @nents value that was passed to ib_dma_map_sg() has to be passed
    to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() chooses to
    concatenate sg entries, it will return a different nents value than
    it was passed.

    The bug was exposed by recent changes to the AMD IOMMU driver, which
    enabled sg entry concatenation.

    Looking all the way back to commit 4143f34e01e9 ("xprtrdma: Port to
    new memory registration API") and reviewing other kernel ULPs, it's
    not clear that the frwr_map() logic was ever correct for this case.

    Reported-by: Andre Tomt
    Suggested-by: Robin Murphy
    Signed-off-by: Chuck Lever
    Cc: stable@vger.kernel.org
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Anna Schumaker

    Chuck Lever
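
The nents rule described in the commit above can be illustrated with a small user-space model. This is only a sketch: toy_map_sg() and toy_unmap_sg() are hypothetical stand-ins for ib_dma_map_sg() and ib_dma_unmap_sg(), not the kernel API, and the merge rule (join entries whose addresses are contiguous) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy scatterlist entry: a bus address and a length. */
struct toy_sg {
    unsigned long addr;
    unsigned long len;
};

/* What a caller has to remember about one mapping. */
struct toy_mr {
    int nents;      /* entries handed to the map call */
    int dma_nents;  /* entries the map call returned */
};

/*
 * Hypothetical stand-in for ib_dma_map_sg(): copies @in to @out and
 * concatenates entries that are address-contiguous. Returns the number
 * of mapped entries, which can be smaller than @nents.
 */
int toy_map_sg(const struct toy_sg *in, int nents, struct toy_sg *out)
{
    int i, mapped = 0;

    assert(in != NULL && out != NULL && nents >= 0);
    for (i = 0; i < nents; i++) {
        if (mapped > 0 &&
            out[mapped - 1].addr + out[mapped - 1].len == in[i].addr) {
            out[mapped - 1].len += in[i].len;
            continue;
        }
        out[mapped++] = in[i];
    }
    return mapped;
}

/*
 * Hypothetical stand-in for ib_dma_unmap_sg(): returns 0 only when it
 * is given the same @nents that was passed to the map call.
 */
int toy_unmap_sg(const struct toy_mr *mr, int nents)
{
    return nents == mr->nents ? 0 : -1;
}

/* Map three entries, two of them contiguous, and record both counts. */
struct toy_mr toy_map_example(void)
{
    const struct toy_sg in[3] = {
        { 0x1000, 0x1000 }, { 0x2000, 0x1000 }, { 0x8000, 0x1000 },
    };
    struct toy_sg out[3];
    struct toy_mr mr;

    mr.nents = 3;
    mr.dma_nents = toy_map_sg(in, mr.nents, out);
    return mr;
}
```

In this model the caller keeps both counts: the returned count describes the mapped entries, while the original count is the one that belongs in the unmap call.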
     

08 Feb, 2020

3 commits

  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - Server-to-server copy code from Olga.

    To use it, client and both servers must have support, the target
    server must be able to access the source server over NFSv4.2, and
    the target server must have the inter_copy_offload_enable module
    parameter set.

    - Improvements and bugfixes for the new filehandle cache, especially
    in the container case, from Trond

    - Also from Trond, better reporting of write errors.

    - Y2038 work from Arnd"

    * tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
    sunrpc: expiry_time should be seconds not timeval
    nfsd: make nfsd_filecache_wq variable static
    nfsd4: fix double free in nfsd4_do_async_copy()
    nfsd: convert file cache to use over/underflow safe refcount
    nfsd: Define the file access mode enum for tracing
    nfsd: Fix a perf warning
    nfsd: Ensure sampling of the write verifier is atomic with the write
    nfsd: Ensure sampling of the commit verifier is atomic with the commit
    sunrpc: clean up cache entry add/remove from hashtable
    sunrpc: Fix potential leaks in sunrpc_cache_unhash()
    nfsd: Ensure exclusion between CLONE and WRITE errors
    nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
    nfsd: Update the boot verifier on stable writes too.
    nfsd: Fix stable writes
    nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
    nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
    nfsd: Reduce the number of calls to nfsd_file_gc()
    nfsd: Schedule the laundrette regularly irrespective of file errors
    nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
    nfsd: Containerise filecache laundrette
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Anna Schumaker:
    "Stable bugfixes:
    - Fix memory leaks and corruption in readdir # v2.6.37+
    - Directory page cache needs to be locked when read # v2.6.37+

    New features:
    - Convert NFS to use the new mount API
    - Add "softreval" mount option to let clients use cache if server goes down
    - Add a config option to compile without UDP support
    - Limit the number of inactive delegations the client can cache at once
    - Improved readdir concurrency using iterate_shared()

    Other bugfixes and cleanups:
    - More 64-bit time conversions
    - Add additional diagnostic tracepoints
    - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
    - Various xprtrdma cleanups to prepare for 5.7's changes
    - Several fixes for NFS writeback and commit handling
    - Fix acls over krb5i/krb5p mounts
    - Recover from premature loss of openstateids
    - Fix NFS v3 chacl and chmod bug
    - Compare creds using cred_fscmp()
    - Use kmemdup_nul() in more places
    - Optimize readdir cache page invalidation
    - Lease renewal and recovery fixes"

    * tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
    NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
    NFSv4: try lease recovery on NFS4ERR_EXPIRED
    NFS: Fix memory leaks
    nfs: optimise readdir cache page invalidation
    NFS: Switch readdir to using iterate_shared()
    NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
    NFS: Directory page cache pages need to be locked when read
    NFS: Fix memory leaks and corruption in readdir
    SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
    NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
    NFSv4: Limit the total number of cached delegations
    NFSv4: Add accounting for the number of active delegations held
    NFSv4: Try to return the delegation immediately when marked for return on close
    NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
    NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
    NFS: nfs_find_open_context() should use cred_fscmp()
    NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
    NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
    NFS: remove unused macros
    nfs: Return EINVAL rather than ERANGE for mount parse errors
    ...

    Linus Torvalds
     
  • When upcalling gssproxy, cache_head.expiry_time is set as a
    timeval, not seconds since boot. As such, RPC cache expiry
    logic will not clean expired objects created under
    auth.rpcsec.context cache.

    This has proven to cause kernel memory leaks in the field. Use the
    64-bit variants of getboottime/timespec.

    Expiration times have worked this way since 2010's c5b29f885afe "sunrpc:
    use seconds since boot in expiry cache". The gssproxy code introduced
    in 2012 added gss_proxy_save_rsc and introduced the bug. That's a while
    for this to lurk, but it required a bit of an extreme case to make it
    obvious.

    Signed-off-by: Roberto Bergantinos Corpas
    Cc: stable@vger.kernel.org
    Fixes: 030d794bf498 "SUNRPC: Use gssproxy upcall for server..."
    Tested-By: Frank Sorenson
    Signed-off-by: J. Bruce Fields

    Roberto Bergantinos Corpas
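
The unit mismatch described in the commit above can be sketched in user space. The names below (toy_cache_head, toy_cache_is_expired, toy_expiry_from_wallclock) are hypothetical and only model the idea that the cache compares expiry_time against seconds since boot; they are not the sunrpc code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cache entry: expiry is meant to be in seconds since boot. */
struct toy_cache_head {
    int64_t expiry_time;
};

/* An entry is expired once the boot-relative clock passes its expiry. */
bool toy_cache_is_expired(const struct toy_cache_head *h, int64_t now_boot_secs)
{
    assert(h != NULL);
    return h->expiry_time < now_boot_secs;
}

/*
 * Convert an absolute expiry (seconds since the epoch) to seconds since
 * boot by subtracting the wall-clock time at which the system booted.
 */
int64_t toy_expiry_from_wallclock(int64_t expiry_wall_secs, int64_t boot_wall_secs)
{
    return expiry_wall_secs - boot_wall_secs;
}
```

If the absolute value is stored unconverted, it is far larger than any realistic uptime, so the entry never looks expired; after conversion the same entry expires as intended.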
     

04 Feb, 2020

2 commits

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Using kmemdup_nul() is more efficient when the length is known.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker

    Trond Myklebust
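
As a rough user-space illustration of why a known length helps, the sketch below mimics a kmemdup_nul()-style helper. toy_memdup_nul() is a hypothetical name, and malloc() stands in for the kernel allocator; this is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate the first @len bytes of @src and append a NUL terminator.
 * Because the caller already knows @len, the source is copied with one
 * memcpy() and is never scanned for a terminator.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *toy_memdup_nul(const char *src, size_t len)
{
    char *buf;

    assert(src != NULL);
    buf = malloc(len + 1);
    if (!buf)
        return NULL;
    memcpy(buf, src, len);
    buf[len] = '\0';
    return buf;
}
```

A strndup-style helper would first have to walk the source looking for a NUL within the limit, which is wasted work when the length is already in hand.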
     

30 Jan, 2020

1 commit

  • …/kernel/git/arnd/playground

    Pull y2038 updates from Arnd Bergmann:
    "Core, driver and file system changes

    These are updates to device drivers and file systems that for some
    reason or another were not included in the kernel in the previous
    y2038 series.

    I've gone through all users of time_t again to make sure the kernel is
    in a long-term maintainable state, replacing all remaining references
    to time_t with safe alternatives.

    Some related parts of the series were picked up into the nfsd, xfs,
    alsa and v4l2 trees. A final set of patches in linux-mm removes the
    now unused time_t/timeval/timespec types and helper functions after
    all five branches are merged for linux-5.6, ensuring that no new users
    get merged.

    As a result, linux-5.6, or my backport of the patches to 5.4 [1],
    should be the first release that can serve as a base for a 32-bit
    system designed to run beyond year 2038, with a few remaining caveats:

    - All user space must be compiled with a 64-bit time_t, which will be
    supported in the coming musl-1.2 and glibc-2.32 releases, along
    with installed kernel headers from linux-5.6 or higher.

    - Applications that use the system call interfaces directly need to
    be ported to use the time64 syscalls added in linux-5.1 in place of
    the existing system calls. This impacts most users of futex() and
    seccomp() as well as programming languages that have their own
    runtime environment not based on libc.

    - Applications that use a private copy of kernel uapi header files or
    their contents may need to update to the linux-5.6 version, in
    particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
    linux/elfcore.h, linux/sockios.h, linux/timex.h and
    linux/can/bcm.h.

    - A few remaining interfaces cannot be changed to pass a 64-bit
    time_t in a compatible way, so they must be configured to use
    CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
    timestamps. Most importantly this impacts all users of 'struct
    input_event'.

    - All y2038 problems that are present on 64-bit machines also apply
    to 32-bit machines. In particular this affects file systems with
    on-disk timestamps using signed 32-bit seconds: ext4 with
    ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame

    * tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
    Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
    y2038: sh: remove timeval/timespec usage from headers
    y2038: sparc: remove use of struct timex
    y2038: rename itimerval to __kernel_old_itimerval
    y2038: remove obsolete jiffies conversion functions
    nfs: fscache: use timespec64 in inode auxdata
    nfs: fix timstamp debug prints
    nfs: use time64_t internally
    sunrpc: convert to time64_t for expiry
    drm/etnaviv: avoid deprecated timespec
    drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
    drm/msm: avoid using 'timespec'
    hfs/hfsplus: use 64-bit inode timestamps
    hostfs: pass 64-bit timestamps to/from user space
    packet: clarify timestamp overflow
    tsacct: add 64-bit btime field
    acct: stop using get_seconds()
    um: ubd: use 64-bit time_t where possible
    xtensa: ISS: avoid struct timeval
    dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
    ...

    Linus Torvalds
     

23 Jan, 2020

2 commits


15 Jan, 2020

17 commits

  • Remove gss_mech_list_pseudoflavors() and its callers. This is part of
    an unused API, and could leak an RCU reference if it were ever called.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker

    Trond Myklebust
     
  • Clean up: This simplifies the logic in rpcrdma_post_recvs.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • To safely get rid of all rpcrdma_reps from a particular connection
    instance, xprtrdma has to wait until each of those reps is finished
    being used. A rep may be backing the rq_rcv_buf of an RPC that has
    just completed, for example.

    Since it is safe to invoke rpcrdma_rep_destroy() only in the Receive
    completion handler, simply mark reps remaining in the rb_all_reps
    list after the transport is drained. These will then be deleted as
    rpcrdma_post_recvs pulls them off the rep free list.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • This reduces the hardware and memory footprint of an unconnected
    transport.

    At some point in the future, transport reconnect will allow
    resolving the destination IP address through a different device. The
    current change enables reps for the new connection to be allocated
    on whichever NUMA node the new device affines to after a reconnect.

    Note that this does not destroy _all_ the transport's reps... there
    will be a few that are still part of a running RPC completion.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Currently the underlying RDMA device is chosen at transport set-up
    time. But it will soon be at connect time instead.

    The maximum size of a transport header is based on device
    capabilities. Thus transport header buffers have to be allocated
    _after_ the underlying device has been chosen (via address and route
    resolution); ie, in the connect worker.

    Thus, move the allocation of transport header buffers to the connect
    worker, after the point at which the underlying RDMA device has been
    chosen.

    This also means the RDMA device is available to do a DMA mapping of
    these buffers at connect time, instead of in the hot I/O path. Make
    that optimization as well.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Refactor: Perform the "is supported" check in rpcrdma_ep_create()
    instead of in rpcrdma_ia_open(). frwr_open() is where most of the
    logic to query device attributes is already located.

    The current code displays a redundant error message when the device
    does not support FRWR. As an additional clean-up, this patch removes
    the extra message.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • To support device hotplug and migrating a connection between devices
    of different capabilities, we have to guarantee that all in-kernel
    devices can support the same max NFS payload size (1 megabyte).

    This means that possibly one or two in-tree devices are no longer
    supported for NFS/RDMA because they cannot support 1MB rsize/wsize.
    The only one I confirmed was cxgb3, but it has already been removed
    from the kernel.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: there is no need to keep two copies of the same value.
    Also, in subsequent patches, rpcrdma_ep_create() will be called in
    the connect worker rather than at set-up time.

    Minor fix: Initialize the transport's sendctx to the value based on
    the capabilities of the underlying device, not the maximum setting.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The size of the sendctx queue depends on the value stored in
    ia->ri_max_send_sges. This value is determined by querying the
    underlying device.

    Eventually, rpcrdma_ia_open() and rpcrdma_ep_create() will be called
    in the connect worker rather than at transport set-up time. The
    underlying device will not have been chosen at transport set-up time.

    The sendctx queue will thus have to be created after the underlying
    device has been chosen via address and route resolution; in other
    words, in the connect worker.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean-up. The max_send_sge value also happens to be stored in
    ep->rep_attr. Let's keep just a single copy.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The empty_iov structure is only copied into another structure,
    so make it const.

    The opportunity for this change was found using Coccinelle.

    Signed-off-by: Julia Lawall
    Signed-off-by: Anna Schumaker

    Julia Lawall
     
  • The xprtrdma connect logic can return -EPROTO if the underlying
    device or network path does not support RDMA. This can happen
    after a device removal/insertion.

    - When SOFTCONN is set, EPROTO is a permanent error.

    - When SOFTCONN is not set, EPROTO is treated as a temporary error.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
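
The two cases listed in the commit above amount to a small decision rule. The sketch below is a simplified model with hypothetical names (toy_connect_action, toy_eproto_action); it does not reproduce the RPC client's connect status handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum toy_connect_action {
    TOY_CONNECT_FAIL,   /* permanent: complete the RPC with the error */
    TOY_CONNECT_RETRY,  /* temporary: try to connect again later */
};

/*
 * Policy for a connect attempt that ended with -EPROTO.
 * Other status codes are outside the scope of this sketch.
 */
enum toy_connect_action toy_eproto_action(int status, bool softconn)
{
    assert(status == -EPROTO);
    return softconn ? TOY_CONNECT_FAIL : TOY_CONNECT_RETRY;
}
```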
     
  • Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Using signed 32-bit types for UTC time leads to the y2038 overflow,
    which is what happens in the sunrpc code at the moment.

    This changes the sunrpc code over to use time64_t where possible.
    The one exception is the gss_import_v{1,2}_context() function for
    kerberos5, which uses 32-bit timestamps in the protocol. Here,
    we can at least treat the numbers as 'unsigned', which extends the
    range from 2038 to 2106.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Anna Schumaker

    Arnd Bergmann
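
The range extension mentioned in the commit above comes down to how a 32-bit field is widened. A minimal sketch, with hypothetical helper names and assuming the usual two's-complement behaviour when a large unsigned value is converted to int32_t:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit wire timestamp as a signed value. Values at or above
 * 2^31 become negative, which is the 2038 overflow. (The conversion to
 * int32_t is implementation-defined in C; common compilers wrap.)
 */
int64_t toy_time_from_wire_signed(uint32_t wire)
{
    return (int64_t)(int32_t)wire;
}

/* Widen the same field as unsigned: values stay positive up to 2^32 - 1. */
int64_t toy_time_from_wire_unsigned(uint32_t wire)
{
    return (int64_t)wire;
}

/* Self-check of the boundary value just past the signed range. */
void toy_time_selfcheck(void)
{
    assert(toy_time_from_wire_signed(0x80000000u) < 0);
    assert(toy_time_from_wire_unsigned(0x80000000u) == 2147483648LL);
}
```

Read as unsigned, the largest wire value is 4294967295 seconds after the epoch, which falls in 2106.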
     
  • Since v5.4, a device removal occasionally triggered this oops:

    Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
    Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
    Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
    Dec 2 17:13:53 manet kernel: PGD 0 P4D 0
    Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
    Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883
    Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
    Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
    Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
    Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
    Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
    Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
    Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
    Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
    Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
    Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
    Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
    Dec 2 17:13:53 manet kernel: Call Trace:
    Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
    Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
    Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9
    Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
    Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

    The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
    is still pointing to the old ib_device, which has been freed. The
    only way that is possible is if this rpcrdma_rep was not destroyed
    by rpcrdma_ia_remove.

    Debugging showed that was indeed the case: this rpcrdma_rep was
    still in use by a completing RPC at the time of the device removal,
    and thus wasn't on the rep free list. So, it was not found by
    rpcrdma_reps_destroy().

    The fix is to introduce a list of all rpcrdma_reps so that they all
    can be found when a device is removed. That list is used to perform
    only regbuf DMA unmapping, replacing that call to
    rpcrdma_reps_destroy().

    Meanwhile, to prevent corruption of this list, I've moved the
    destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
    rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
    not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
    protecting the rb_all_reps list.

    Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
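
The difference between searching only the free list and walking a list of every rep can be modelled in a few lines. This is a user-space sketch with hypothetical types and names (toy_rep, toy_unmap_free_only, toy_unmap_all); the free-list search is approximated as a filter over one list, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy receive buffer descriptor. */
struct toy_rep {
    bool dma_mapped;          /* still mapped for the old device */
    bool on_free_list;        /* false while an RPC is using it */
    struct toy_rep *next_all; /* link in the list of every rep */
};

/*
 * Old behaviour, modelled as a filter: only reps that sit on the free
 * list are found. Returns the number of reps unmapped.
 */
int toy_unmap_free_only(struct toy_rep *all)
{
    struct toy_rep *r;
    int n = 0;

    for (r = all; r; r = r->next_all) {
        if (r->on_free_list && r->dma_mapped) {
            r->dma_mapped = false;
            n++;
        }
    }
    return n;
}

/*
 * New behaviour: walk every rep, in use or not, and unmap it.
 * Returns the number of reps unmapped.
 */
int toy_unmap_all(struct toy_rep *all)
{
    struct toy_rep *r;
    int n = 0;

    for (r = all; r; r = r->next_all) {
        if (r->dma_mapped) {
            r->dma_mapped = false;
            n++;
        }
    }
    /* Postcondition: nothing is left pointing at the old device. */
    for (r = all; r; r = r->next_all)
        assert(!r->dma_mapped);
    return n;
}
```

With three reps of which one is still in use, the first function leaves that rep mapped, which corresponds to the stale pointer behind the oops; the second one reaches it.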
     
  • I've found that on occasion, "rmmod" will hang if an NFS mount is
    under load.

    Ensure that ri_remove_done is initialized only just before the
    transport is woken up to force a close. This avoids the completion
    possibly getting initialized again while the CM event handler is
    waiting for a wake-up.

    Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • On device re-insertion, the RDMA device driver crashes trying to set
    up a new QP:

    Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
    Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
    Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
    Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
    Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
    Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852
    Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
    Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
    Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
    Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
    Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
    Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
    Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
    Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
    Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
    Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
    Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
    Nov 27 16:32:06 manet kernel: Call Trace:
    Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
    Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
    Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
    Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
    Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

    The fix is to copy the qp_init_attr struct that was just created by
    rpcrdma_ep_create() instead of using the one from the previous
    connection instance.

    Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     

19 Dec, 2019

2 commits

  • The timestamps for the cache are all in boottime seconds, so they
    don't overflow 32-bit values, but the use of time_t is deprecated
    because it generally does overflow when used with wall-clock time.

    There are multiple possible ways of avoiding it:

    - leave time_t, which is safe here, but forces others to
    look into this code to determine that it is over and over.

    - use a more generic type, like 'int' or 'long', which is known
    to be sufficient here but loses the documentation of referring
    to timestamps

    - use ktime_t everywhere, and convert into seconds in the few
    places where we want realtime-seconds. The conversion is
    sometimes expensive, but not more so than the conversion we
    do today.

    - use time64_t to clarify that this code is safe. Nothing would
    change for 64-bit architectures, but it is slightly less
    efficient on 32-bit architectures.

    Without a clear winner of the three approaches above, this picks
    the last one, favouring readability over a small performance
    loss on 32-bit architectures.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • Using signed 32-bit types for UTC time leads to the y2038 overflow,
    which is what happens in the sunrpc code at the moment.

    This changes the sunrpc code over to use time64_t where possible.
    The one exception is the gss_import_v{1,2}_context() function for
    kerberos5, which uses 32-bit timestamps in the protocol. Here,
    we can at least treat the numbers as 'unsigned', which extends the
    range from 2038 to 2106.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

08 Dec, 2019

2 commits

  • Pull nfsd updates from Bruce Fields:
    "This is a relatively quiet cycle for nfsd, mainly various bugfixes.

    Possibly most interesting is Trond's fixes for some callback races
    that were due to my incomplete understanding of rpc client shutdown.
    Unfortunately at the last minute I've started noticing a new
    intermittent failure to send callbacks. As the logic seems basically
    correct, I'm leaving Trond's patches in for now, and hope to find a
    fix in the next week so I don't have to revert those patches"

    * tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
    nfsd: depend on CRYPTO_MD5 for legacy client tracking
    NFSD fixing possible null pointer derefering in copy offload
    nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
    nfsd: Ensure CLONE persists data and metadata changes to the target file
    SUNRPC: Fix backchannel latency metrics
    nfsd: restore NFSv3 ACL support
    nfsd: v4 support requires CRYPTO_SHA256
    nfsd: Fix cld_net->cn_tfm initialization
    lockd: remove __KERNEL__ ifdefs
    sunrpc: remove __KERNEL__ ifdefs
    race in exportfs_decode_fh()
    nfsd: Drop LIST_HEAD where the variable it declares is never used.
    nfsd: document callback_wq serialization of callback code
    nfsd: mark cb path down on unknown errors
    nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
    nfsd: minor 4.1 callback cleanup
    SUNRPC: Fix svcauth_gss_proxy_init()
    SUNRPC: Trace gssproxy upcall results
    sunrpc: fix crash when cache_head become valid before update
    nfsd: remove private bin2hex implementation
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Features:

    - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
    copy of a file from one source server to a different target
    server).

    - New RDMA tracepoints for debugging congestion control and Local
    Invalidate WRs.

    Bugfixes and cleanups

    - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
    layoutreturn

    - Handle bad/dead sessions correctly in nfs41_sequence_process()

    - Various bugfixes to the delegation return operation.

    - Various bugfixes pertaining to delegations that have been revoked.

    - Cleanups to the NFS timespec code to avoid unnecessary conversions
    between timespec and timespec64.

    - Fix unstable RDMA connections after a reconnect

    - Close race between waking an RDMA sender and posting a receive

    - Wake pending RDMA tasks if connection fails

    - Fix MR list corruption, and clean up MR usage

    - Fix another RPCSEC_GSS issue with MIC buffer space"

    * tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
    SUNRPC: Capture completion of all RPC tasks
    SUNRPC: Fix another issue with MIC buffer space
    NFS4: Trace lock reclaims
    NFS4: Trace state recovery operation
    NFSv4.2 fix memory leak in nfs42_ssc_open
    NFSv4.2 fix kfree in __nfs42_copy_file_range
    NFS: remove duplicated include from nfs4file.c
    NFSv4: Make _nfs42_proc_copy_notify() static
    NFS: Fallocate should use the nfs4_fattr_bitmap
    NFS: Return -ETXTBSY when attempting to write to a swapfile
    fs: nfs: sysfs: Remove NULL check before kfree
    NFS: remove unneeded semicolon
    NFSv4: add declaration of current_stateid
    NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
    NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
    nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
    SUNRPC: Avoid RPC delays when exiting suspend
    NFS: Add a tracepoint in nfs_fh_to_dentry()
    NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
    NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
    ...

    Linus Torvalds
     

05 Dec, 2019

1 commit

  • blocking_notifier_chain_cond_register() does not consider system_booting
    state, which is the only difference between this function and
    blocking_notifier_chain_register(). This can be a bug and is a piece of
    duplicate code.

    Delete blocking_notifier_chain_cond_register()

    Link: http://lkml.kernel.org/r/1568861888-34045-4-git-send-email-nixiaoming@huawei.com
    Signed-off-by: Xiaoming Ni
    Reviewed-by: Andrew Morton
    Cc: Alan Stern
    Cc: Alexey Dobriyan
    Cc: Andy Lutomirski
    Cc: Anna Schumaker
    Cc: Arjan van de Ven
    Cc: Chuck Lever
    Cc: David S. Miller
    Cc: Ingo Molnar
    Cc: J. Bruce Fields
    Cc: Jeff Layton
    Cc: Nadia Derbey
    Cc: "Paul E. McKenney"
    Cc: Sam Protsenko
    Cc: Thomas Gleixner
    Cc: Trond Myklebust
    Cc: Vasily Averin
    Cc: Viresh Kumar
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaoming Ni
     

23 Nov, 2019

1 commit

  • RPC tasks on the backchannel never invoke xprt_complete_rqst(), so
    there is no way to report their tk_status at completion. Also, any
    RPC task that exits via rpc_exit_task() before it is replied to will
    also disappear without a trace.

    Introduce a trace point that is symmetrical with rpc_task_begin that
    captures the termination status of each RPC task.

    Sample trace output for callback requests initiated on the server:
    kworker/u8:12-448 [003] 127.025240: rpc_task_end: task:50@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
    kworker/u8:12-448 [002] 127.567310: rpc_task_end: task:51@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
    kworker/u8:12-448 [001] 130.506817: rpc_task_end: task:52@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task

    Odd, though, that I never see trace_rpc_task_complete, either in the
    forward or backchannel. Should it be removed?

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

22 Nov, 2019

1 commit

  • I noticed that for callback requests, the reported backlog latency
    is always zero, and the rtt value is crazy big. The problem was that
    rqst->rq_xtime is never set for backchannel requests.

    Fixes: 78215759e20d ("SUNRPC: Make RTT measurement more ... ")
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    Chuck Lever
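
Why an unset transmit timestamp produces a huge round-trip time can be seen from the subtraction alone. The sketch below borrows the rq_xtime field name from the commit above, but toy_rqst, toy_rtt_ns() and toy_mark_sent() are hypothetical, and plain nanosecond counters stand in for the kernel's time type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy request: timestamps are nanoseconds on a monotonic clock. */
struct toy_rqst {
    uint64_t rq_xtime;  /* when the request was sent; 0 if never set */
};

/* Round-trip time is "now" minus the transmit timestamp. */
uint64_t toy_rtt_ns(const struct toy_rqst *rq, uint64_t now_ns)
{
    assert(rq != NULL);
    return now_ns - rq->rq_xtime;
}

/* Recording the send time is what keeps the metric meaningful. */
void toy_mark_sent(struct toy_rqst *rq, uint64_t now_ns)
{
    assert(rq != NULL);
    rq->rq_xtime = now_ns;
}
```

With rq_xtime left at zero, the computed value is simply the current clock reading, which matches the "crazy big" figure reported for backchannel requests.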
     

18 Nov, 2019

2 commits

  • xdr_shrink_pagelen() BUG's when @len is larger than buf->page_len.
    This can happen when xdr_buf_read_mic() is given an xdr_buf with
    a small page array (like, only a few bytes).

    Instead, just cap the number of bytes that xdr_shrink_pagelen()
    will move.

    Fixes: 5f1bc39979d ("SUNRPC: Fix buffer handling of GSS MIC ... ")
    Signed-off-by: Chuck Lever
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust

    Chuck Lever
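
The capping approach described in the commit above can be sketched as a clamp before the subtraction. toy_xdr_buf and toy_shrink_pagelen() are hypothetical and track only the page data length; the real function also moves bytes between parts of the buffer, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy xdr_buf: only the length of the page data matters here. */
struct toy_xdr_buf {
    size_t page_len;
};

/*
 * Shrink the page data by up to @len bytes. Rather than treating
 * len > page_len as a fatal condition, cap the amount that is moved.
 * Returns the number of bytes actually removed from the page data.
 */
size_t toy_shrink_pagelen(struct toy_xdr_buf *buf, size_t len)
{
    assert(buf != NULL);
    if (len > buf->page_len)
        len = buf->page_len;
    buf->page_len -= len;
    return len;
}
```

A buffer holding only a few bytes of page data then simply ends up with an empty page section instead of tripping an assertion.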
     
  • NFSoRDMA Client Updates for Linux 5.5

    New Features:
    - New tracepoints for congestion control and Local Invalidate WRs

    Bugfixes and Cleanups:
    - Eliminate log noise in call_reserveresult
    - Fix unstable connections after a reconnect
    - Clean up some code duplication
    - Close race between waking a sender and posting a receive
    - Fix MR list corruption, and clean up MR usage
    - Remove unused rpcrdma_sendctx fields
    - Try to avoid DMA mapping pages if it is too costly
    - Wake pending tasks if connection fails
    - Replace some dprintk()s with tracepoints

    Trond Myklebust
     

06 Nov, 2019

1 commit

  • Jon Hunter: "I have been tracking down another suspend/NFS related
    issue where again I am seeing random delays exiting suspend. The delays
    can be up to a couple minutes in the worst case and this is causing a
    suspend test we have to fail."

    Change the use of a deferrable work to a standard delayed one.

    Reported-by: Jon Hunter
    Tested-by: Jon Hunter
    Fixes: 7e0a0e38fcfea ("SUNRPC: Replace the queue timer with a delayed work function")
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

04 Nov, 2019

1 commit

  • NFSv2, v3 and NFSv4 servers often have duplicate replay caches that look
    at the source port when deciding whether or not an RPC call is a replay
    of a previous call. This requires clients to perform strange TCP gymnastics
    in order to ensure that when they reconnect to the server, they bind
    to the same source port.

    NFSv4.1 and NFSv4.2 have sessions that provide proper replay semantics,
    that do not look at the source port of the connection. This patch therefore
    ensures they can ignore the rebind requirement.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

31 Oct, 2019

3 commits

  • gss_read_proxy_verf() assumes things about the XDR buffer containing
    the RPC Call that are not true for buffers generated by
    svc_rdma_recv().

    RDMA's buffers look more like what the upper layer generates for
    sending: head is a kmalloc'd buffer; it does not point to a page
    whose contents are contiguous with the first page in the buffers'
    page array. The result is that ACCEPT_SEC_CONTEXT via RPC/RDMA has
    stopped working on Linux NFS servers that use gssproxy.

    This does not affect clients that use only TCP to send their
    ACCEPT_SEC_CONTEXT operation (that's all Linux clients). Other
    clients, like Solaris NFS clients, send ACCEPT_SEC_CONTEXT on the
    same transport as they send all other NFS operations. Such clients
    can send ACCEPT_SEC_CONTEXT via RPC/RDMA.

    I thought I had found every direct reference in the server RPC code
    to the rqstp->rq_pages field.

    Bug found at the 2019 Westford NFS bake-a-thon.

    Fixes: 3316f0631139 ("svcrdma: Persistently allocate and DMA- ... ")
    Signed-off-by: Chuck Lever
    Tested-by: Bill Baker
    Reviewed-by: Simo Sorce
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     
  • Record results of a GSS proxy ACCEPT_SEC_CONTEXT upcall and the
    svc_authenticate() function to make field debugging of NFS server
    Kerberos issues easier.

    Signed-off-by: Chuck Lever
    Reviewed-by: Bill Baker
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     
  • When we're destroying the host transport mechanism, we should ensure
    that we do not leak memory by failing to release any back channel
    slots that might still exist.

    Reported-by: Neil Brown
    Reported-by: kbuild test robot
    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker

    Trond Myklebust