10 Dec, 2014

15 commits

  • There is a nice kernel helper to escape a given string by provided
    rules. Let's use it instead of a custom approach.
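
    A minimal sketch, assuming the helper in question is
    string_escape_str() from <linux/string_helpers.h>; the flags, the
    escape set, and the wrapper name here are illustrative, not taken
    from the patch:

        #include <linux/errno.h>
        #include <linux/string_helpers.h>

        /* Escape backslash and whitespace in "src" as \ooo octal
         * sequences. The helper returns the size the escaped output
         * needs, so a too-small destination can be detected. */
        static int example_escape(const char *src, char *dst, int len)
        {
                int ret = string_escape_str(src, dst, len,
                                            ESCAPE_OCTAL, "\\ \n\t");

                return ret >= len ? -ENAMETOOLONG : ret;
        }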

    Signed-off-by: Andy Shevchenko
    [bfields@redhat.com: fix length calculation]
    Signed-off-by: J. Bruce Fields

    Andy Shevchenko
     
  • ...move the WARN_ON_ONCE inside the following if block since they use
    the same condition.
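
    A hedged before/after sketch of the pattern (the condition and the
    list names are illustrative):

        /* before: the same condition is evaluated twice */
        WARN_ON_ONCE(!list_empty(&rqstp->rq_list));
        if (!list_empty(&rqstp->rq_list))
                list_del_init(&rqstp->rq_list);

        /* after: one evaluation, same one-shot warning */
        if (!list_empty(&rqstp->rq_list)) {
                WARN_ON_ONCE(1);
                list_del_init(&rqstp->rq_list);
        }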

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • These were useful when I was tracking down a race condition between
    svc_xprt_do_enqueue and svc_get_next_xprt.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Testing has shown that the pool->sp_lock can be a bottleneck on a busy
    server. Every time data is received on a socket, the server must take
    that lock in order to dequeue a thread from the sp_threads list.

    Address this problem by eliminating the sp_threads list (which contains
    threads that are currently idle) and replacing it with a RQ_BUSY flag in
    svc_rqst. This allows us to walk the sp_all_threads list under the
    rcu_read_lock and find a suitable thread for the xprt by doing a
    test_and_set_bit.

    Note, however, that we still have a potential atomicity problem with
    this approach. We don't want svc_xprt_do_enqueue to set the
    rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
    zero (which indicates that the thread was idle). But by the time we
    check that, the bit could be flipped by a waking thread.

    To address this, we acquire a new per-rqst spinlock (rq_lock) and take
    that before doing the test_and_set_bit. If that returns false, then we
    can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
    it must set the bit under the same spinlock and can trust that if it was
    already set then the rq_xprt is also properly set.

    With this scheme, the case where we have an idle thread no longer needs
    to take the highly contended pool->sp_lock at all, and that removes the
    bottleneck.

    That still leaves one issue: what of the case where we walk the whole
    sp_all_threads list and don't find an idle thread? Because the search
    is lockless, it's possible for the queueing to race with a thread that
    is going to sleep. To address that, we queue the xprt and then search
    again.

    If we find an idle thread at that point, we can't attach the xprt to it
    directly since that might race with a different thread waking up and
    finding it. All we can do is wake the idle thread back up and let it
    attempt to find the now-queued xprt.
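
    A condensed sketch of the enqueue side of this scheme; the
    structures are simplified and transport reference counting is
    elided, so treat the names as descriptive rather than exact:

        static bool example_hand_off_xprt(struct svc_pool *pool,
                                          struct svc_xprt *xprt)
        {
                struct svc_rqst *rqstp;

                rcu_read_lock();
                list_for_each_entry_rcu(rqstp, &pool->sp_all_threads,
                                        rq_all) {
                        /* rq_lock makes the idle test and the rq_xprt
                         * store atomic w.r.t. the waking thread */
                        spin_lock_bh(&rqstp->rq_lock);
                        if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
                                spin_unlock_bh(&rqstp->rq_lock);
                                continue;       /* busy: keep looking */
                        }
                        rqstp->rq_xprt = xprt;  /* idle: attach */
                        spin_unlock_bh(&rqstp->rq_lock);
                        wake_up_process(rqstp->rq_task);
                        rcu_read_unlock();
                        return true;
                }
                rcu_read_unlock();
                return false;   /* caller queues the xprt, searches again */
        }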

    Signed-off-by: Jeff Layton
    Tested-by: Chris Worley
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • In a later patch, we'll be removing some spinlocking around the socket
    and thread queueing code in order to fix some contention problems. At
    that point, the stats counters will no longer be protected by the
    sp_lock.

    Change the counters to atomic_long_t fields, except for the
    "sockets_queued" counter which will still be manipulated under a
    spinlock.
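
    A hedged sketch of the conversion (the stats structure is
    abridged):

        struct example_pool_stats {
                unsigned long   sockets_queued; /* still under a lock */
                atomic_long_t   packets;        /* formerly under sp_lock */
                atomic_long_t   threads_woken;
        };

        static inline void example_count_packet(struct example_pool_stats *s)
        {
                atomic_long_inc(&s->packets);   /* no sp_lock required */
        }

        /* readers use atomic_long_read(&s->packets) */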

    Signed-off-by: Jeff Layton
    Tested-by: Chris Worley
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • ...also make the manipulation of the sp_all_threads list use
    RCU-friendly functions.
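
    A hedged sketch (the rq_rcu_head member is an assumption for this
    sketch): writers still serialize on sp_lock, while the _rcu list
    ops plus a deferred free keep the lockless walkers described in the
    entry above safe:

        static void example_remove_thread(struct svc_pool *pool,
                                          struct svc_rqst *rqstp)
        {
                spin_lock_bh(&pool->sp_lock);
                list_del_rcu(&rqstp->rq_all);   /* was list_del() */
                spin_unlock_bh(&pool->sp_lock);

                /* free only after a grace period, so rcu_read_lock()
                 * walkers of sp_all_threads never see freed memory */
                kfree_rcu(rqstp, rq_rcu_head);
        }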

    Signed-off-by: Jeff Layton
    Tested-by: Chris Worley
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Currently all svc_create callers pass in NULL for the shutdown parm,
    which then gets fixed up to be svc_rpcb_cleanup if the service uses
    rpcbind.

    Simplify this by instead having the only caller that requires it
    (lockd) pass in svc_rpcb_cleanup, and get rid of the special casing.
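
    A sketch of the resulting calling convention (the buffer-size
    constant and the call site are illustrative):

        /* lockd, the one caller that needs rpcbind cleanup, now says
         * so explicitly; everyone else passes NULL */
        static struct svc_serv *example_lockd_create(void)
        {
                return svc_create(&nlmsvc_program, LOCKD_BUFSIZE,
                                  svc_rpcb_cleanup);
        }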

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • The way that svc_wake_up works is a bit inefficient. It walks all of the
    available pools for a service and either wakes up a task in each one or
    sets the SP_TASK_PENDING flag in each one.

    When svc_wake_up is called, there is no need to wake up more than one
    thread to do this work. In practice, only lockd currently uses this
    function and it's single threaded anyway. Thus, this just boils down to
    doing a wake up of a thread in pool 0 or setting a single flag.

    Eliminate the for loop in this function and change it to just operate on
    pool 0. Also update the comments that sit above it and get rid of some
    code that has been commented out for years now.
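
    A hedged sketch of the simplified function; the wakeup detail is
    condensed and the names mirror the description above:

        void example_svc_wake_up(struct svc_serv *serv)
        {
                struct svc_pool *pool = &serv->sv_pools[0]; /* pool 0 only */
                struct svc_rqst *rqstp;

                spin_lock_bh(&pool->sp_lock);
                if (!list_empty(&pool->sp_threads)) {
                        /* an idle thread exists: wake it */
                        rqstp = list_first_entry(&pool->sp_threads,
                                                 struct svc_rqst, rq_list);
                        wake_up_process(rqstp->rq_task);
                } else {
                        /* nobody idle: leave a note for the next thread */
                        set_bit(SP_TASK_PENDING, &pool->sp_flags);
                }
                spin_unlock_bh(&pool->sp_lock);
        }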

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • In a later patch, we'll want to be able to handle this flag without
    holding the sp_lock. Change this field to an unsigned long flags
    field, and declare a new flag in it that can be managed with atomic
    bitops.
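
    A minimal sketch of the change (structure abridged; the flag value
    follows the usual bit-index convention):

        #define SP_TASK_PENDING 0       /* work pending, no xprt queued */

        struct example_svc_pool {
                unsigned long   sp_flags;  /* replaces a bool under sp_lock */
                /* ... */
        };

        static bool example_take_pending(struct example_svc_pool *pool)
        {
                /* observe-and-clear atomically, without sp_lock */
                return test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags);
        }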

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • In a later patch, we're going to need some atomic bit flags. Since that
    field will need to be an unsigned long, we mitigate that space
    consumption by migrating some other bitflags to the new field. Start
    with the rq_secure flag.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Mainly what I need is 860a0d9e511f "sunrpc: add some tracepoints in
    svc_rqst handling functions", which subsequent server rpc patches from
    jlayton depend on. I'm merging this later tag on the assumption that's
    more likely to be a tested and stable point.

    J. Bruce Fields
     

02 Dec, 2014

2 commits


28 Nov, 2014

2 commits

  • Add a new directory hierarchy under the debugfs sunrpc/ directory:

    sunrpc/
        rpc_xprt/
            <xprt-id>/

    Within that directory, we can put files that give info about the
    xprts. We do have the (minor) problem that there is no succinct,
    unique identifier for rpc_xprts. So we generate them synthetically
    with a static atomic_t counter.

    For now, this directory just holds an "info" file, but we may add
    other files to it in the future.
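
    A hedged sketch of the registration (the fops and helper names are
    illustrative; real code would wire the "info" file to a seq_file
    show routine):

        #include <linux/debugfs.h>

        static const struct file_operations xprt_info_fops; /* stub */

        static struct dentry *example_xprt_debugfs(struct rpc_xprt *xprt,
                                                   struct dentry *parent)
        {
                static atomic_t cur_id; /* synthetic, unique xprt id */
                char name[16];
                struct dentry *dir;

                snprintf(name, sizeof(name), "%d",
                         atomic_inc_return(&cur_id));
                dir = debugfs_create_dir(name, parent); /* <xprt-id>/ */
                if (!dir)
                        return NULL;
                debugfs_create_file("info", S_IFREG | 0400, dir,
                                    xprt, &xprt_info_fops);
                return dir;
        }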

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     
  • It's possible to get a dump of the RPC task queue by writing a value to
    /proc/sys/sunrpc/rpc_debug. If you write any value to that file, you get
    a dump of the RPC client task list into the log buffer. This is a
    rather inconvenient interface, however, and it makes it hard to get
    immediate info about the task queue.

    Add a new directory hierarchy under debugfs:

    sunrpc/
        rpc_clnt/
            <clientid>/

    Within each clientid directory we create a new "tasks" file that will
    dump info similar to what shows up in the log buffer, but with a few
    small differences -- we avoid printing raw kernel addresses in favor of
    symbolic names and the XID is also displayed.
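
    A hedged sketch of such a "tasks" show routine (field use is
    illustrative; %ps prints a symbolic name for a function pointer
    instead of a raw address):

        static int example_tasks_show(struct seq_file *f, void *v)
        {
                struct rpc_clnt *clnt = f->private;
                struct rpc_task *task;

                spin_lock(&clnt->cl_lock);
                list_for_each_entry(task, &clnt->cl_tasks, tk_task)
                        seq_printf(f, "flags=%04x status=%d xid=0x%08x %ps\n",
                                   task->tk_flags, task->tk_status,
                                   task->tk_rqstp ?
                                        be32_to_cpu(task->tk_rqstp->rq_xid) : 0,
                                   task->tk_action);
                spin_unlock(&clnt->cl_lock);
                return 0;
        }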

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

27 Nov, 2014

2 commits

  • Pull NFS client RDMA changes for 3.19 from Anna Schumaker:
    "NFS: Client side changes for RDMA

    These patches provide various bugfixes and cleanups for using NFS over
    RDMA, including better error handling and performance improvements by
    using pad optimization.

    Signed-off-by: Anna Schumaker "

    * tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
    xprtrdma: Display async errors
    xprtrdma: Enable pad optimization
    xprtrdma: Re-write rpcrdma_flush_cqs()
    xprtrdma: Refactor tasklet scheduling
    xprtrdma: unmap all FMRs during transport disconnect
    xprtrdma: Cap req_cqinit
    xprtrdma: Return an errno from rpcrdma_register_external()

    Trond Myklebust
     
  • Pull additional NFS client changes for 3.19 from Anna Schumaker:
    "NFS: Generic client side changes from Chuck

    These patches include fixes for iostats and SETCLIENTID, in addition
    to cleaning up the nfs4_init_callback() function.

    Signed-off-by: Anna Schumaker "

    * tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
    NFS: Clean up nfs4_init_callback()
    NFS: SETCLIENTID XDR buffer sizes are incorrect
    SUNRPC: serialize iostats updates

    Trond Myklebust
     

26 Nov, 2014

8 commits

  • Occasionally mountstats reports a negative retransmission rate.
    Ensure that two RPCs completing concurrently don't confuse the sums
    in the transport's op_metrics array.

    Since pNFS filelayout can invoke rpc_count_iostats() on another
    transport from xprt_release(), we can't rely on simply holding the
    transport_lock in xprt_release(). There's nothing for it but hard
    serialization. One spin lock per RPC operation should make this as
    painless as it can be.
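
    A hedged sketch of the per-op serialization (the metrics structure
    is abridged):

        struct example_iostats {
                spinlock_t      om_lock;        /* new: one per op */
                unsigned long   om_ops;
                /* ... byte counts and timings ... */
        };

        static void example_count_one(struct example_iostats *op_metrics)
        {
                spin_lock(&op_metrics->om_lock);
                op_metrics->om_ops++;
                /* ... accumulate the remaining sums here ... */
                spin_unlock(&op_metrics->om_lock);
        }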

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • An async error upcall is a hard error, and should be reported in
    the system log.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
    pad optimization was enabled. That bug was fixed by commit
    e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
    it").

    We can now enable pad optimization on the client, which helps
    performance and is supported now by both Linux and Solaris servers.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Currently rpcrdma_flush_cqs() attempts to avoid code duplication, and
    simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall. That
    approach has two problems:

    1. rpcrdma_flush_cqs() can run concurrently with provider upcalls.
    Both flush_cqs() and the upcalls were invoking ib_poll_cq() in
    different threads using the same wc buffers (ep->rep_recv_wcs
    and ep->rep_send_wcs), added by commit 1c00dd077654 ("xprtrmda:
    Reduce calls to ib_poll_cq() in completion handlers").

    During transport disconnect processing, this sometimes resulted
    in the same reply getting added to the rpcrdma_tasklets_g list
    more than once, which corrupted the list.

    2. The upcall functions drain only a limited number of CQEs, thanks
    to the poll budget added by commit 8301a2c047cc ("xprtrdma: Limit
    work done by completion handler"), so a single flush call may not
    actually empty the CQs.

    Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
    BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Restore the separate function that schedules the reply handling
    tasklet. I need to call it from two different paths.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • When using RPCRDMA_MTHCAFMR memory registration, after a few
    transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
    return EINVAL because the provider has exhausted its map pool.

    Make sure that all FMRs are unmapped during transport disconnect,
    and that ->send_request remarshals them during an RPC retransmit.
    This resets the transport's MRs to ensure that none are leaked
    during a disconnect.
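
    A hedged sketch of the disconnect-side unmap (the array/count
    calling convention is illustrative):

        static int example_unmap_all_fmrs(struct ib_fmr **fmrs,
                                          unsigned int count)
        {
                LIST_HEAD(fmr_list);
                unsigned int i;

                for (i = 0; i < count; i++)
                        list_add_tail(&fmrs[i]->list, &fmr_list);

                /* one verb resets every FMR, so no mapping survives
                 * the disconnect */
                return ib_unmap_fmr(&fmr_list);
        }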

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Recent work made FRMR registration and invalidation completions
    unsignaled. This greatly reduces the adapter interrupt rate.

    Every so often, however, a posted send Work Request is allowed to
    signal. Otherwise, the provider's Work Queue will wrap and the
    workload will hang.

    The number of Work Requests that are allowed to remain unsignaled is
    determined by the value of req_cqinit. Currently, this is set to the
    size of the send Work Queue divided by two, minus 1.

    For FRMR, the send Work Queue is the maximum number of concurrent
    RPCs (currently 32) times the maximum number of Work Requests an
    RPC might use (currently 7, though some adapters may need more).

    For mlx4, this is 224 entries. This leaves completion signaling
    disabled for 111 send Work Requests.

    Some providers hold back dispatching Work Requests until a CQE is
    generated. If completions are disabled, then no CQEs are generated
    for quite some time, and that can stall the Work Queue.

    I've seen this occur running xfstests generic/113 over NFSv4, where
    eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
    because the Work Queue has overflowed. The connection is dropped
    and re-established.

    Cap the rep_cqinit setting so completions are not left turned off
    for too long.
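
    A hedged sketch of the cap, using the arithmetic above (32 RPCs *
    7 WRs = 224 entries; 224 / 2 - 1 = 111 unsignaled sends before the
    cap). The constant is illustrative:

        #define EXAMPLE_MAX_UNSIGNALED_SENDS    32

        static unsigned int example_cqinit(unsigned int send_queue_depth)
        {
                unsigned int cqinit = send_queue_depth / 2 - 1; /* old policy */

                return min_t(unsigned int, cqinit,
                             EXAMPLE_MAX_UNSIGNALED_SENDS);
        }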

    BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The RPC/RDMA send_request method and the chunk registration code
    expect an errno from the registration function. This allows
    the upper layers to distinguish between a recoverable failure
    (for example, temporary memory exhaustion) and a hard failure
    (for example, a bug in the registration logic).

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     

25 Nov, 2014

5 commits


20 Nov, 2014

2 commits


14 Nov, 2014

1 commit

  • Bruce reported that he was seeing the following BUG pop:

    BUG: sleeping function called from invalid context at mm/slab.c:2846
    in_atomic(): 0, irqs_disabled(): 0, pid: 4539, name: mount.nfs
    2 locks held by mount.nfs/4539:
    #0: (nfs_clid_init_mutex){+.+.+.}, at: [] nfs4_discover_server_trunking+0x4a/0x2f0 [nfsv4]
    #1: (rcu_read_lock){......}, at: [] gss_stringify_acceptor+0x5/0xb0 [auth_rpcgss]
    Preemption disabled at:[] printk+0x4d/0x4f

    CPU: 3 PID: 4539 Comm: mount.nfs Not tainted 3.18.0-rc1-00013-g5b095e9 #3393
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffff880021499390 ffff8800381476a8 ffffffff81a534cf 0000000000000001
    0000000000000000 ffff8800381476c8 ffffffff81097854 00000000000000d0
    0000000000000018 ffff880038147718 ffffffff8118e4f3 0000000020479f00
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] __might_sleep+0x114/0x180
    [] __kmalloc+0x1a3/0x280
    [] gss_stringify_acceptor+0x58/0xb0 [auth_rpcgss]
    [] ? gss_stringify_acceptor+0x5/0xb0 [auth_rpcgss]
    [] rpcauth_stringify_acceptor+0x18/0x30 [sunrpc]
    [] nfs4_proc_setclientid+0x199/0x380 [nfsv4]
    [] ? nfs4_proc_setclientid+0x200/0x380 [nfsv4]
    [] nfs40_discover_server_trunking+0xda/0x150 [nfsv4]
    [] ? nfs40_discover_server_trunking+0x5/0x150 [nfsv4]
    [] nfs4_discover_server_trunking+0x7f/0x2f0 [nfsv4]
    [] nfs4_init_client+0x104/0x2f0 [nfsv4]
    [] nfs_get_client+0x314/0x3f0 [nfs]
    [] ? nfs_get_client+0xe0/0x3f0 [nfs]
    [] nfs4_set_client+0x8a/0x110 [nfsv4]
    [] ? __rpc_init_priority_wait_queue+0xa8/0xf0 [sunrpc]
    [] nfs4_create_server+0x12f/0x390 [nfsv4]
    [] nfs4_remote_mount+0x32/0x60 [nfsv4]
    [] mount_fs+0x39/0x1b0
    [] ? __alloc_percpu+0x15/0x20
    [] vfs_kern_mount+0x6b/0x150
    [] nfs_do_root_mount+0x86/0xc0 [nfsv4]
    [] nfs4_try_mount+0x44/0xc0 [nfsv4]
    [] ? get_nfs_version+0x27/0x90 [nfs]
    [] nfs_fs_mount+0x47d/0xd60 [nfs]
    [] ? mutex_unlock+0xe/0x10
    [] ? nfs_remount+0x430/0x430 [nfs]
    [] ? nfs_clone_super+0x140/0x140 [nfs]
    [] mount_fs+0x39/0x1b0
    [] ? __alloc_percpu+0x15/0x20
    [] vfs_kern_mount+0x6b/0x150
    [] do_mount+0x210/0xbe0
    [] ? copy_mount_options+0x3a/0x160
    [] SyS_mount+0x6f/0xb0
    [] system_call_fastpath+0x12/0x17

    Sleeping under the rcu_read_lock is bad. This patch fixes it by dropping
    the rcu_read_lock before doing the allocation and then reacquiring it
    and redoing the dereference before doing the copy. If we find that the
    string has somehow grown in the meantime, we'll reallocate and try again.
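
    A hedged sketch of that allocate-outside-then-recheck pattern (the
    slot type and helper name are illustrative):

        static char *example_copy_rcu_string(const char __rcu **slot)
        {
                const char *s;
                char *buf;
                size_t len;

                rcu_read_lock();
                len = strlen(rcu_dereference(*slot));
                rcu_read_unlock();
        retry:
                buf = kmalloc(len + 1, GFP_KERNEL); /* may sleep: no RCU */
                if (!buf)
                        return NULL;
                rcu_read_lock();
                s = rcu_dereference(*slot);
                if (strlen(s) > len) {  /* grew while we allocated */
                        len = strlen(s);
                        rcu_read_unlock();
                        kfree(buf);
                        goto retry;
                }
                memcpy(buf, s, strlen(s) + 1);
                rcu_read_unlock();
                return buf;
        }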

    Cc: stable@vger.kernel.org # v3.17+
    Reported-by: "J. Bruce Fields"
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

29 Oct, 2014

1 commit

  • The m->pool_to[] array has "maxpools" elements. It's allocated in
    svc_pool_map_alloc_arrays(), which we called earlier in the function.
    This test should be >= instead of >.

    Signed-off-by: Dan Carpenter
    Signed-off-by: J. Bruce Fields

    Dan Carpenter
     

24 Oct, 2014

2 commits