11 Jul, 2007

2 commits

  • Clean up argument passing to functions for creating an RPC transport.

    Signed-off-by: Frank van Maarseveen
    Signed-off-by: Trond Myklebust

    Frank van Maarseveen
     
  • Brian Behlendorf writes:

    The root cause of the NFS hang we were observing appears to be a rare
    deadlock between the kernel provided usermodehelper API and the linux NFS
    client. The deadlock can arise because both of these services use the
    generic linux work queues. The usermodehelper API runs the specified user
    application in the context of the work queue. And NFS submits both cleanup
    and reconnect work to the generic work queue for handling. Normally this
    is fine but a deadlock can result in the following situation.

    - NFS client is in a disconnected state
    - [events/0] runs a usermodehelper app with an NFS dependent operation,
    this triggers an NFS reconnect.
    - NFS reconnect happens to be submitted to [events/0] work queue.
    - Deadlock, the [events/0] work queue will never process the
    reconnect because it is blocked on the previous NFS dependent
    operation which will not complete.

    The solution is simply to run reconnect requests on rpciod.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
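
The deadlock described in this commit can be modelled in user space. The following is only a minimal sketch, not kernel code: it assumes each queue is served by a single FIFO worker, and all names (queue_id, work_item, run_queues, scenario_completes) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum queue_id { Q_EVENTS, Q_RPCIOD, Q_COUNT };

struct work_item {
    enum queue_id queue; /* queue the item was submitted to */
    int waits_for;       /* index of the item it blocks on, or -1 */
    bool done;
};

/* Run every queue as a single FIFO worker until no item can make progress.
 * Returns true if all items completed, false if some are stuck. */
static bool run_queues(struct work_item *items, size_t n)
{
    bool progress = true;

    assert(items != NULL);
    while (progress) {
        progress = false;
        for (int q = 0; q < Q_COUNT; q++) {
            for (size_t i = 0; i < n; i++) {
                if (items[i].queue != (enum queue_id)q || items[i].done)
                    continue;
                /* items[i] is the head of queue q; nothing behind it may run. */
                if (items[i].waits_for < 0 || items[items[i].waits_for].done) {
                    items[i].done = true;
                    progress = true;
                }
                break;
            }
        }
    }
    for (size_t i = 0; i < n; i++)
        if (!items[i].done)
            return false;
    return true;
}

/* Item 0: usermodehelper work on events/0 that blocks on an NFS operation.
 * Item 1: the NFS reconnect, submitted to reconnect_queue. */
static bool scenario_completes(enum queue_id reconnect_queue)
{
    struct work_item items[2] = {
        { Q_EVENTS, 1, false },
        { reconnect_queue, -1, false },
    };

    return run_queues(items, 2);
}
```

In this model, scenario_completes(Q_EVENTS) reports a stuck system, while scenario_completes(Q_RPCIOD) lets the reconnect run on its own queue and both items finish.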
     

01 May, 2007

3 commits

  • Introduce a replacement for the in-kernel portmapper client that supports
    all 3 versions of the rpcbind protocol. This code is not used yet.

    Original code by Groupe Bull updated for the latest kernel, with multiple
    bug fixes.

    Note that rpcb_clnt.c does not yet support registering via versions 3 and
    4 of the rpcbind protocol. That is planned for a later patch.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Currently rpc_malloc sets req->rq_buffer internally. Make this a more
    generic interface: return a pointer to the new buffer (or NULL) and
    make the caller set req->rq_buffer and req->rq_bufsize. This looks much
    more like kmalloc and eliminates the side effects.

    To fix a potential deadlock, this patch also replaces GFP_NOFS with
    GFP_NOWAIT in rpc_malloc. This prevents async RPCs from sleeping outside
    the RPC's task scheduler while allocating their buffer.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
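
The interface change in this commit can be illustrated with a small user-space sketch. It assumes malloc as a stand-in for the kernel allocator, and the names req_sketch, buf_alloc_sketch and req_attach_buffer are hypothetical; only the field names rq_buffer and rq_bufsize come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct req_sketch {
    void  *rq_buffer;
    size_t rq_bufsize;
};

/* Side-effect-free allocator: returns the new buffer or NULL, like malloc. */
static void *buf_alloc_sketch(size_t size)
{
    return malloc(size);
}

/* The caller, not the allocator, records the buffer in the request. */
static int req_attach_buffer(struct req_sketch *req, size_t size)
{
    void *buf;

    assert(req != NULL);
    buf = buf_alloc_sketch(size);
    if (buf == NULL)
        return -1;
    req->rq_buffer = buf;
    req->rq_bufsize = size;
    return 0;
}
```

Because the allocator no longer touches the request, it can be swapped or tested without constructing a request at all.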
     
  • The RPC buffer size estimation logic in net/sunrpc/clnt.c always
    significantly overestimates the requirements for the buffer size.
    A little instrumentation demonstrated that in fact rpc_malloc was never
    allocating the buffer from the mempool, but almost always called kmalloc.

    To compute the size of the RPC buffer more precisely, split p_bufsiz into
    two fields; one for the argument size, and one for the result size.

    Then, compute the sum of the exact call and reply header sizes, and split
    the RPC buffer precisely between the two. That should keep almost all RPC
    buffers within the 2KiB buffer mempool limit.

    And, we can finally be rid of RPC_SLACK_SPACE!

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
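
The sizing arithmetic described in this commit might look roughly like the sketch below. It assumes sizes are tracked in 32-bit XDR words and that the header lengths are passed in by the caller; the struct and function names are hypothetical, and only the 2KiB pool limit is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BUF_BYTES 2048u /* the 2KiB mempool limit from the commit message */

struct proc_sizes {
    size_t arg_words;   /* encoded argument size, in 32-bit XDR words */
    size_t reply_words; /* encoded result size, in 32-bit XDR words */
};

/* Bytes needed for the call: call header plus arguments. */
static size_t call_bytes(size_t call_hdr_words, const struct proc_sizes *p)
{
    assert(p != NULL);
    return (call_hdr_words + p->arg_words) * 4;
}

/* Bytes needed for the reply: reply header plus results. */
static size_t reply_bytes(size_t reply_hdr_words, const struct proc_sizes *p)
{
    assert(p != NULL);
    return (reply_hdr_words + p->reply_words) * 4;
}

/* True if one buffer holding both halves fits in a pool-sized buffer. */
static bool fits_in_pool(size_t call_hdr_words, size_t reply_hdr_words,
                         const struct proc_sizes *p)
{
    return call_bytes(call_hdr_words, p) +
           reply_bytes(reply_hdr_words, p) <= POOL_BUF_BYTES;
}
```

With separate argument and result sizes there is no need for a single padded estimate, which is the role RPC_SLACK_SPACE used to play.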
     

21 Apr, 2007

1 commit

  • Fix a regression due to the patch "NFS: disconnect before retrying NFSv4
    requests over TCP"

    The assumption made in xprt_transmit() that the condition
    "req->rq_bytes_sent == 0 and request is on the receive list"
    should imply that we're dealing with a retransmission is false.
    Firstly, it may simply happen that the socket send queue was full
    at the time the request was initially sent through xprt_transmit().
    Secondly, doing this for each request that was retransmitted implies
    that we disconnect and reconnect for _every_ request that happened to
    be retransmitted irrespective of whether or not a disconnection has
    already occurred.

    Fix is to move this logic into the call_status request timeout handler.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

13 Feb, 2007

2 commits

  • Conflicts:

    net/sunrpc/auth_gss/gss_krb5_crypto.c
    net/sunrpc/auth_gss/gss_spkm3_token.c
    net/sunrpc/clnt.c

    Merge with mainline and fix conflicts.

    Trond Myklebust
     
  • RFC3530 section 3.1.1 states an NFSv4 client MUST NOT send a request
    twice on the same connection unless it is the NULL procedure. Section
    3.1.1 suggests that the client should disconnect and reconnect if it
    wants to retry a request.

    Implement this by adding an rpc_clnt flag that a ULP can use to
    specify that the underlying transport should be disconnected on a
    major timeout. The NFSv4 client asserts this new flag, and requests
    no retries after a minor retransmit timeout.

    Note that disconnecting on a retransmit is in general not safe to do
    if the RPC client does not reuse the TCP port number when reconnecting.

    See http://bugzilla.linux-nfs.org/show_bug.cgi?id=6

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

11 Feb, 2007

1 commit


04 Feb, 2007

1 commit


08 Dec, 2006

1 commit


06 Dec, 2006

2 commits


22 Nov, 2006

1 commit

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This is a
    problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).

    Signed-Off-By: David Howells

    David Howells
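
The container_of() pattern mentioned in this commit can be shown with a user-space sketch. The macro below is a simplified stand-in built on offsetof, and work_sketch, my_device, my_work_fn and run_work are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel macro: recover the enclosing struct
 * from a pointer to one of its members. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_sketch {
    void (*func)(struct work_sketch *work);
};

struct my_device {
    int counter;
    struct work_sketch work; /* embedded work item */
};

/* The work function receives the work item itself, not separate context
 * data, and derives its context from the enclosing structure. */
static void my_work_fn(struct work_sketch *work)
{
    struct my_device *dev;

    assert(work != NULL);
    dev = container_of_sketch(work, struct my_device, work);
    dev->counter++;
}

/* Stand-in for the queue executor: call the function with the item. */
static void run_work(struct work_sketch *work)
{
    work->func(work);
}
```

Initialising a my_device with work.func set to my_work_fn and calling run_work(&dev.work) increments dev.counter without any context pointer being stored in the work item.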
     

29 Sep, 2006

1 commit


23 Sep, 2006

7 commits

  • In a subsequent patch, this will allow the portmapper to take a reference
    to the rpc_xprt for which it is updating the port number, fixing an Oops.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • - Ensure that the task aborts the RPC call only when it has actually timed out.
    - Ensure that req->rq_majortimeo is initialised correctly.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • The two function call API for creating a new RPC client is now obsolete.
    Remove it.

    Also, remove an unnecessary check to see whether the caller is capable of
    using privileged network services. The kernel RPC client always uses a
    privileged ephemeral port by default; callers are responsible for checking
    the authority of users to make use of any RPC service, or for specifying
    that a nonprivileged port is acceptable.

    Test plan:
    Repeated runs of Connectathon locking suite. Check network trace to ensure
    correctness of NLM requests and replies.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Prepare for more generic transport endpoint handling needed by transports
    that might use different forms of addressing, such as IPv6.

    Introduce a single function call to replace the two-call
    xprt_create_proto/rpc_create_client API. Define a new rpc_create_args
    structure that allows callers to pass in remote endpoint addresses of
    varying length.

    Test-plan:
    Compile kernel with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • IPv6 addresses are big (128 bits). Now that all RPC client consumers treat
    the addr field in rpc_xprt structs as opaque, and access it only via the
    API calls, we can safely widen the field in the rpc_xprt struct to
    accommodate larger addresses.

    Test plan:
    Compile kernel with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Move connection and bind state that was maintained in the rpc_clnt
    structure to the rpc_xprt structure. This will allow the creation of
    a clean API for plugging in different types of bind mechanisms.

    This brings improvements such as the elimination of a single spin lock to
    control serialization for all in-kernel RPC binding. A set of per-xprt
    bitops is used to serialize tasks during RPC binding, just like it now
    works for making RPC transport connections.

    Test-plan:
    Destructive testing (unplugging the network temporarily). Connectathon
    with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
    Probably need to rig a server where certain services aren't running, or
    that returns an error for some typical operation.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Hide the contents and format of xprt->addr by eliminating direct uses
    of the xprt->addr.sin_port field. This change is required to support
    alternate RPC host address formats (e.g. IPv6).

    Test-plan:
    Destructive testing (unplugging the network temporarily). Repeated runs of
    Connectathon locking suite with UDP and TCP.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
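
Hiding the address format behind accessors, as the commits in this group do, could be sketched in user space as follows. This assumes POSIX socket headers and a sockaddr_storage field; xprt_sketch and its three helper functions are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

struct xprt_sketch {
    struct sockaddr_storage addr; /* wide enough for IPv4 or IPv6 */
    socklen_t addrlen;
};

static void xprt_sketch_init(struct xprt_sketch *x, sa_family_t family)
{
    assert(x != NULL);
    memset(x, 0, sizeof(*x));
    x->addr.ss_family = family;
    x->addrlen = (family == AF_INET6) ? sizeof(struct sockaddr_in6)
                                      : sizeof(struct sockaddr_in);
}

/* Callers set the port through this helper instead of touching sin_port. */
static void xprt_sketch_set_port(struct xprt_sketch *x, uint16_t port)
{
    assert(x != NULL);
    switch (x->addr.ss_family) {
    case AF_INET:
        ((struct sockaddr_in *)&x->addr)->sin_port = htons(port);
        break;
    case AF_INET6:
        ((struct sockaddr_in6 *)&x->addr)->sin6_port = htons(port);
        break;
    default:
        break;
    }
}

/* Returns the port in host byte order, or 0 for an unknown family. */
static uint16_t xprt_sketch_get_port(const struct xprt_sketch *x)
{
    assert(x != NULL);
    switch (x->addr.ss_family) {
    case AF_INET:
        return ntohs(((const struct sockaddr_in *)&x->addr)->sin_port);
    case AF_INET6:
        return ntohs(((const struct sockaddr_in6 *)&x->addr)->sin6_port);
    default:
        return 0;
    }
}
```

Since callers never name sin_port directly, adding another address family only touches the accessors.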
     

04 Aug, 2006

1 commit


22 Jul, 2006

1 commit


09 Jun, 2006

1 commit

  • The XID generator uses get_random_bytes to generate an initial XID.
    NFS_ROOT starts up before the random driver, though, so get_random_bytes
    doesn't set a random XID for NFS_ROOT. This causes NFS_ROOT mount points
    to reuse XIDs every time the client is booted. If the client boots often
    enough, the server will start serving old replies out of its DRC.

    Use net_random() instead.

    Test plan:
    I/O intensive workloads should perform well and generate no errors. Traces
    taken during client reboots should show that NFS_ROOT mounts use unique
    XIDs after every reboot.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

21 Mar, 2006

5 commits

  • We need to ensure that all writes to the XDR buffers are done before
    req->rq_received is visible to other processors.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
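
The ordering requirement in this commit has a user-space analogue in C11 release/acquire atomics. The sketch below is an assumption-laden illustration, not the kernel's barrier code: reply_slot and its helpers are hypothetical, and the received field only mirrors the role of req->rq_received.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

struct reply_slot {
    char data[64];
    atomic_size_t received; /* 0 until the reply is complete */
};

static void reply_slot_init(struct reply_slot *s)
{
    assert(s != NULL);
    memset(s->data, 0, sizeof(s->data));
    atomic_init(&s->received, 0);
}

/* Writer: fill the buffer first, then publish its length. */
static void publish_reply(struct reply_slot *s, const char *buf, size_t len)
{
    assert(len > 0 && len <= sizeof(s->data));
    memcpy(s->data, buf, len);
    /* Release: the copy above becomes visible before received is non-zero. */
    atomic_store_explicit(&s->received, len, memory_order_release);
}

/* Reader: only touch the buffer after observing a non-zero length. */
static size_t consume_reply(struct reply_slot *s, char *out, size_t outlen)
{
    size_t len = atomic_load_explicit(&s->received, memory_order_acquire);

    if (len == 0 || len > outlen)
        return 0;
    memcpy(out, s->data, len);
    return len;
}
```

A reader that sees a non-zero length through the acquire load is guaranteed to see the buffer contents written before the matching release store.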
     
  • RPC_DEBUG_DATA no longer needed in net/sunrpc/xprt.c.

    Test plan:
    Compile kernel with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Add a simple mechanism for collecting stats in the RPC client. Stats are
    tabulated during xprt_release. Note that per_cpu shenanigans are not
    required here because the RPC client already serializes on the transport
    write lock.

    Test plan:
    Compile kernel with CONFIG_NFS enabled. Basic performance regression
    testing with high-speed networking and high performance server.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Account for various things that occur while an RPC task is executed.
    Separate timers for RPC round trip and RPC execution time show how
    long RPC requests wait in queue before being sent. Eventually these
    will be accumulated at xprt_release time in one place where they can
    be viewed from userland.

    Test plan:
    Compile kernel with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Monitor generic transport events. Add a transport switch callout to
    format transport counters for export to user-land.

    Test plan:
    Compile kernel with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

07 Jan, 2006

3 commits

  • We ought never to be calling xprt_destroy() if there are still active
    rpc_tasks. Optimise away the broken code that attempts to "fix" that case.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If the server decides to close the RPC socket, we currently don't actually
    respond until either another RPC call is scheduled, or until xprt_autoclose()
    gets called by the socket expiry timer (which may be up to 5 minutes
    later).

    This patch ensures that xprt_autoclose() is called much sooner if the
    server closes the socket.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Add RPC client transport switch support for replacing buffer management
    on a per-transport basis.

    In the current IPv4 socket transport implementation, RPC buffers are
    allocated as needed for each RPC message that is sent. Some transport
    implementations may choose to use pre-allocated buffers for encoding,
    sending, receiving, and unmarshalling RPC messages, however. For
    transports capable of direct data placement, the buffers can be carved
    out of a pre-registered area of memory rather than from a slab cache.

    Test-plan:
    Millions of fsx operations. Performance characterization with "sio" and
    "iozone". Use oprofile and other tools to look for significant regression
    in CPU utilization.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
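
Per-transport buffer management of the kind this commit describes could be sketched with a small operations table. This is a user-space illustration under stated assumptions: xprt_ops_sketch, the heap and arena callbacks, and the 4096-byte region are all hypothetical, with the arena standing in for a pre-registered memory area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct xprt_ops_sketch {
    void *(*buf_alloc)(void *priv, size_t size);
    void  (*buf_free)(void *priv, void *buf);
};

/* Transport A: allocate each buffer on demand. */
static void *heap_alloc(void *priv, size_t size)
{
    (void)priv;
    return malloc(size);
}

static void heap_free(void *priv, void *buf)
{
    (void)priv;
    free(buf);
}

/* Transport B: carve buffers out of a region set aside in advance. */
struct arena {
    unsigned char mem[4096];
    size_t used;
};

static void *arena_alloc(void *priv, size_t size)
{
    struct arena *a = priv;
    size_t aligned = (size + 7) & ~(size_t)7; /* keep 8-byte alignment */
    void *buf;

    assert(a != NULL);
    if (aligned < size || aligned > sizeof(a->mem) - a->used)
        return NULL;
    buf = a->mem + a->used;
    a->used += aligned;
    return buf;
}

static void arena_free(void *priv, void *buf)
{
    /* Bump arena: individual buffers are not returned; the region is
     * reclaimed as a whole. */
    (void)priv;
    (void)buf;
}

static const struct xprt_ops_sketch heap_ops  = { heap_alloc, heap_free };
static const struct xprt_ops_sketch arena_ops = { arena_alloc, arena_free };
```

Generic code calls ops->buf_alloc and ops->buf_free and does not need to know which strategy the transport chose.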
     

19 Oct, 2005

2 commits


24 Sep, 2005

5 commits

  • Each transport implementation can now set unique bind, connect,
    reestablishment, and idle timeout values. These are variables,
    allowing the values to be modified dynamically. This permits
    exponential backoff of any of these values, for instance.

    As an example, we implement exponential backoff for the connection
    reestablishment timeout.

    Test-plan:
    Destructive testing (unplugging the network temporarily). Connectathon
    with UDP and TCP.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
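
Exponential backoff of a reestablishment timeout, as mentioned in this commit, can be sketched as doubling up to a ceiling. The initial and maximum values below are assumptions picked for illustration, and the function names are hypothetical.

```c
#include <assert.h>

#define REEST_TO_INIT 3u   /* hypothetical initial timeout, in seconds */
#define REEST_TO_MAX  300u /* hypothetical ceiling, in seconds */

/* Double the timeout after a failed attempt, capped at the ceiling. */
static unsigned next_reestablish_timeout(unsigned cur)
{
    assert(cur > 0);
    if (cur > REEST_TO_MAX / 2)
        return REEST_TO_MAX;
    return cur * 2;
}

/* After a successful connection the timeout returns to its initial value. */
static unsigned reset_reestablish_timeout(void)
{
    return REEST_TO_INIT;
}
```

Starting from 3 seconds, repeated failures give 6, 12, 24, 48, 96, 192 and then stay at 300.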
     
  • Clean-up: Move some macros that are specific to the Van Jacobson
    implementation into xprt.c. Get rid of the cong_wait field in
    rpc_xprt, which is no longer used. Get rid of xprt_clear_backlog.

    Test-plan:
    Compile with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • The final place where congestion control state is adjusted is in
    xprt_release, where each request is finally released. Add a callout
    there to allow transports to perform additional processing when a
    request is about to be released.

    Test-plan:
    Use WAN simulation to cause sporadic bursty packet loss. Look for significant
    regression in performance or client stability.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • A new interface that allows transports to adjust their congestion window
    using the Van Jacobson implementation in xprt.c is provided.

    Test-plan:
    Use WAN simulation to cause sporadic bursty packet loss. Look for
    significant regression in performance or client stability.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Allow transports to hook the retransmit timer interrupt. Some transports
    calculate their congestion window here so that a retransmit timeout has
    immediate effect on the congestion window.

    Test-plan:
    Use WAN simulation to cause sporadic bursty packet loss. Look for significant
    regression in performance or client stability.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever