27 Oct, 2015
1 commit
-
commit c91aed9896946721bb30705ea2904edb3725dd61 upstream.
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
were not taking into account the initial page_offset when determining
the rdma read length. This resulted in a read whose starting address
and length exceeded the base/bounds of the frmr.

The server gets an async error from the rdma device and kills the
connection, and the client then reconnects and resends. This repeats
indefinitely, and the application hangs.

Most workloads don't tickle this bug, apparently, but one test hit it
every time: building the linux kernel on a 16-core node with 'make -j
16 O=/mnt/0', where /mnt/0 is a ramdisk mounted via NFSRDMA.

This bug seems to be tripped only with devices having small fastreg page
list depths. I didn't see it with mlx4, for instance. (See the sketch
following this entry.)

Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
Signed-off-by: Steve Wise
Tested-by: Chuck Lever
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman
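
A minimal, compilable sketch of the length calculation described above.
The function and its signature are illustrative (the real logic lives in
the rdma_read_chunk_* helpers named above); the point is that the initial
page_offset must be subtracted from the capacity of the mapped pages:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Bytes that may be read into 'pages' pages when the data starts
 * 'page_offset' bytes into the first page. */
static unsigned int clamp_read_len(unsigned int pages,
                                   unsigned int page_offset,
                                   unsigned int rs_length)
{
    unsigned int capacity = (pages << PAGE_SHIFT) - page_offset;

    /* Without subtracting page_offset, the read could run past the
     * end of the registered region -- the bug described above. */
    return rs_length < capacity ? rs_length : capacity;
}

int main(void)
{
    /* Two pages, payload starts 100 bytes into the first page. */
    printf("%u\n", clamp_read_len(2, 100, 16384));  /* prints 8092 */
    return 0;
}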
23 Oct, 2015
1 commit
-
commit 9d11b51ce7c150a69e761e30518f294fc73d55ff upstream.
The Linux NFS server returns garbage in the data payload of inline
NFS/RDMA READ replies. These are READs of under 1000 bytes or so
where the client has not provided either a reply chunk or a write
list.

The NFS server delivers the data payload for an NFS READ reply to
the transport in an xdr_buf page list. If the NFS client did not
provide a reply chunk or a write list, send_reply() is supposed to
set up a separate sge for the page containing the READ data, and
another sge for XDR padding if needed, then post all of the sges via
a single SEND Work Request.

The problem is that send_reply() does not advance through the xdr_buf
when setting up scatter/gather entries for the SEND WR. It always calls
dma_map_xdr() with xdr_off set to zero. When there's more than one
sge, dma_map_xdr() sets up the SEND sges so they all point to the
xdr_buf's head. (See the sketch following this entry.)

The current Linux NFS/RDMA client always provides a reply chunk or
a write list when performing an NFS READ over RDMA. Therefore, it
does not exercise this particular case. The Linux server has never
had to use more than one extra sge for building RPC/RDMA replies
with a Linux client.

However, an NFS/RDMA client _is_ allowed to send small NFS READs
without setting up a write list or reply chunk. The NFS READ reply
fits entirely within the inline reply buffer in this case. This is
perhaps a more efficient way of performing NFS READs that the Linux
NFS/RDMA client may some day adopt.

Fixes: b432e6b3d9c1 ('svcrdma: Change DMA mapping logic to . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285
Signed-off-by: Chuck Lever
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman
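
A compilable sketch of the corrected behavior described above. The struct
and helper here are simplified stand-ins (not the kernel's xdr_buf or
send_reply()); the point is that each sge gets a running xdr_off instead
of zero:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct xdr_buf. */
struct xdr_buf {
    size_t head_len;   /* RPC reply header */
    size_t page_len;   /* READ payload carried in the page list */
    size_t tail_len;   /* XDR padding */
};

/* Map each sge at a running offset into the xdr_buf rather than
 * calling the mapping helper with xdr_off = 0 every time, which made
 * every sge point at the head. */
static void build_send_sges(const struct xdr_buf *xdr)
{
    size_t xdr_off = 0;

    printf("sge[0]: xdr_off=%zu len=%zu (head)\n", xdr_off, xdr->head_len);
    xdr_off += xdr->head_len;

    if (xdr->page_len) {
        printf("sge[1]: xdr_off=%zu len=%zu (pages)\n", xdr_off, xdr->page_len);
        xdr_off += xdr->page_len;
    }
    if (xdr->tail_len)
        printf("sge[2]: xdr_off=%zu len=%zu (tail)\n", xdr_off, xdr->tail_len);
}

int main(void)
{
    struct xdr_buf reply = { .head_len = 120, .page_len = 512, .tail_len = 0 };

    build_send_sges(&reply);
    return 0;
}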
27 Apr, 2015
1 commit
-
Pull NFS client updates from Trond Myklebust:
"Another set of mainly bugfixes and a couple of cleanups. No new
functionality in this round.Highlights include:
Stable patches:
- Fix a regression in /proc/self/mountstats
- Fix the pNFS flexfiles O_DIRECT support
- Fix high load average due to callback thread sleeping

Bugfixes:
- Various patches to fix the pNFS layoutcommit support
- Do not cache pNFS deviceids unless server notifications are enabled
- Fix a SUNRPC transport reconnection regression
- make debugfs file creation failure non-fatal in SUNRPC
- Another fix for circular directory warnings on NFSv4 "junctioned"
mountpoints
- Fix locking around NFSv4.2 fallocate() support
- Truncating NFSv4 file opens should also sync O_DIRECT writes
- Prevent infinite loop in rpcrdma_ep_create()

Features:
- Various improvements to the RDMA transport code's handling of
memory registration
- Various code cleanups"

* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
fs/nfs: fix new compiler warning about boolean in switch
nfs: Remove unneeded casts in nfs
NFS: Don't attempt to decode missing directory entries
Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
NFS: Rename idmap.c to nfs4idmap.c
NFS: Move nfs_idmap.h into fs/nfs/
NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
NFS: Add a stub for GETDEVICELIST
nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
nfs: fix DIO good bytes calculation
nfs: Fetch MOUNTED_ON_FILEID when updating an inode
sunrpc: make debugfs file creation failure non-fatal
nfs: fix high load average due to callback thread sleeping
NFS: Reduce time spent holding the i_mutex during fallocate()
NFS: Don't zap caches on fallocate()
xprtrdma: Make rpcrdma_{un}map_one() into inline functions
xprtrdma: Handle non-SEND completions via a callout
xprtrdma: Add "open" memreg op
xprtrdma: Add "destroy MRs" memreg op
xprtrdma: Add "reset MRs" memreg op
...
15 Apr, 2015
1 commit
-
Pull trivial tree from Jiri Kosina:
"Usual trivial tree updates. Nothing outstanding -- mostly printk()
and comment fixes and unused identifier removals"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
goldfish: goldfish_tty_probe() is not using 'i' any more
powerpc: Fix comment in smu.h
qla2xxx: Fix printks in ql_log message
lib: correct link to the original source for div64_u64
si2168, tda10071, m88ds3103: Fix firmware wording
usb: storage: Fix printk in isd200_log_config()
qla2xxx: Fix printk in qla25xx_setup_mode
init/main: fix reset_device comment
ipwireless: missing assignment
goldfish: remove unreachable line of code
coredump: Fix do_coredump() comment
stacktrace.h: remove duplicate declaration task_struct
smpboot.h: Remove unused function prototype
treewide: Fix typo in printk messages
treewide: Fix typo in printk messages
mod_devicetable: fix comment for match_flags
31 Mar, 2015
14 commits
-
These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.

Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
Memory Region objects associated with a transport instance are
destroyed before the instance is shut down and destroyed.

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are no-ops for that mode.

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
There is very little common processing among the different external
memory deregistration functions.

Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.

Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPC_MAX_DATA_SEGS and the number of
data segments that can be transmitted in an inline buffer.

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Start by defining a vector of operations for each supported memory
registration mode, and by adding a source file for each mode. (See
the sketch following this entry.)

Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
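
A small, compilable sketch of the ops-vector idea from the "vector of
operations" entry above. The struct layout, member names, and handlers
are assumptions made for illustration, not the exact upstream
rpcrdma_memreg_ops definition:

#include <stdio.h>

struct memreg_ops {
    const char *ro_displayname;
    int  (*ro_open)(void);       /* size transport data structures */
    int  (*ro_init)(void);       /* create the MR pool */
    void (*ro_destroy)(void);    /* destroy MRs before shutdown */
    void (*ro_reset)(void);      /* reset MRs before reconnect */
};

static int  frwr_open(void)    { puts("frwr: open");        return 0; }
static int  frwr_init(void)    { puts("frwr: init MRs");    return 0; }
static void frwr_destroy(void) { puts("frwr: destroy MRs"); }
static void frwr_reset(void)   { puts("frwr: reset MRs"); }

static const struct memreg_ops frwr_ops = {
    .ro_displayname = "frwr",
    .ro_open        = frwr_open,
    .ro_init        = frwr_init,
    .ro_destroy     = frwr_destroy,
    .ro_reset       = frwr_reset,
};

int main(void)
{
    /* Callers dispatch through the vector instead of switch()ing on
     * a memory-registration mode constant. */
    const struct memreg_ops *ops = &frwr_ops;

    ops->ro_open();
    ops->ro_init();
    ops->ro_reset();
    ops->ro_destroy();
    return 0;
}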
If a provider advertises a zero max_fast_reg_page_list_len, FRWR
depth detection loops forever. Instead of just failing the mount,
try other memory registration modes.

Fixes: 0fc6c4e7bb28 ("xprtrdma: mind the device's max fast . . .")
Reported-by: Devesh Sharma
Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
The RPC/RDMA transport's FRWR registration logic registers whole
pages. This means areas in the first and last pages that are not
involved in the RDMA I/O are needlessly exposed to the server.

Buffered I/O is typically page-aligned, so not a problem there. But
for direct I/O, which can be byte-aligned, and for reply chunks,
which are nearly always smaller than a page, the transport could
expose memory outside the I/O buffer.

FRWR allows byte-aligned memory registration, so let's use it as
it was intended.

Reported-by: Sagi Grimberg
Signed-off-by: Chuck Lever
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
Commit 6ab59945f292 ("xprtrdma: Update rkeys after transport
reconnect") added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f292, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f292, while also performing pullup
during retransmits.

Signed-off-by: Chuck Lever
Acked-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
-
Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Tested-by: Devesh Sharma
Tested-by: Meghana Cheripady
Tested-by: Veeresh U. Kokatnur
Signed-off-by: Anna Schumaker
07 Mar, 2015
1 commit
-
This patch fixes spelling typos in printk messages.
Signed-off-by: Masanari Iida
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina
24 Feb, 2015
2 commits
-
NFS: RDMA Client Sparse Fix #2
This patch fixes another sparse fix found by Dan Carpenter's tool.
Signed-off-by: Anna Schumaker
* tag 'nfs-rdma-for-4.0-3' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
xprtrdma: Store RDMA credits in unsigned variables
-
Dan Carpenter's static checker pointed out:
net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
warn: can 'credits' be negative?

"credits" is defined as an int. The credits value comes from the
server as a 32-bit unsigned integer.

A malicious or broken server can plant a large unsigned integer in
that field which would result in an underflow in the following
logic, potentially triggering a deadlock of the mount point by
blocking the client from issuing more RPC requests.

net/sunrpc/xprtrdma/rpc_rdma.c:
876 credits = be32_to_cpu(headerp->rm_credit);
877 if (credits == 0)
878 credits = 1; /* don't deadlock */
879 else if (credits > r_xprt->rx_buf.rb_max_requests)
880 credits = r_xprt->rx_buf.rb_max_requests;
881
882 cwnd = xprt->cwnd;
883 xprt->cwnd = credits << RPC_CWNDSHIFT;
884 if (xprt->cwnd > cwnd)
885 xprt_release_rqst_cong(rqst->rq_task);

Reported-by: Dan Carpenter
Fixes: eba8ff660b2d ("xprtrdma: Move credit update to RPC . . .")
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker
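
A compilable sketch of the fix described above, with ntohl() standing in
for be32_to_cpu() and a made-up local maximum; the point is that the
server-supplied value is handled as unsigned so the clamp always fires:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>    /* ntohl()/htonl() stand in for be32_to_cpu() */

#define RB_MAX_REQUESTS 32u

static uint32_t clamp_credits(uint32_t wire_credits)
{
    uint32_t credits = ntohl(wire_credits);

    if (credits == 0)
        credits = 1;                 /* don't deadlock */
    else if (credits > RB_MAX_REQUESTS)
        credits = RB_MAX_REQUESTS;
    return credits;
}

int main(void)
{
    /* A broken or malicious server sends 0xffffffff credits. Held in
     * a signed int, that compares as negative and the "> max" clamp
     * never fires; held in a u32 it clamps to the local maximum. */
    printf("%u\n", clamp_credits(htonl(0xffffffffu)));   /* prints 32 */
    return 0;
}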
13 Feb, 2015
1 commit
-
Pull nfsd updates from Bruce Fields:
"The main change is the pNFS block server support from Christoph, which
allows an NFS client connected to shared disk to do block IO to the
shared disk in place of NFS reads and writes. This also requires xfs
patches, which should arrive soon through the xfs tree, barring
unexpected problems. Support for other filesystems is also possible
if there's interest.

Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
shape"

* 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd: default NFSv4.2 to on
nfsd: pNFS block layout driver
exportfs: add methods for block layout exports
nfsd: add trace events
nfsd: update documentation for pNFS support
nfsd: implement pNFS layout recalls
nfsd: implement pNFS operations
nfsd: make find_any_file available outside nfs4state.c
nfsd: make find/get/put file available outside nfs4state.c
nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
nfsd: add fh_fsid_match helper
nfsd: move nfsd_fh_match to nfsfh.h
fs: add FL_LAYOUT lease type
fs: track fl_owner for leases
nfs: add LAYOUT_TYPE_MAX enum value
nfsd: factor out a helper to decode nfstime4 values
sunrpc/lockd: fix references to the BKL
nfsd: fix year-2038 nfs4 state problem
svcrdma: Handle additional inline content
svcrdma: Move read list XDR round-up logic
...
06 Feb, 2015
1 commit
-
With "make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__":
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: warning: incorrect
type in initializer (different base types)
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: expected restricted
__be32 [usertype] *buffer
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: got unsigned int
[usertype] *rq_buffer

As far as I can tell this is a false positive.
Reported-by: kbuild-all@01.org
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker
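
For background, a small sketch of the endianness rule sparse enforces
here. The kernel's __be32 is stubbed as a plain integer; under sparse it
is a distinct "bitwise" type, so mixing it with cpu-order integers is
flagged even when, as in this report, the code is actually harmless:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

typedef uint32_t be32;               /* stand-in for __be32 */

static uint32_t read_wire_word(const be32 *p)
{
    return ntohl(*p);                /* kernel code uses be32_to_cpu() */
}

int main(void)
{
    be32 wire = htonl(42);           /* value in network byte order */

    printf("%u\n", read_wire_word(&wire));   /* prints 42 */
    return 0;
}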
31 Jan, 2015
1 commit
-
Reflect the more conservative approach used in the socket transport's
version of this transport method. An RPC buffer allocation should
avoid forcing not just FS activity, but any I/O.

In particular, two recent changes missed updating xprtrdma:
- Commit c6c8fe79a83e ("net, sunrpc: suppress allocation warning ...")
- Commit a564b8f03986 ("nfs: enable swap on NFS")

Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker
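
A rough sketch of the allocation-flag distinction involved. The flag
bits below are stubs for illustration; real values come from
<linux/gfp.h>, and the exact flags xprtrdma adopted are not reproduced
here:

#include <stdio.h>

#define __GFP_IO       0x1   /* allocator may start I/O */
#define __GFP_FS       0x2   /* allocator may recurse into the FS */
#define __GFP_NOWARN   0x4   /* no warning on allocation failure */

#define GFP_NOFS       (__GFP_IO)   /* I/O allowed, no FS recursion */
#define GFP_NOIO       (0)          /* neither I/O nor FS recursion */

int main(void)
{
    /* Less conservative choice: only FS recursion is blocked. */
    unsigned int old_flags = GFP_NOFS;

    /* More conservative choice mirroring the socket transport: an RPC
     * buffer allocation must not force any I/O, and failures are
     * retried quietly rather than warned about. */
    unsigned int new_flags = GFP_NOIO | __GFP_NOWARN;

    printf("old=%#x new=%#x\n", old_flags, new_flags);
    return 0;
}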
30 Jan, 2015
16 commits
-
rpcrdma_{de}register_internal() are used only in verbs.c now.
MAX_RPCRDMAHDR is no longer used and can be removed.
Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.

This is for consistency with the other uses of internally
registered memory.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
The rr_base field is currently the buffer where RPC replies land.
An RPC/RDMA reply header lands in this buffer. In some cases an RPC
reply header also lands in this buffer, just after the RPC/RDMA
header.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass between server and client. The sum of the
RPC/RDMA reply header size and the RPC reply header size must be
less than this threshold.

The largest RDMA RECV that the client should have to handle is the
size of the inline threshold. The receive buffer should thus be the
size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.

RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. Ie,
such replies are not involved in any way with rr_base.

Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker
-
The rl_base field is currently the buffer where each RPC/RDMA call
header is built.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass between client and server. The sum of the
RPC/RDMA header size and the RPC header size must be less than or
equal to this threshold.

Increasing the r/wsize maximum will require MAX_SEGS to grow
significantly, but the inline threshold size won't change (both
sides agree on it). The server's inline threshold doesn't change.

Since an RPC/RDMA header can never be larger than the inline
threshold, make all RPC/RDMA header buffers the size of the
inline threshold.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.

A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call and reply headers.

For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.
(See the container_of sketch following this entry.)

That means that hardway currently has to replace a whole rpcrdma_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array
and the fact that both the send and receive buffers are part of
struct rpcrdma_req.

Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced rpcrdma_req structs, and
restore the original.

This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).

This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().

xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().

A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
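
A compilable illustration of the container_of/offsetof trick described
in the "hardway" entry above (the pre-patch arrangement), using a toy
struct rather than the real rpcrdma_req layout:

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_req {
    int  rl_reply_ready;
    char rl_sendbuf[128];   /* embedded RPC send buffer */
};

/* A free routine handed nothing but the send buffer address can still
 * recover the enclosing request structure. */
static void toy_free(void *buffer)
{
    struct toy_req *req = container_of(buffer, struct toy_req, rl_sendbuf);

    printf("freeing request at %p (buffer %p)\n", (void *)req, buffer);
}

int main(void)
{
    static struct toy_req req;

    toy_free(req.rl_sendbuf);
    return 0;
}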
There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.

Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields. (A sketch of this kind of helper follows this entry.)

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
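
A rough sketch of the kind of helper described in the entry above:
allocate a buffer and keep its (here faked) registration details
alongside it. The struct and function names are illustrative, not the
upstream rpcrdma_regbuf API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct toy_regbuf {
    uint64_t addr;     /* would be the DMA-mapped address */
    uint32_t length;
    uint32_t lkey;     /* would come from the protection domain */
    char     data[];   /* the allocated payload follows */
};

static struct toy_regbuf *regbuf_alloc(size_t size)
{
    struct toy_regbuf *rb = malloc(sizeof(*rb) + size);

    if (!rb)
        return NULL;
    rb->length = (uint32_t)size;
    rb->addr = (uintptr_t)rb->data;   /* fake "registration" */
    rb->lkey = 0;
    memset(rb->data, 0, size);
    return rb;
}

int main(void)
{
    struct toy_regbuf *rb = regbuf_alloc(1024);

    if (rb) {
        printf("buffer %p, length %u\n", (void *)rb->data,
               (unsigned)rb->length);
        free(rb);
    }
    return 0;
}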
Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Reduce stack footprint of the connection upcall handler function.
Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.

Fixes: bd7ed1d13304 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Clean up: The rep_func field always refers to rpcrdma_conn_func().
rep_func should have been removed by commit b45ccfd25d50 ("xprtrdma:
Remove MEMWINDOWS registration modes").

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Reduce work in the receive CQ handler, which can be run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.

This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header (the CPU stalls while
waiting for the header contents to be brought into the cache).

This further extends work begun by commit e7ce710a8802 ("xprtrdma:
Avoid deadlock when credit window is reset").

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Clean up: Since commit 0ac531c18323 ("xprtrdma: Remove REGISTER
memory registration mode"), the rl_mr pointer is no longer used
anywhere.

After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.

Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Clean up: This field is not used.
Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker
-
Clean up: Use consistent field names in struct rpcrdma_xprt.
Signed-off-by: Chuck Lever
Reviewed-by: Steve Wise
Signed-off-by: Anna Schumaker