01 Oct, 2020

5 commits

  • [ Upstream commit ea740bd5f58e2912e74f401fd01a9d6aa985ca05 ]

    Way back when I was writing the RPC/RDMA server-side backchannel
    code, I misread the TCP backchannel reply handler logic. When
    svc_tcp_recvfrom() successfully receives a backchannel reply, it
    does not return -EAGAIN. It sets XPT_DATA and returns zero.

    Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
    need to be set again: it is set whenever a new message is received,
    behind a spin lock in a single threaded context.

    Also, if handling the cb reply is not successful, the message is
    simply dropped. There's no special message framing to deal with as
    there is in the TCP case.

    Now that the handle_bc_reply() return value is ignored, I've removed
    the dprintk call sites in the error exit of handle_bc_reply() in
    favor of trace points in other areas that already report the error
    cases.

    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit 1fab7dc477241c12f977955aa6baea7938b6f08d ]

    Move the test for whether a task is already queued to prevent
    corruption of the timer list in __rpc_sleep_on_priority_timeout().

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit 1a33d8a284b1e85e03b8c7b1ea8fb985fccd1d71 ]

    Kernel memory leak detected:

    unreferenced object 0xffff888849cdf480 (size 8):
    comm "kworker/u8:3", pid 2086, jiffies 4297898756 (age 4269.856s)
    hex dump (first 8 bytes):
    30 00 cd 49 88 88 ff ff 0..I....
    backtrace:
    [] __kmalloc_track_caller+0x137/0x183
    [] kstrdup+0x2b/0x43
    [] xprt_rdma_format_addresses+0x114/0x17d [rpcrdma]
    [] xprt_setup_rdma_bc+0xc0/0x10c [rpcrdma]
    [] xprt_create_transport+0x3f/0x1a0 [sunrpc]
    [] rpc_create+0x118/0x1cd [sunrpc]
    [] setup_callback_client+0x1a5/0x27d [nfsd]
    [] nfsd4_process_cb_update.isra.7+0x16c/0x1ac [nfsd]
    [] nfsd4_run_cb_work+0x4c/0xbd [nfsd]
    [] process_one_work+0x1b2/0x2fe
    [] worker_thread+0x1a6/0x25a
    [] kthread+0xf6/0xfb
    [] ret_from_fork+0x3a/0x50

    Introduce a call to xprt_rdma_free_addresses() similar to the way
    that the TCP backchannel releases a transport's peer address
    strings.

    Fixes: 5d252f90a800 ("svcrdma: Add class for RDMA backwards direction transport")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit b25b60d7bfb02a74bc3c2d998e09aab159df8059 ]

    'maxlen' is the total size of the destination buffer. There is only one
    caller and this value is 256.

    When we compute the size already used and what we would like to add in
    the buffer, the trailing NUL character is not taken into account.
    However, this trailing character will be added by the 'strcat' once we
    have checked that we have enough space.

    So, there is an off-by-one issue and 1 byte of the stack could be
    erroneously overwritten.

    Take into account the trailing NUL when checking if there is enough
    space in the destination buffer.

    While at it, also replace a 'sprintf' by a safer 'snprintf', check for
    output truncation and avoid a superfluous 'strlen'.
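
    As a rough illustration of the check described above (not the kernel
    code itself), the sketch below appends a name to a buffer only when the
    existing text, the new text, and the terminating NUL all fit. The
    function name append_name(), the "%s\n" format, and the 32-byte scratch
    buffer are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append one name plus a newline to buf, whose total size is maxlen.
 * Returns 0 on success, -1 if the result (including the terminating
 * NUL byte) would not fit or the name was truncated.
 */
static int append_name(char *buf, size_t maxlen, const char *name)
{
	char tmp[32];
	int slen;

	assert(buf != NULL && name != NULL);

	/* snprintf() reports the length it wanted to write. */
	slen = snprintf(tmp, sizeof(tmp), "%s\n", name);
	if (slen < 0 || (size_t)slen >= sizeof(tmp))
		return -1;

	/* The "+ 1" reserves room for the NUL that strcat() appends. */
	if (strlen(buf) + (size_t)slen + 1 > maxlen)
		return -1;

	strcat(buf, tmp);
	return 0;
}
```

    Reusing the snprintf() return value for the length check is what makes
    a separate strlen() on the scratch buffer unnecessary.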

    Fixes: dc9a16e49dbba ("svc: Add /proc/sys/sunrpc/transport files")
    Signed-off-by: Christophe JAILLET
    [ cel: very minor fix to documenting comment ]
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Christophe JAILLET
     
  • [ Upstream commit a264abad51d8ecb7954a2f6d9f1885b38daffc74 ]

    RPC tasks on the backchannel never invoke xprt_complete_rqst(), so
    there is no way to report their tk_status at completion. Also, any
    RPC task that exits via rpc_exit_task() before it is replied to will
    also disappear without a trace.

    Introduce a trace point that is symmetrical with rpc_task_begin that
    captures the termination status of each RPC task.

    Sample trace output for callback requests initiated on the server:
    kworker/u8:12-448 [003] 127.025240: rpc_task_end: task:50@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
    kworker/u8:12-448 [002] 127.567310: rpc_task_end: task:51@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
    kworker/u8:12-448 [001] 130.506817: rpc_task_end: task:52@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task

    Odd, though, that I never see trace_rpc_task_complete, either in the
    forward or backchannel. Should it be removed?

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Chuck Lever
     

23 Sep, 2020

1 commit

  • [ Upstream commit 8c6b6c793ed32b8f9770ebcdf1ba99af423c303b ]

    Since p points at raw xdr data, there's no guarantee that it's NUL
    terminated, so we should give a length. And probably escape any special
    characters too.
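
    A minimal user-space sketch of the idea, assuming a hypothetical helper
    demo_escape(): the raw bytes are rendered with an explicit length, and
    anything that is not printable ASCII is written as \xNN. The kernel has
    its own formatting facilities for this; the helper below only
    illustrates the principle.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Render len bytes of p into out (size outlen), escaping anything that
 * is not printable ASCII as \xNN. p does not need a terminating NUL.
 * Returns the number of characters written, excluding the NUL.
 */
static size_t demo_escape(char *out, size_t outlen,
			  const unsigned char *p, size_t len)
{
	size_t used = 0, i;

	assert(out != NULL && outlen > 0);

	for (i = 0; i < len; i++) {
		int printable = isprint(p[i]) && p[i] != '\\';
		size_t need = printable ? 1 : 4;

		if (used + need + 1 > outlen)
			break;	/* stop early rather than overflow */
		if (printable)
			out[used++] = (char)p[i];
		else
			used += (size_t)snprintf(out + used, outlen - used,
						 "\\x%02x", (unsigned int)p[i]);
	}
	out[used] = '\0';
	return used;
}
```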

    Reported-by: Zhi Li
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    J. Bruce Fields
     

26 Aug, 2020

1 commit

  • [ Upstream commit 64d26422516b2e347b32e6d9b1d40b3c19a62aae ]

    During a connection tear down, the Receive queue is flushed before
    the device resources are freed. Typically, all the Receives flush
    with IB_WR_FLUSH_ERR.

    However, any pending successful Receives flush with IB_WR_SUCCESS,
    and the server automatically posts a fresh Receive to replace the
    completing one. This happens even after the connection has closed
    and the RQ is drained. Receives that are posted after the RQ is
    drained appear never to complete, causing a Receive resource leak.
    The leaked Receive buffer is left DMA-mapped.

    To prevent these late-posted recv_ctxt's from leaking, block new
    Receive posting after XPT_CLOSE is set.

    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     

19 Aug, 2020

2 commits

  • [ Upstream commit 986a4b63d3bc5f2c0eb4083b05aff2bf883b7b2f ]

    Braino when converting "buf->len -=" to "buf->len = len -".

    The result is under-estimation of the ralign and rslack values. On
    krb5p mounts, this has caused READDIR to fail with EIO, and KASAN
    splats when decoding READLINK replies.

    As a result of fixing this oversight, the gss_unwrap method now
    returns a buf->len that can be shorter than priv_len for small
    RPC messages. The additional adjustment done in unwrap_priv_data()
    can underflow buf->len. This causes the nfsd_request_too_large
    check to fail during some NFSv3 operations.

    Reported-by: Marian Rainer-Harbach
    Reported-by: Pierre Sauter
    BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1886277
    Fixes: 31c9590ae468 ("SUNRPC: Add "@len" parameter to gss_unwrap()")
    Reviewed-by: J. Bruce Fields
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit e814eecbe3bbeaa8b004d25a4b8974d232b765a9 ]

    Commit 07d0ff3b0cd2 ("svcrdma: Clean up Read chunk path") moved the
    page saver logic so that it gets executed even when an error occurs.
    In that case, the I/O is never posted, and those pages are then
    leaked. Errors in this path, however, are quite rare.

    Fixes: 07d0ff3b0cd2 ("svcrdma: Clean up Read chunk path")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     

11 Aug, 2020

1 commit

  • commit 412055398b9e67e07347a936fc4a6adddabe9cf4 upstream.

    svcrdma expects that the payload falls precisely into the xdr_buf
    page vector. This does not seem to be the case for
    nfsd4_encode_readv().

    This code is called only when fops->splice_read is missing or when
    RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
    common cases.

    Add new transport method: ->xpo_read_payload so that when a READ
    payload does not fit exactly in rq_res's page vector, the XDR
    encoder can inform the RPC transport exactly where that payload is,
    without the payload's XDR pad.

    That way, when a Write chunk is present, the transport knows what
    byte range in the Reply message is supposed to be matched with the
    chunk.

    Note that the Linux NFS server implementation of NFS/RDMA can
    currently handle only one Write chunk per RPC-over-RDMA message.
    This simplifies the implementation of this fix.

    Fixes: b04209806384 ("nfsd4: allow exotic read compounds")
    Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
    Signed-off-by: Chuck Lever
    Cc: Timo Rothenpieler
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     

05 Aug, 2020

1 commit


22 Jul, 2020

1 commit

  • [ Upstream commit 912288442cb2f431bf3c8cb097a5de83bc6dbac1 ]

    Currently the header size calculations are using an assignment
    operator instead of a += operator when accumulating the header
    size leading to incorrect sizes. Fix this by using the correct
    operator.
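
    The class of bug is easy to reproduce in isolation. In the sketch
    below, the constants FIXED_HDR and SEGMENT_SIZE and both function names
    are made up for illustration; they are not the values or names used by
    xprtrdma.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sizes only; the real header layout is not given in the text. */
enum { FIXED_HDR = 28, SEGMENT_SIZE = 16 };

/* Buggy form: the second assignment discards the fixed header size. */
static size_t hdr_size_buggy(unsigned int nsegs)
{
	size_t size;

	size = FIXED_HDR;
	size = nsegs * SEGMENT_SIZE;	/* overwrites instead of adding */
	return size;
}

/* Fixed form: each component is added to the running total. */
static size_t hdr_size_fixed(unsigned int nsegs)
{
	size_t size;

	size = FIXED_HDR;
	size += nsegs * SEGMENT_SIZE;	/* accumulates */
	return size;
}
```

    The overwritten first assignment is what a static analyzer reports as
    an unused value.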

    Addresses-Coverity: ("Unused value")
    Fixes: 302d3deb2068 ("xprtrdma: Prevent inline overflow")
    Signed-off-by: Colin Ian King
    Reviewed-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Colin Ian King
     

01 Jul, 2020

3 commits

  • commit 7b2182ec381f8ea15c7eb1266d6b5d7da620ad93 upstream.

    The RPC client currently doesn't handle ERR_CHUNK replies correctly.
    rpcrdma_complete_rqst() incorrectly passes a negative number to
    xprt_complete_rqst() as the number of bytes copied. Instead, set
    task->tk_status to the error value, and return zero bytes copied.

    In these cases, return -EIO rather than -EREMOTEIO. The RPC client's
    finite state machine doesn't know what to do with -EREMOTEIO.

    Additional clean ups:
    - Don't double-count RDMA_ERROR replies
    - Remove a stale comment

    Signed-off-by: Chuck Lever
    Cc:
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit 89a3c9f5b9f0bcaa9aea3e8b2a616fcaea9aad78 upstream.

    @subbuf is an output parameter of xdr_buf_subsegment(). A survey of
    call sites shows that @subbuf is always uninitialized before
    xdr_buf_subsegment() is invoked by callers.

    There are some execution paths through xdr_buf_subsegment() that do
    not set all of the fields in @subbuf, leaving some pointer fields
    containing garbage addresses. Subsequent processing of that buffer
    then results in a page fault.
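
    One common way to close this kind of hole is to clear the whole output
    structure before any early return. The sketch below uses a simplified,
    hypothetical struct demo_buf rather than the real xdr_buf, and it is
    not claimed to match the upstream change line for line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer descriptor. */
struct demo_buf {
	char *head;
	size_t head_len;
	char *tail;
	size_t tail_len;
};

/*
 * Describe bytes [base, base + len) of src in *sub. Returns 0 on
 * success, -1 if the range is out of bounds.
 */
static int demo_subsegment(const struct demo_buf *src, struct demo_buf *sub,
			   size_t base, size_t len)
{
	assert(src != NULL && sub != NULL);

	/* Clear every field first so no path leaves a stale pointer. */
	*sub = (struct demo_buf){ 0 };

	if (base > src->head_len || len > src->head_len - base)
		return -1;

	if (len != 0) {
		sub->head = src->head + base;
		sub->head_len = len;
	}
	/* tail and tail_len stay NULL/0: this range never reaches the tail. */
	return 0;
}
```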

    Signed-off-by: Chuck Lever
    Cc:
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit b7ade38165ca0001c5a3bd5314a314abbbfbb1b7 upstream.

    __rpc_depopulate(gssd_dentry) was lost on error path

    cc: stable@vger.kernel.org
    Fixes: commit 4b9a445e3eeb ("sunrpc: create a new dummy pipe for gssd to hold open")
    Signed-off-by: Vasily Averin
    Reviewed-by: Jeff Layton
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

24 Jun, 2020

1 commit

  • [ Upstream commit 118917d696dc59fd3e1741012c2f9db2294bed6f ]

    Fix off-by-one issues in 'rpc_ntop6':
    - 'snprintf' returns the number of characters which would have been
    written if enough space had been available, excluding the terminating
    null byte. Thus, a return value of 'sizeof(scopebuf)' means that the
    last character was dropped.
    - 'strcat' adds a terminating null byte to the string, thus if len ==
    buflen, the null byte is written past the end of the buffer.
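
    Both checks can be shown in a small stand-alone sketch. The function
    name append_scope(), the 8-byte scratch buffer, and the "%%%u" format
    are assumptions for this example, not the exact code in rpc_ntop6.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append "%<scope_id>" to the address text already in buf.
 * Returns the new string length, or 0 if anything would not fit.
 */
static size_t append_scope(char *buf, size_t buflen, unsigned int scope_id)
{
	char scopebuf[8];
	size_t len;
	int rc;

	assert(buf != NULL);

	rc = snprintf(scopebuf, sizeof(scopebuf), "%%%u", scope_id);
	/* rc == sizeof(scopebuf) already means the last character was cut. */
	if (rc < 0 || (size_t)rc >= sizeof(scopebuf))
		return 0;

	len = strlen(buf) + (size_t)rc;
	/* len == buflen would put strcat()'s NUL one byte past the end. */
	if (len >= buflen)
		return 0;

	strcat(buf, scopebuf);
	return len;
}
```

    In both comparisons the boundary case uses ">=" rather than ">",
    which is the one-byte difference the commit message describes.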

    Signed-off-by: Fedor Tokarev
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Fedor Tokarev
     

22 Jun, 2020

2 commits

  • commit 24c5efe41c29ee3e55bcf5a1c9f61ca8709622e8 upstream.

    gss_mech_register() calls svcauth_gss_register_pseudoflavor() for each
    flavour, but gss_mech_unregister() does not call auth_domain_put().
    This is unbalanced and makes it impossible to reload the module.

    Change svcauth_gss_register_pseudoflavor() to return the registered
    auth_domain, and save it for later release.
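
    The balanced pattern looks roughly like the sketch below: registration
    hands back the object it took a reference on, the caller stores it, and
    unregistration drops exactly that reference. All names here
    (demo_domain, demo_flavor, demo_register() and so on) are hypothetical
    stand-ins, not the SUNRPC API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted domain object. */
struct demo_domain {
	int refcount;
};

struct demo_flavor {
	struct demo_domain *domain;	/* saved at registration time */
};

static struct demo_domain demo_table[4];

/* Take a reference and hand the object back to the caller. */
static struct demo_domain *demo_register(unsigned int slot)
{
	if (slot >= 4)
		return NULL;
	demo_table[slot].refcount++;
	return &demo_table[slot];
}

static int demo_mech_register(struct demo_flavor *f, unsigned int slot)
{
	assert(f != NULL);
	f->domain = demo_register(slot);
	return f->domain ? 0 : -1;
}

/* Drop exactly the reference that registration took. */
static void demo_mech_unregister(struct demo_flavor *f)
{
	assert(f != NULL);
	if (f->domain) {
		f->domain->refcount--;
		f->domain = NULL;
	}
}
```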

    Cc: stable@vger.kernel.org (v2.6.12+)
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit d47a5dc2888fd1b94adf1553068b8dad76cec96c upstream.

    There is no valid case for supporting duplicate pseudoflavor
    registrations.
    Currently the silent acceptance of such registrations is hiding a bug.
    The rpcsec_gss_krb5 module registers 2 flavours but does not unregister
    them, so if you load, unload, reload the module, it will happily
    continue to use the old registration which now has pointers to the
    memory where the module was originally loaded. This could lead to
    unexpected results.

    So disallow duplicate registrations.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
    Cc: stable@vger.kernel.org (v2.6.12+)
    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

20 May, 2020

4 commits

  • commit 0a8e7b7d08466b5fc52f8e96070acc116d82a8bb upstream.

    I've noticed that when krb5i or krb5p security is in use,
    retransmitted requests are missing the server's duplicate reply
    cache. The computed checksum on the retransmitted request does not
    match the cached checksum, resulting in the server performing the
    retransmitted request again instead of returning the cached reply.

    The assumptions made when removing xdr_buf_trim() were not correct.
    In the send paths, the upper layer has already set the segment
    lengths correctly, and shortening the buffer's content is simply a
    matter of reducing buf->len.

    xdr_buf_trim() is the right answer in the receive/unwrap path on
    both the client and the server. The buffer segment lengths have to
    be shortened one-by-one.

    On the server side in particular, head.iov_len needs to be updated
    correctly to enable nfsd_cache_csum() to work correctly. The simple
    buf->len computation doesn't do that, and that results in
    checksumming stale data in the buffer.

    The problem isn't noticed until there's significant instability of
    the RPC transport. At that point, the reliability of retransmit
    detection on the server becomes crucial.

    Fixes: 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()")
    Signed-off-by: Chuck Lever
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • [ Upstream commit ce99aa62e1eb793e259d023c7f6ccb7c4879917b ]

    Ensure that signalled ASYNC rpc_tasks exit immediately instead of
    spinning until a timeout (or forever).

    To avoid checking for the signal flag on every scheduler iteration,
    the check is instead introduced in the client's finite state
    machine.

    Signed-off-by: Chuck Lever
    Fixes: ae67bd3821bb ("SUNRPC: Fix up task signalling")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit a7e429a6fa6d612d1dacde96c885dc1bb4a9f400 ]

    When the au_ralign field was added to gss_unwrap_resp_priv, the
    wrong calculation was used. Setting au_rslack == au_ralign is
    probably correct for kerberos_v1 privacy, but kerberos_v2 privacy
    adds additional GSS data after the clear text RPC message.
    au_ralign needs to be smaller than au_rslack in that fairly common
    case.

    When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does
    exactly what I feared it would: it trims off part of the clear text
    RPC message. However, that's because rpc_prepare_reply_pages() does
    not set up the rq_rcv_buf's tail correctly because au_ralign is too
    large.

    Fixing the au_ralign computation also corrects the alignment of
    rq_rcv_buf->pages so that the client does not have to shift reply
    data payloads after they are received.

    Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit 31c9590ae468478fe47dc0f5f0d3562b2f69450e ]

    Refactor: This is a pre-requisite to fixing the client-side ralign
    computation in gss_unwrap_resp_priv().

    The length value is passed in explicitly rather than as the value
    of buf->len. This will subsequently allow gss_unwrap_kerberos_v1()
    to compute a slack and align value, instead of computing it in
    gss_unwrap_resp_priv().

    Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     

02 May, 2020

2 commits

  • commit 23cf1ee1f1869966b75518c59b5cbda4c6c92450 upstream.

    Utilize the xpo_release_rqst transport method to ensure that each
    rqstp's svc_rdma_recv_ctxt object is released even when the server
    cannot return a Reply for that rqstp.

    Without this fix, each RPC whose Reply cannot be sent leaks one
    svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
    Receive buffer, and any pages that might be part of the Reply
    message.

    The leak is infrequent unless the network fabric is unreliable or
    Kerberos is in use, as GSS sequence window overruns, which result
    in connection loss, are more common on fast transports.

    Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
    Signed-off-by: Chuck Lever
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit e28b4fc652c1830796a4d3e09565f30c20f9a2cf upstream.

    I hit this while testing nfsd-5.7 with kernel memory debugging
    enabled on my server:

    Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8
    Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode
    Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page
    Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060
    Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
    Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591
    Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
    Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma]
    Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00
    Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286
    Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030
    Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246
    Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004
    Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80
    Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000
    Mar 30 13:21:45 klimt kernel: FS: 0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000
    Mar 30 13:21:45 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0
    Mar 30 13:21:45 klimt kernel: Call Trace:
    Mar 30 13:21:45 klimt kernel: ? svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma]
    Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma]
    Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma]
    Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd]
    Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc]
    Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd]
    Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb
    Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74
    Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50
    Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core
    Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8
    Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]---

    It's absolutely not safe to use resources pointed to by the @send_wr
    argument of ib_post_send() _after_ that function returns. Those
    resources are typically freed by the Send completion handler, which
    can run before ib_post_send() returns.

    Thus the trace points currently around ib_post_send() in the
    server's RPC/RDMA transport are a hazard, even when they are
    disabled. Rearrange them so that they touch the Work Request only
    _before_ ib_post_send() is invoked.
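
    The ordering rule can be modelled in user space. In the sketch below,
    demo_post() is a hypothetical stand-in for a post call whose completion
    handler runs, and frees the work request, before the call returns. The
    only safe place to read the request is before posting.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical work request; freed by the completion handler. */
struct demo_wr {
	int num_sge;
};

static int demo_last_traced_sge = -1;

/* Record the field a trace point would report. */
static void demo_trace(const struct demo_wr *wr)
{
	assert(wr != NULL);
	demo_last_traced_sge = wr->num_sge;
}

/*
 * Stand-in for a post operation whose completion may run before the
 * call returns. Here it always does, and the completion frees wr.
 */
static int demo_post(struct demo_wr *wr)
{
	free(wr);
	return 0;
}

static int demo_send(int num_sge)
{
	struct demo_wr *wr = malloc(sizeof(*wr));

	if (!wr)
		return -1;
	wr->num_sge = num_sge;

	demo_trace(wr);		/* touch wr only before posting */
	return demo_post(wr);	/* wr must not be dereferenced after this */
}
```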

    Fixes: bd2abef33394 ("svcrdma: Trace key RDMA API events")
    Fixes: 4201c7464753 ("svcrdma: Introduce svc_rdma_send_ctxt")
    Signed-off-by: Chuck Lever
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     

29 Apr, 2020

1 commit

  • commit 6221f1d9b63fed6260273e59a2b89ab30537a811 upstream.

    Currently, after the forward channel connection goes away,
    backchannel operations are causing soft lockups on the server
    because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
    Such backchannel Calls are aggressively retried until the client
    reconnects.

    Backchannel Calls should use RPC_TASK_NOCONNECT rather than
    RPC_TASK_SOFTCONN. If there is no forward connection, the server is
    not capable of establishing a connection back to the client, thus
    that backchannel request should fail before the server attempts to
    send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should
    use RPC_TASK_SOFTCONN") was merged several years before
    RPC_TASK_NOCONNECT was available.

    Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
    callback connection depends on the first callback RPC to initiate
    a connection to the client. Thus NFSv4.0 needs to continue to use
    RPC_TASK_SOFTCONN.

    Suggested-by: Trond Myklebust
    Signed-off-by: Chuck Lever
    Cc: # v4.20+
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     

23 Apr, 2020

2 commits

  • [ Upstream commit 4047aa909c4a40fceebc36fff708d465a4d3c6e2 ]

    xdr_buf_read_mic() tries to find unused contiguous space in a
    received xdr_buf in order to linearize the checksum for the call
    to gss_verify_mic. However, the corner cases in this code are
    numerous and we seem to keep missing them. I've just hit yet
    another buffer overrun related to it.

    This overrun is at the end of xdr_buf_read_mic():

        if (buf->tail[0].iov_len != 0)
                mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
        else
                mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
        __read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
        return 0;

    This logic assumes the transport has set the length of the tail
    based on the size of the received message. base + len is then
    supposed to be off the end of the message but still within the
    actual buffer.

    In fact, the length of the tail is set by the upper layer when the
    Call is encoded so that the end of the tail is actually the end of
    the allocated buffer itself. This causes the logic above to set
    mic->data to point past the end of the receive buffer.

    The "mic->data = head" arm of this if statement is no less fragile.

    As near as I can tell, this has been a problem forever. I'm not sure
    that minimizing au_rslack recently changed this pathology much.

    So instead, let's use a more straightforward approach: kmalloc a
    separate buffer to linearize the checksum. This is similar to
    how gss_validate() currently works.
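
    A simplified sketch of that approach, using malloc() in place of
    kmalloc() and a hypothetical two-segment struct demo_rcv in place of
    the real xdr_buf: the bytes of interest are copied into a separately
    allocated contiguous buffer instead of into spare space inside the
    receive buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-segment receive buffer. */
struct demo_rcv {
	const unsigned char *head;
	size_t head_len;
	const unsigned char *tail;
	size_t tail_len;
};

/*
 * Copy len bytes starting at logical offset off into a freshly
 * allocated contiguous buffer. Returns NULL on bad range or allocation
 * failure; the caller frees the result.
 */
static unsigned char *demo_linearize(const struct demo_rcv *buf,
				     size_t off, size_t len)
{
	unsigned char *out;
	size_t total, i;

	assert(buf != NULL);
	total = buf->head_len + buf->tail_len;

	if (len == 0 || off > total || len > total - off)
		return NULL;
	out = malloc(len);
	if (!out)
		return NULL;

	for (i = 0; i < len; i++) {
		size_t pos = off + i;

		out[i] = pos < buf->head_len ? buf->head[pos]
					     : buf->tail[pos - buf->head_len];
	}
	return out;
}
```

    Because the destination is a fresh allocation, no assumption about
    unused space past the end of the head or tail is needed.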

    Coming back to this code, I had some trouble understanding what
    was going on. So I've cleaned up the variable naming and added
    a few comments that point back to the XDR definition in RFC 2203
    to help guide future spelunkers, including myself.

    As an added clean up, the functionality that was in
    xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(),
    as that is its only caller.

    Signed-off-by: Chuck Lever
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit df513a7711712758b9cb1a48d86712e7e1ee03f4 ]

    Commit 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply
    buffer size") changed how "req->rq_rcvsize" is calculated. It used to
    use the au_cslack value, which was nice and large; that commit changed
    it to the au_rslack value, which turns out to be too small.

    Since 5.1, a v3 mount with sec=krb5p fails against an Ontap server
    because the client's receive buffer is too small.

    For gss krb5p, we need to account for the mic token in the verifier,
    and the wrap token in the wrap token.

    RFC 4121 defines:
    mic token
    Octet no   Name        Description
    --------------------------------------------------------------
    0..1       TOK_ID      Identification field. Tokens emitted by
                           GSS_GetMIC() contain the hex value 04 04
                           expressed in big-endian order in this
                           field.
    2          Flags       Attributes field, as described in section
                           4.2.2.
    3..7       Filler      Contains five octets of hex value FF.
    8..15      SND_SEQ     Sequence number field in clear text,
                           expressed in big-endian order.
    16..last   SGN_CKSUM   Checksum of the "to-be-signed" data and
                           octet 0..15, as described in section 4.2.4.

    that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum

    wrap token
    Octet no   Name        Description
    --------------------------------------------------------------
    0..1       TOK_ID      Identification field. Tokens emitted by
                           GSS_Wrap() contain the hex value 05 04
                           expressed in big-endian order in this
                           field.
    2          Flags       Attributes field, as described in section
                           4.2.2.
    3          Filler      Contains the hex value FF.
    4..5       EC          Contains the "extra count" field, in big-
                           endian order as described in section 4.2.3.
    6..7       RRC         Contains the "right rotation count" in big-
                           endian order, as described in section
                           4.2.5.
    8..15      SND_SEQ     Sequence number field in clear text,
                           expressed in big-endian order.
    16..last   Data        Encrypted data for Wrap tokens with
                           confidentiality, or plaintext data followed
                           by the checksum for Wrap tokens without
                           confidentiality, as described in section
                           4.2.4.

    Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum
    (other things like padding)

    RFC 3961 defines known cksum sizes:
    Checksum type         sumtype   checksum   section or
                          value     size       reference
    ---------------------------------------------------------------------
    CRC32                 1         4          6.1.3
    rsa-md4               2         16         6.1.2
    rsa-md4-des           3         24         6.2.5
    des-mac               4         16         6.2.7
    des-mac-k             5         8          6.2.8
    rsa-md4-des-k         6         16         6.2.6
    rsa-md5               7         16         6.1.1
    rsa-md5-des           8         24         6.2.4
    rsa-md5-des3          9         24         ??
    sha1 (unkeyed)        10        20         ??
    hmac-sha1-des3-kd     12        20         6.3
    hmac-sha1-des3        13        20         ??
    sha1 (unkeyed)        14        20         ??
    hmac-sha1-96-aes128   15        20         [KRB5-AES]
    hmac-sha1-96-aes256   16        20         [KRB5-AES]
    [reserved]            0x8003    ?          [GSS-KRB5]

    Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes.
    (GSS_KRB5_MAX_CKSUM_LEN)

    Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used
    for encoding the gss_wrap tokens (same tokens are used in reply).

    Fixes: 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size")
    Signed-off-by: Olga Kornievskaia
    Reviewed-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Olga Kornievskaia
     

24 Feb, 2020

1 commit


20 Feb, 2020

1 commit

  • commit ca1c671302825182629d3c1a60363cee6f5455bb upstream.

    The @nents value that was passed to ib_dma_map_sg() has to be passed
    to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() chooses to
    concatenate sg entries, it will return a different nents value than
    it was passed.

    The bug was exposed by recent changes to the AMD IOMMU driver, which
    enabled sg entry concatenation.

    Looking all the way back to commit 4143f34e01e9 ("xprtrdma: Port to
    new memory registration API") and reviewing other kernel ULPs, it's
    not clear that the frwr_map() logic was ever correct for this case.

    Reported-by: Andre Tomt
    Suggested-by: Robin Murphy
    Signed-off-by: Chuck Lever
    Cc: stable@vger.kernel.org
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     

11 Feb, 2020

1 commit

  • commit 3d96208c30f84d6edf9ab4fac813306ac0d20c10 upstream.

    When upcalling gssproxy, cache_head.expiry_time is set as a
    timeval, not seconds since boot. As such, RPC cache expiry
    logic will not clean expired objects created under
    auth.rpcsec.context cache.

    This has proven to cause kernel memory leaks in the field. Use the
    64-bit variants of getboottime/timespec.

    Expiration times have worked this way since 2010's c5b29f885afe "sunrpc:
    use seconds since boot in expiry cache". The gssproxy code introduced
    in 2012 added gss_proxy_save_rsc and introduced the bug. That's a while
    for this to lurk, but it required a bit of an extreme case to make it
    obvious.
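
    The unit mismatch is easy to see with plain integers. The sketch below
    assumes the conversion is a subtraction of the wall-clock boot time
    from the wall-clock expiry; the helper names and the sample timestamps
    are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an absolute wall-clock expiry (seconds since the epoch) into
 * seconds since boot. boot_wall is the wall-clock time at boot.
 */
static int64_t demo_expiry_since_boot(int64_t expiry_wall, int64_t boot_wall)
{
	return expiry_wall - boot_wall;
}

/* The cache compares expiry against seconds since boot. */
static int demo_is_expired(int64_t expiry, int64_t now_since_boot)
{
	return expiry <= now_since_boot;
}
```

    With an unconverted wall-clock value, the comparison in
    demo_is_expired() stays false for decades of uptime, which is why such
    entries are never cleaned.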

    Signed-off-by: Roberto Bergantinos Corpas
    Cc: stable@vger.kernel.org
    Fixes: 030d794bf498 "SUNRPC: Use gssproxy upcall for server..."
    Tested-By: Frank Sorenson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Roberto Bergantinos Corpas
     

26 Jan, 2020

3 commits

  • [ Upstream commit e8d70b321ecc9b23d09b8df63e38a2f73160c209 ]

    xdr_shrink_pagelen() BUG's when @len is larger than buf->page_len.
    This can happen when xdr_buf_read_mic() is given an xdr_buf with
    a small page array (like, only a few bytes).

    Instead, just cap the number of bytes that xdr_shrink_pagelen()
    will move.
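
    Capping the length is a one-line guard. The sketch below uses a
    hypothetical struct demo_xdr that tracks only lengths, so it shows the
    clamp but none of the actual data movement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer: only the lengths matter here. */
struct demo_xdr {
	size_t page_len;
	size_t tail_len;
};

/* Move up to len bytes from the page area into the tail. */
static size_t demo_shrink_pagelen(struct demo_xdr *buf, size_t len)
{
	assert(buf != NULL);

	/* Cap the request instead of treating len > page_len as fatal. */
	if (len > buf->page_len)
		len = buf->page_len;

	buf->page_len -= len;
	buf->tail_len += len;
	return len;
}
```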

    Fixes: 5f1bc39979d ("SUNRPC: Fix buffer handling of GSS MIC ... ")
    Signed-off-by: Chuck Lever
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • commit 8729aaba74626c4ebce3abf1b9e96bb62d2958ca upstream.

    I noticed that for callback requests, the reported backlog latency
    is always zero, and the rtt value is crazy big. The problem was that
    rqst->rq_xtime is never set for backchannel requests.

    Fixes: 78215759e20d ("SUNRPC: Make RTT measurement more ... ")
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit 5866efa8cbfbadf3905072798e96652faf02dbe8 upstream.

    gss_read_proxy_verf() assumes things about the XDR buffer containing
    the RPC Call that are not true for buffers generated by
    svc_rdma_recv().

    RDMA's buffers look more like what the upper layer generates for
    sending: head is a kmalloc'd buffer; it does not point to a page
    whose contents are contiguous with the first page in the buffers'
    page array. The result is that ACCEPT_SEC_CONTEXT via RPC/RDMA has
    stopped working on Linux NFS servers that use gssproxy.

    This does not affect clients that use only TCP to send their
    ACCEPT_SEC_CONTEXT operation (that's all Linux clients). Other
    clients, like Solaris NFS clients, send ACCEPT_SEC_CONTEXT on the
    same transport as they send all other NFS operations. Such clients
    can send ACCEPT_SEC_CONTEXT via RPC/RDMA.

    I thought I had found every direct reference in the server RPC code
    to the rqstp->rq_pages field.

    Bug found at the 2019 Westford NFS bake-a-thon.

    Fixes: 3316f0631139 ("svcrdma: Persistently allocate and DMA- ... ")
    Signed-off-by: Chuck Lever
    Tested-by: Bill Baker
    Reviewed-by: Simo Sorce
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
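
The capping described in commit e8d70b321ecc above can be sketched roughly as follows. This is a minimal illustration, not the kernel code: `sketch_buf` and `sketch_shrink_pagelen()` are hypothetical stand-ins for `struct xdr_buf` and `xdr_shrink_pagelen()`, and only the `page_len` bookkeeping is modelled (no data is actually moved into the tail).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the page_len field of struct xdr_buf. */
struct sketch_buf {
	size_t page_len;	/* bytes currently held in the page array */
};

/*
 * Cap the requested shrink at page_len instead of treating an
 * oversized len as a fatal condition. Returns the number of bytes
 * actually removed from the page array.
 */
static size_t sketch_shrink_pagelen(struct sketch_buf *buf, size_t len)
{
	assert(buf != NULL);

	if (len > buf->page_len)
		len = buf->page_len;	/* the cap: never move more than exists */
	buf->page_len -= len;
	return len;
}
```

With a page array holding only a few bytes, a larger request is trimmed to what is present rather than tripping an assertion.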
     

18 Jan, 2020

7 commits

  • commit 671c450b6fe0680ea1cb1cf1526d764fdd5a3d3f upstream.

    Since v5.4, a device removal occasionally triggered this oops:

    Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
    Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
    Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
    Dec 2 17:13:53 manet kernel: PGD 0 P4D 0
    Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
    Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883
    Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
    Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
    Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
    Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
    Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
    Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
    Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
    Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
    Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
    Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
    Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
    Dec 2 17:13:53 manet kernel: Call Trace:
    Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
    Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
    Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9
    Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
    Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

    The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
    is still pointing to the old ib_device, which has been freed. The
    only way that is possible is if this rpcrdma_rep was not destroyed
    by rpcrdma_ia_remove.

    Debugging showed that was indeed the case: this rpcrdma_rep was
    still in use by a completing RPC at the time of the device removal,
    and thus wasn't on the rep free list. So, it was not found by
    rpcrdma_reps_destroy().

    The fix is to introduce a list of all rpcrdma_reps so that they all
    can be found when a device is removed. That list is used to perform
    only regbuf DMA unmapping, replacing that call to
    rpcrdma_reps_destroy().

    Meanwhile, to prevent corruption of this list, I've moved the
    destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
    rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
    not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
    protecting the rb_all_reps list.

    Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit 13cb886c591f341a8759f175292ddf978ef903a1 upstream.

    I've found that on occasion, "rmmod " will hang if an NFS mount
    is under load.

    Ensure that ri_remove_done is initialized only just before the
    transport is woken up to force a close. This avoids the completion
    possibly getting initialized again while the CM event handler is
    waiting for a wake-up.

    Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit b32b9ed493f938e191f790a0991d20b18b38c35b upstream.

    On device re-insertion, the RDMA device driver crashes trying to set
    up a new QP:

    Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
    Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
    Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
    Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
    Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
    Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852
    Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
    Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
    Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
    Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
    Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
    Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
    Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
    Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
    Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
    Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
    Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
    Nov 27 16:32:06 manet kernel: Call Trace:
    Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
    Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
    Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
    Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
    Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

    The fix is to copy the qp_init_attr struct that was just created by
    rpcrdma_ep_create() instead of using the one from the previous
    connection instance.

    Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit 2ae50ad68cd79224198b525f7bd645c9da98b6ff upstream.

    A recent clean up attempted to separate Receive handling and RPC
    Reply processing, in the name of clean layering.

    Unfortunately, we can't do this because the Receive Queue has to be
    refilled _after_ the most recent credit update from the responder
    is parsed from the transport header, but _before_ we wake up the
    next RPC sender. That is right in the middle of
    rpcrdma_reply_handler().

    Usually this isn't a problem because current responder
    implementations don't vary their credit grant. The one exception is
    when a connection is established: the grant goes from one to a much
    larger number on the first Receive. The requester MUST post enough
    Receives right then so that any outstanding requests can be sent
    without risking RNR and connection loss.

    Fixes: 6ceea36890a0 ("xprtrdma: Refactor Receive accounting")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit c3700780a096fc66467c81076ddf7f3f11d639b5 upstream.

    Close some holes introduced by commit 6dc6ec9e04c4 ("xprtrdma: Cache
    free MRs in each rpcrdma_req") that could result in list corruption.

    In addition, the result that is tabulated in @count is no longer
    used, so @count is removed.

    Fixes: 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit a31b2f939219dd9bffdf01a45bd91f209f8cc369 upstream.

    This is because xprt_request_get_cong() is allowing more than one
    RPC Call to be transmitted before the first Receive on the new
    connection. The first Receive fills the Receive Queue based on the
    server's credit grant. Before that Receive, there is only a single
    Receive WR posted because the client doesn't know the server's
    credit grant.

    The solution is to clear rq_cong on all outstanding rpc_rqsts when
    the cwnd is reset. This is because an RPC/RDMA credit is good for
    one connection instance only.

    Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit 4b93dab36f28e673725e5e6123ebfccf7697f96a upstream.

    When adding frwr_unmap_async way back when, I re-used the existing
    trace_xprtrdma_post_send() trace point to record the return code
    of ib_post_send.

    Unfortunately there are some cases where re-using that trace point
    causes a crash. Instead, construct a trace point specific to posting
    Local Invalidate WRs that will always be safe to use in that context,
    and will act as a trace log eye-catcher for Local Invalidation.

    Fixes: 847568942f93 ("xprtrdma: Remove fr_state")
    Fixes: d8099feda483 ("xprtrdma: Reduce context switching due ... ")
    Signed-off-by: Chuck Lever
    Tested-by: Bill Baker
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
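
The idea behind commit a31b2f939219 above (a credit is good for one connection instance only) can be sketched as follows. This is a simplified model under stated assumptions: `sketch_rqst`, `sketch_xprt`, `sketch_get_cong()` and `sketch_reset_cwnd()` are hypothetical names, credits are counted in whole units rather than the kernel's scaled values, and outstanding requests are kept in a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for rpc_rqst and rpc_xprt. */
struct sketch_rqst {
	bool rq_cong;		/* true while this request holds a credit */
};

struct sketch_xprt {
	unsigned int cong;	/* credits currently in use */
	unsigned int cwnd;	/* credits available on this connection */
	struct sketch_rqst *rqsts;	/* all outstanding requests */
	size_t nr_rqsts;
};

/* Take a credit for rqst if the current window allows it. */
static bool sketch_get_cong(struct sketch_xprt *xprt, struct sketch_rqst *rqst)
{
	assert(xprt != NULL && rqst != NULL);

	if (rqst->rq_cong)
		return true;	/* a stale flag here would bypass the window */
	if (xprt->cong >= xprt->cwnd)
		return false;
	rqst->rq_cong = true;
	xprt->cong++;
	return true;
}

/* On reconnect: drop every held credit and shrink the window to one. */
static void sketch_reset_cwnd(struct sketch_xprt *xprt)
{
	size_t i;

	assert(xprt != NULL);
	for (i = 0; i < xprt->nr_rqsts; i++)
		xprt->rqsts[i].rq_cong = false;
	xprt->cong = 0;
	xprt->cwnd = 1;
}
```

In this model, skipping the loop in `sketch_reset_cwnd()` would leave old flags set, so several requests could pass `sketch_get_cong()` while only a single Receive is posted on the new connection.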