17 Nov, 2019

1 commit

  • Currently, WR sizes are updated from rds_ib_sysctl_max_send_wr and
    rds_ib_sysctl_max_recv_wr when a connection is shut down. As a result,
    a connection that is already down when rds_ib_sysctl_max_send_wr or
    rds_ib_sysctl_max_recv_wr is updated will not pick up the new sizes
    when it comes back up.

    Move the resizing of WRs to rds_ib_setup_qp so that connections are
    set up with the most current WR sizes.
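
    A minimal sketch of the idea, assuming the existing
    rds_ib_ring_resize() helper is what moves (the exact placement inside
    rds_ib_setup_qp is an assumption):

    /* in rds_ib_setup_qp(), before sizing the QP/CQs: sample the
     * sysctls now, so every (re)connect uses the current values. */
    rds_ib_ring_resize(&ic->i_send_ring, rds_ib_sysctl_max_send_wr);
    rds_ib_ring_resize(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr);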

    Signed-off-by: Dag Moxnes
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Dag Moxnes
     

03 Oct, 2019

1 commit

  • rds_ibdev->ipaddr_list and rds_ibdev->conn_list are initialized
    after allocating some resources, such as the protection domain.
    If allocation of such resources fails, these uninitialized
    variables are accessed in rds_ib_dev_free() on the failure path. This
    can potentially crash the system. The code has been updated to
    initialize these variables very early in the function.
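
    A sketch of the pattern (the allocation and PD calls mirror
    net/rds/ib.c; error-path details are elided):

    rds_ibdev = kzalloc_node(sizeof(struct rds_ib_device), GFP_KERNEL,
                             ibdev_to_node(device));
    if (!rds_ibdev)
            return;

    /* Initialize these first, so rds_ib_dev_free() can walk the lists
     * safely even if a later allocation (e.g. the PD) fails. */
    INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
    INIT_LIST_HEAD(&rds_ibdev->conn_list);

    rds_ibdev->pd = ib_alloc_pd(device, 0);
    if (IS_ERR(rds_ibdev->pd))
            goto put_dev;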

    Signed-off-by: Dotan Barak
    Signed-off-by: Sudhakar Dindukurti
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Dotan Barak
     

27 Sep, 2019

1 commit

  • In rds_bind(), laddr_check is called without first checking whether it
    is NULL. Also, rs_transport should be reset if rds_add_bound() fails.
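
    A sketch of both checks (hedged; the errno and label names are
    assumptions):

    if (rs->rs_transport) {                         /* previously bound */
            trans = rs->rs_transport;
            if (!trans->laddr_check ||              /* new: may be unset */
                trans->laddr_check(sock_net(sock->sk),
                                   binding_addr, scope_id) != 0) {
                    ret = -ENOPROTOOPT;
                    goto out;
            }
    }
    ...
    ret = rds_add_bound(rs, binding_addr, &port, scope_id);
    if (ret)
            rs->rs_transport = NULL;                /* new: undo on failure */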

    Fixes: c5c1a030a7db ("net/rds: An rds_sock is added too early to the hash table")
    Reported-by: syzbot+fae39afd2101a17ec624@syzkaller.appspotmail.com
    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     

26 Sep, 2019

1 commit


18 Sep, 2019

1 commit


16 Sep, 2019

1 commit

  • All entries in 'rds_ib_stat_names' are stringified versions
    of the corresponding "struct rds_ib_statistics" element
    without the "s_"-prefix.

    Fix entry 'ib_evt_handler_call' to do the same.
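
    For illustration, the convention pairs each member with its
    stringified, prefix-stripped name:

    /* struct rds_ib_statistics member */
    uint64_t        s_ib_evt_handler_call;

    /* matching rds_ib_stat_names[] entry */
    "ib_evt_handler_call",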

    Fixes: f4f943c958a2 ("RDS: IB: ack more receive completions to improve performance")
    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     

15 Sep, 2019

1 commit


11 Sep, 2019

1 commit

  • In rds_bind(), an rds_sock is added to the RDS bind hash table before
    rs_transport is set. This means that the socket can be found by the
    receive code path while rs_transport is still NULL, and the receive code
    path dereferences rs_transport for the congestion update check. This can
    cause a panic. An rds_sock should not be added to the bind hash table
    before all the needed fields are set.
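
    A sketch of the corrected ordering in rds_bind() (hedged; error
    handling elided):

    trans = rds_trans_get_preferred(sock_net(sock->sk), binding_addr,
                                    scope_id);
    if (!trans) { /* ... fail the bind ... */ }
    rs->rs_transport = trans;       /* finish initializing first... */
    ret = rds_add_bound(rs, binding_addr, &port, scope_id);
                                    /* ...then publish to the hash table */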

    Reported-by: syzbot+4b4f8163c2e246df3c4c@syzkaller.appspotmail.com
    Signed-off-by: Ka-Cheong Poon
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     

05 Sep, 2019

1 commit

  • IN_MULTICAST's primary intent is as a uapi macro.

    Elsewhere in the kernel we use ipv4_is_multicast consistently.

    This patch unifies Linux's multicast checks to use that function
    rather than this macro.
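
    For reference, the existing helper from <linux/in.h> (this is the
    kernel's current definition, not new code):

    static inline bool ipv4_is_multicast(__be32 addr)
    {
            return (addr & htonl(0xf0000000)) == htonl(0xe0000000);
    }

    So a check like IN_MULTICAST(ntohl(addr)) becomes ipv4_is_multicast(addr),
    operating directly on the network-byte-order address.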

    Signed-off-by: Dave Taht
    Reviewed-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Dave Taht
     

03 Sep, 2019

1 commit


28 Aug, 2019

2 commits


25 Aug, 2019

1 commit

  • From the IB specification, section 7.6.5 SERVICE LEVEL: the Service
    Level (SL) is used to identify different flows within an IBA subnet.
    It is carried in the local route header of the packet.

    Before this commit, running "rds-info -I" produced the output
    below:
    "
    RDS IB Connections:
    LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
    192.2.95.3 192.2.95.1 2 0 fe80::21:28:1a:39 fe80::21:28:10:b9
    192.2.95.3 192.2.95.1 1 0 fe80::21:28:1a:39 fe80::21:28:10:b9
    192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
    "
    After this commit, the output is as below:
    "
    RDS IB Connections:
    LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
    192.2.95.3 192.2.95.1 2 2 fe80::21:28:1a:39 fe80::21:28:10:b9
    192.2.95.3 192.2.95.1 1 1 fe80::21:28:1a:39 fe80::21:28:10:b9
    192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
    "

    The commit fe3475af3bdf ("net: rds: add per rds connection cache
    statistics") added cache_allocs to struct rds_info_rdma_connection
    as below:

    struct rds_info_rdma_connection {
    ...
            __u32           rdma_mr_max;
            __u32           rdma_mr_size;
            __u8            tos;
            __u32           cache_allocs;
    };

    The peer struct of struct rds_info_rdma_connection in rds-tools is:

    struct rds_info_rdma_connection {
    ...
            uint32_t        rdma_mr_max;
            uint32_t        rdma_mr_size;
            uint8_t         tos;
            uint8_t         sl;
            uint32_t        cache_allocs;
    };

    The difference between userspace and kernel is the member variable sl;
    in the kernel struct, it is missing. This mismatch in structure layout
    introduces risk, so this commit is needed to avoid it.
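
    Presumably the fix adds the missing member so the two layouts match
    again (a sketch; the placement mirrors the rds-tools struct above):

    struct rds_info_rdma_connection {
    ...
            __u8            tos;
            __u8            sl;     /* added to match rds-tools */
            __u32           cache_allocs;
    };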

    Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
    CC: Joe Jin
    CC: JUNXIAO_BI
    Suggested-by: Gerd Rausch
    Signed-off-by: Zhu Yanjun
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Zhu Yanjun
     

24 Aug, 2019

1 commit

  • Add the RDMA cookie and RX timestamp to the usercopy whitelist.

    After the introduction of hardened usercopy whitelisting
    (https://lwn.net/Articles/727322/), a warning is displayed when the
    RDMA cookie or RX timestamp is copied to userspace:

    kernel: WARNING: CPU: 3 PID: 5750 at
    mm/usercopy.c:81 usercopy_warn+0x8e/0xa6
    [...]
    kernel: Call Trace:
    kernel: __check_heap_object+0xb8/0x11b
    kernel: __check_object_size+0xe3/0x1bc
    kernel: put_cmsg+0x95/0x115
    kernel: rds_recvmsg+0x43d/0x620 [rds]
    kernel: sock_recvmsg+0x43/0x4a
    kernel: ___sys_recvmsg+0xda/0x1e6
    kernel: ? __handle_mm_fault+0xcae/0xf79
    kernel: __sys_recvmsg+0x51/0x8a
    kernel: SyS_recvmsg+0x12/0x1c
    kernel: do_syscall_64+0x79/0x1ae

    When the whitelisting feature was introduced, the memory for the RDMA
    cookie and RX timestamp in RDS was not added to the whitelist, causing
    the warning above.
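
    The whitelisting itself happens where the backing slab is created; a
    sketch using the generic slab API (the RDS struct and field names here
    are assumptions):

    rds_ib_incoming_slab =
            kmem_cache_create_usercopy("rds_ib_incoming",
                                       sizeof(struct rds_ib_incoming),
                                       0, SLAB_HWCACHE_ALIGN,
                                       /* whitelisted region: the part
                                        * later copied via put_cmsg() */
                                       offsetof(struct rds_ib_incoming,
                                                ii_inc.i_usercopy),
                                       sizeof(struct rds_inc_usercopy),
                                       NULL);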

    Signed-off-by: Dag Moxnes
    Tested-by: Jenny
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Dag Moxnes
     

16 Aug, 2019

5 commits

  • Original commit from 2011, updated to include a change by
    Yuval Shaia
    that adds a new statistic counter "send_stuck_rm"
    to capture messages looping excessively
    in the send path.

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Andy Grover
     
  • In a previous commit, fields were added to "struct rds_statistics"
    but array "rds_stat_names" was not updated accordingly.

    Please note the inconsistent naming of the string representations
    that is done in the name of compatibility
    with the Oracle internal code-base.

    s_recv_bytes_added_to_socket -> "recv_bytes_added_to_sock"
    s_recv_bytes_removed_from_socket -> "recv_bytes_freed_fromsock"

    Fixes: 192a798f5299 ("RDS: add stat for socket recv memory usage")
    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • Signed-off-by: Chris Mason
    Signed-off-by: Bang Nguyen
    Signed-off-by: Gerd Rausch
    Signed-off-by: Somasundaram Krishnasamy
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Chris Mason
     
  • This will kick the RDS worker thread if we have been looping
    too long.

    Original commit from 2012 updated to include a change by
    Venkat Venkatsubra
    that triggers "must_wake" if "rds_ib_recv_refill_one" fails.

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Chris Mason
     
  • Add support of the socket options RDS6_INFO_SOCKETS and
    RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and
    RDS_INFO_RECV_MESSAGES options respectively. The old options work
    for IPv4 sockets only.

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     

28 Jul, 2019

1 commit

  • In rds_rdma_cm_event_handler_cmn(), there are some if statements that
    check whether conn is NULL, such as on lines 65, 96 and 112.
    But conn is not checked before being used on line 108:

    trans->cm_connect_complete(conn, event);

    and on lines 140-143:

    rdsdebug("DISCONNECT event - dropping connection %pI6c->%pI6c\n",
             &conn->c_laddr, &conn->c_faddr);
    rds_conn_drop(conn);

    Thus, possible null-pointer dereferences may occur.

    To fix these bugs, conn is checked before being used, as sketched
    below.

    These bugs were found by STCheck, a static analysis tool written by us.
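
    A sketch of the fix, guarding both uses:

    if (conn)
            trans->cm_connect_complete(conn, event);
    ...
    if (conn) {
            rdsdebug("DISCONNECT event - dropping connection %pI6c->%pI6c\n",
                     &conn->c_laddr, &conn->c_faddr);
            rds_conn_drop(conn);
    }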

    Signed-off-by: Jia-Ju Bai
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Jia-Ju Bai
     

20 Jul, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix AF_XDP cq entry leak, from Ilya Maximets.

    2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.

    3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.

    4) Fix handling of neigh timers wrt. entries added by userspace, from
    Lorenzo Bianconi.

    5) Various cases of missing of_node_put(), from Nishka Dasgupta.

    6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.

    7) Various RDS layer fixes, from Gerd Rausch.

    8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
    Cong Wang.

    9) Fix FIB source validation checks over loopback, also from Cong Wang.

    10) Use promisc for unsupported number of filters, from Justin Chen.

    11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    tcp: fix tcp_set_congestion_control() use from bpf hook
    ag71xx: fix return value check in ag71xx_probe()
    ag71xx: fix error return code in ag71xx_probe()
    usb: qmi_wwan: add D-Link DWM-222 A2 device ID
    bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
    net: dsa: sja1105: Fix missing unlock on error in sk_buff()
    gve: replace kfree with kvfree
    selftests/bpf: fix test_xdp_noinline on s390
    selftests/bpf: fix "valid read map access into a read-only array 1" on s390
    net/mlx5: Replace kfree with kvfree
    MAINTAINERS: update netsec driver
    ipv6: Unlink sibling route in case of failure
    liquidio: Replace vmalloc + memset with vzalloc
    udp: Fix typo in net/ipv4/udp.c
    net: bcmgenet: use promisc for unsupported filters
    ipv6: rt6_check should return NULL if 'from' is NULL
    tipc: initialize 'validated' field of received packets
    selftests: add a test case for rp_filter
    fib: relax source validation check for loopback packets
    mlxsw: spectrum: Do not process learned records with a dummy FID
    ...

    Linus Torvalds
     

18 Jul, 2019

7 commits

  • Otherwise, if an IB connection is torn down before "rds_ib_setup_qp"
    is called, the value of "ic->i_fastreg_wrs" is still at zero
    (as it wasn't initialized by "rds_ib_setup_qp").
    Consequently "rds_ib_conn_path_shutdown" will spin forever,
    waiting for it to go back to "RDS_IB_DEFAULT_FR_WR",
    which of course will never happen as there are no
    outstanding work requests.
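
    The fix initializes the counter where the connection itself is
    allocated, so the value is sane even if "rds_ib_setup_qp" never runs;
    a one-line sketch (placement in "rds_ib_conn_alloc" is inferred from
    the text):

    atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);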

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put
    the FRMR memory segments on the "drop_list" or "free_list",
    and it is the job of "rds_ib_flush_mr_pool" to reap those entries
    by ultimately issuing a "IB_WR_LOCAL_INV" work-request,
    we need to trigger and then wait for all those memory segments
    attached to a particular connection to be fully released before
    we can move on to release the QP, CQ, etc.

    So we make "rds_ib_conn_path_shutdown" wait for one more
    atomic_t called "i_fastreg_inuse_count" that keeps track of how
    many FRWR memory segments are out there marked "FRMR_IS_INUSE"
    (and also wake_up rds_ib_ring_empty_wait, as they go away).
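
    A sketch of the extended shutdown wait (hedged; modeled on the existing
    rds_ib_ring_empty_wait usage):

    wait_event(rds_ib_ring_empty_wait,
               rds_ib_ring_empty(&ic->i_recv_ring) &&
               atomic_read(&ic->i_fastreg_inuse_count) == 0);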

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • Fix a bug where fr_state first goes to FRMR_IS_STALE because of a failure
    of the IB_WR_LOCAL_INV operation, but then gets set back to "FRMR_IS_FREE"
    unconditionally, even though the operation failed.
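
    A sketch of the corrected transition in the completion handler (the
    exact guard is an assumption):

    if (frmr->fr_inv) {
            if (wc->status == IB_WC_SUCCESS)        /* only on success... */
                    frmr->fr_state = FRMR_IS_FREE;  /* ...else stay FRMR_IS_STALE */
            frmr->fr_inv = false;
    }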

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • Make function "rds_ib_try_reuse_ibmr" return NULL in case
    memory region could not be allocated, since callers
    simply check if the return value is not NULL.
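
    The shape of the change (sketch):

    static struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct rds_ib_mr_pool *pool)
    {
            ...
            return NULL;    /* was ERR_PTR(-EAGAIN); callers only test for NULL */
    }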

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • In order to:
    1) avoid a silly bouncing between "clean_list" and "drop_list",
    triggered by function "rds_ib_reg_frmr" as it releases frmr
    regions whose state is not "FRMR_IS_FREE" right away, and

    2) prevent an invalid-access error in a race between a pending
    "IB_WR_LOCAL_INV" operation and a teardown ("dma_unmap_sg", "put_page")
    and de-registration ("ib_dereg_mr") of the corresponding
    memory region.

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • Waiting for activity on the "clean_list" to quiesce is no substitute
    for proper locking.

    We can have multiple threads competing for "llist_del_first"
    via "rds_ib_reuse_mr", and a single thread competing
    for "llist_del_all" and "llist_del_first" via "rds_ib_flush_mr_pool".

    Since "llist_del_first" depends on "list->first->next" not to change
    in the midst of the operation, simply waiting for all current calls
    to "rds_ib_reuse_mr" to quiesce across all CPUs is woefully inadequate:

    By the time "wait_clean_list_grace" is done iterating over all CPUs to see
    that there is no concurrent caller to "rds_ib_reuse_mr", a new caller may
    have just shown up on the first CPU.

    Furthermore, <linux/llist.h> explicitly calls out the need for locking:
    * Cases where locking is needed:
    * If we have multiple consumers with llist_del_first used in one consumer,
    * and llist_del_first or llist_del_all used in other consumers,
    * then a lock is needed.

    Also, while at it, drop the unused "pool" parameter
    from "list_to_llist_nodes".

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     
  • In the context of FRMR (ib_frmr.c):

    Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool",
    after the memory region has been posted for invalidation via
    "rds_ib_post_inv".

    At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE",
    since the only place where "fr_state" transitions to "FRMR_IS_FREE"
    is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet.

    So in case we notice that "fr_state != FRMR_IS_FREE" (see below),
    we wait for "fr_inv_done" to trigger with a maximum of 10msec.
    Then we check again, and only put the memory region onto the drop_list
    (via "rds_ib_free_frmr") in case the situation remains unchanged.

    This avoids the problem of memory-regions bouncing between "clean_list"
    and "drop_list" before they even have a chance to be properly invalidated.

    Signed-off-by: Gerd Rausch
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Gerd Rausch
     

16 Jul, 2019

1 commit

  • Pull rdma updates from Jason Gunthorpe:
    "A smaller cycle this time. Notably we see another new driver, 'Soft
    iWarp', and the deletion of an ancient unused driver for nes.

    - Revise and simplify the signature offload RDMA MR APIs

    - More progress on hoisting object allocation boiler plate code out
    of the drivers

    - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
    i40iw

    - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
    conversion

    - Removal of obsolete ib_ucm chardev and nes driver

    - netlink based discovery of chardevs and autoloading of the modules
    providing them

    - Move more of the rdmavt/hfi1 uapi to include/uapi/rdma

    - New driver 'siw' for software based iWarp running on top of netdev,
    much like rxe's software RoCE.

    - mlx5 feature to report events in their raw devx format to userspace

    - Expose per-object counters through rdma tool

    - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
    from netdev"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
    RMDA/siw: Require a 64 bit arch
    RDMA/siw: Mark expected switch fall-throughs
    RDMA/core: Fix -Wunused-const-variable warnings
    rdma/siw: Remove set but not used variable 's'
    rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
    RDMA/siw: Add missing rtnl_lock around access to ifa
    rdma/siw: Use proper enumerated type in map_cqe_status
    RDMA/siw: Remove unnecessary kthread create/destroy printouts
    IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
    RDMA/core: Fix race when resolving IP address
    RDMA/core: Make rdma_counter.h compile stand alone
    IB/core: Work on the caller socket net namespace in nldev_newlink()
    RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
    RDMA/mlx5: Set RDMA DIM to be enabled by default
    RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
    RDMA/core: Provide RDMA DIM support for ULPs
    linux/dim: Implement RDMA adaptive moderation (DIM)
    IB/mlx5: Report correctly tag matching rendezvous capability
    docs: infiniband: add it to the driver-api bookset
    IB/mlx5: Implement VHCA tunnel mechanism in DEVX
    ...

    Linus Torvalds
     

10 Jul, 2019

5 commits

  • Connections with legitimate tos values can get into the usual
    connection race. It can result in a consumer reject. We don't want the
    tos value or protocol version to be demoted for such connections,
    otherwise peers would end up with different tos values, which can
    result in no connection being established. For example, a
    peer-initiated connection with, say, tos 8 could get downgraded to
    tos 0 by the usual connection racing, which is not desirable.

    The patch fixes the above issue, introduced by
    commit d021fabf525f ("rds: rdma: add consumer reject")

    Reported-by: Yanjun Zhu
    Tested-by: Yanjun Zhu
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The proper "tos" value needs to be returned
    to user-space (sockopt RDS_INFO_CONNECTIONS).

    Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure")
    Signed-off-by: Gerd Rausch
    Reviewed-by: Zhu Yanjun
    Signed-off-by: Santosh Shilimkar

    Gerd Rausch
     
  • Prior to
    commit d021fabf525f ("rds: rdma: add consumer reject")

    function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
    connection attempt by issuing a "rds_conn_drop".

    The commit mentioned above added a "break", eliminating
    the "fallthrough" case, and made the "rds_conn_drop" rather conditional:

    Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
    carries an integer value of "1" inside "private_data":

    if (!conn)
            break;
    err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
    if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
            pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
                    &conn->c_laddr, &conn->c_faddr);
            conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
            rds_conn_drop(conn);
    }
    rdsdebug("Connection rejected: %s\n",
             rdma_reject_msg(cm_id, event->status));
    break;
    /* FALLTHROUGH */

    A number of issues are worth mentioning here:
    #1) Previous versions of the RDS code simply rejected a connection
    by calling "rdma_reject(cm_id, NULL, 0);"
    So the value of the payload in "private_data" will not be "1",
    but "0".

    #2) Now the code has become dependent on host byte order and sizing.
    If one peer is big-endian, the other is little-endian,
    or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
    the *err check does not work as intended.

    #3) There is no check for "len" to see if the data behind *err is even valid.
    Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
    carry 148 bytes of zeroized payload.
    But that should probably not be relied upon here.

    #4) With the added "break;",
    we might as well drop the misleading "/* FALLTHROUGH */" comment.

    This commit does _not_ address issue #2, as the sender would have to
    agree on a byte order as well.

    Here is the sequence of messages in this observed error-scenario:
    Host-A is pre-QoS changes (excluding the commit mentioned above)
    Host-B is post-QoS changes (including the commit mentioned above)

    #1 Host-B
    issues a connection request via function "rds_conn_path_transition";
    the connection state transitions to "RDS_CONN_CONNECTING"

    #2 Host-A
    rejects the incompatible connection request (from #1)
    It does so by calling "rdma_reject(cm_id, NULL, 0);"

    #3 Host-B
    receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
    But since the code is changed in the way described above,
    it won't drop the connection here, simply because "*err == 0".

    #4 Host-A
    issues a connection request

    #5 Host-B
    receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
    and ends up calling "rds_ib_cm_handle_connect".
    But since the state is already in "RDS_CONN_CONNECTING"
    (as of #1) it will end up issuing a "rdma_reject" without
    dropping the connection:
    if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
            /* Wait and see - our connect may still be succeeding */
            rds_ib_stats_inc(s_ib_connect_raced);
    }
    goto out;

    #6 Host-A
    receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
    drops the connection and tries again (goto #4) until it gives up.

    Tested-by: Zhu Yanjun
    Signed-off-by: Gerd Rausch
    Signed-off-by: Santosh Shilimkar

    Gerd Rausch
     
  • This reverts commit 56012459310a1dbcc55c2dbf5500a9f7571402cb.

    RDS kept spinning inside function "rds_ib_post_reg_frmr", waiting for
    "i_fastreg_wrs" to become incremented:

    while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
            atomic_inc(&ibmr->ic->i_fastreg_wrs);
            cpu_relax();
    }

    Looking at the original commit:

    commit 56012459310a ("RDS: IB: split the mr registration and
    invalidation path")

    In there, the "rds_ib_mr_cqe_handler" was changed in the following
    way:

    void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic,
                               struct ib_wc *wc)
    {
    ...
            if (frmr->fr_inv) {
                    frmr->fr_state = FRMR_IS_FREE;
                    frmr->fr_inv = false;
                    atomic_inc(&ic->i_fastreg_wrs);
            } else {
                    atomic_inc(&ic->i_fastunreg_wrs);
            }

    It looks like it's got it exactly backwards:

    Function "rds_ib_post_reg_frmr" keeps track of the outstanding
    requests via "i_fastreg_wrs".

    Function "rds_ib_post_inv" keeps track of the outstanding requests
    via "i_fastunreg_wrs" (post original commit). It also sets:
    frmr->fr_inv = true;

    However the completion handler "rds_ib_mr_cqe_handler" adjusts
    "i_fastreg_wrs" when "fr_inv" had been true, and adjusts
    "i_fastunreg_wrs" otherwise.

    The original commit was done in the name of performance:
    to remove the performance bottleneck

    No performance benefit could be observed with a fixed-up version
    of the original commit measured between two Oracle X7 servers,
    both equipped with Mellanox Connect-X5 HCAs.

    The prudent course of action is to revert this commit.

    Signed-off-by: Gerd Rausch
    Signed-off-by: Santosh Shilimkar

    Gerd Rausch
     
  • RDS composite message (rdma + control) user notification needs to be
    triggered once the full message is delivered, and such a fix was
    added as part of commit 941f8d55f6d61 ("RDS: RDMA: Fix the composite
    message user notification"). But rds_send_remove_from_sock is missing
    the data-part notify check, and hence at times the user doesn't get the
    notification, which isn't desirable.

    One way is to fix rds_send_remove_from_sock to check for that case,
    but considering the ordering complexity with the completion handler,
    and since rdma + control messages are always dispatched back to back
    in the same send context, just delaying the signaled completion on the
    rdma work request also gets the desired behaviour, i.e. notifying the
    application only after the RDMA + control message send completes. So
    this patch updates the earlier fix with that approach. The fix of
    delaying signaled completions of the rdma op till the control message
    send completes was done by Venkat Venkatsubra in the downstream kernel.
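
    The shape of the idea in the send path (heavily hedged; the real flag
    handling differs in detail):

    /* Composite message: leave the RDMA op unsignaled, so the user
     * notification only fires from the control message's completion. */
    if (rm->rdma.op_active && rm->data.op_active)
            send_flags &= ~IB_SEND_SIGNALED;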

    Reviewed-and-tested-by: Zhu Yanjun
    Reviewed-by: Gerd Rausch
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

29 Jun, 2019

1 commit

  • For dependencies in next patches.

    Resolve conflicts:
    - Use uverbs_get_cleared_udata() with new cq allocation flow
    - Continue to delete nes despite SPDX conflict
    - Resolve list appends in mlx5_command_str()
    - Use u16 for vport_rule stuff
    - Resolve list appends in struct ib_client

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

08 Jun, 2019

1 commit


07 Jun, 2019

1 commit

  • When the following test runs for several hours, the problem occurs.

    Server:
    rds-stress -r 1.1.1.16 -D 1M
    Client:
    rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30

    The following output appears:

    "
    Starting up....
    tsks  tx/s  rx/s  tx+rx K/s  mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
       1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
       1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
       1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
       1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
    "
    From vmcore, we can find that clean_list is NULL.

    From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
    Then rds_ib_mr_pool_flush_worker calls
    "
    rds_ib_flush_mr_pool(pool, 0, NULL);
    "
    Then in function
    "
    int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                             int free_all, struct rds_ib_mr **ibmr_ret)
    "
    ibmr_ret is NULL.

    In the source code,
    "
    ...
    list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
    if (ibmr_ret)
            *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);

    /* more than one entry in llist nodes */
    if (clean_nodes->next)
            llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
    ...
    "
    When ibmr_ret is NULL, llist_entry is not executed, and clean_nodes->next
    instead of clean_nodes is added to clean_list.
    So the first clean_nodes entry is discarded; it cannot be used again.
    The workqueue runs periodically, so more and more clean_nodes entries
    are discarded. Eventually clean_list becomes NULL, and the problem
    occurs.
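
    A sketch of the fix: only peel off the first node when there is an
    ibmr_ret to receive it, and re-add whatever remains (hedged):

    if (ibmr_ret) {
            *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
            clean_nodes = clean_nodes->next;
    }
    /* put all remaining clean_nodes back on the clean_list */
    if (clean_nodes)
            llist_add_batch(clean_nodes, clean_tail, &pool->clean_list);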

    Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
    Signed-off-by: Zhu Yanjun
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Zhu Yanjun
     

06 Jun, 2019

2 commits

  • When KASAN is enabled, after several rds connections are
    created and then "rmmod rds_rdma" is run, the following will
    appear:

    "
    BUG rds_ib_incoming (Not tainted): Objects remaining
    in rds_ib_incoming on __kmem_cache_shutdown()

    Call Trace:
    dump_stack+0x71/0xab
    slab_err+0xad/0xd0
    __kmem_cache_shutdown+0x17d/0x370
    shutdown_cache+0x17/0x130
    kmem_cache_destroy+0x1df/0x210
    rds_ib_recv_exit+0x11/0x20 [rds_rdma]
    rds_ib_exit+0x7a/0x90 [rds_rdma]
    __x64_sys_delete_module+0x224/0x2c0
    ? __ia32_sys_delete_module+0x2c0/0x2c0
    do_syscall_64+0x73/0x190
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    "
    This is an rds connection memory leak. The root cause is:
    When "rmmod rds_rdma" is run, rds_ib_remove_one calls
    rds_ib_dev_shutdown to drop the rds connections.
    rds_ib_dev_shutdown calls rds_conn_drop to drop the rds
    connections as below:
    "
    rds_conn_path_drop(&conn->c_path[0], false);
    "
    In the above, destroy is set to false.

    void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
    {
            atomic_set(&cp->cp_state, RDS_CONN_ERROR);

            rcu_read_lock();
            if (!destroy && rds_destroy_pending(cp->cp_conn)) {
                    rcu_read_unlock();
                    return;
            }
            queue_work(rds_wq, &cp->cp_down_w);
            rcu_read_unlock();
    }

    With destroy set to false, rds_destroy_pending is consulted, and the
    rds connections are not moved to ib_nodev_conns.
    So destroy is instead set to true to move the rds connections to
    ib_nodev_conns.
    In rds_ib_unregister_client, flush_workqueue is called to make rds_wq
    finish shutting down the rds connections. The function
    rds_ib_destroy_nodev_conns is called to finally shut down the rds
    connections. Then rds_ib_recv_exit is called to destroy the slabs.

    void rds_ib_recv_exit(void)
    {
            kmem_cache_destroy(rds_ib_incoming_slab);
            kmem_cache_destroy(rds_ib_frag_slab);
    }

    With this, the above slab memory leak will not occur again.
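
    A sketch of the core change (per the reasoning above):

    /* rds_ib_dev_shutdown: pass destroy == true, so rds_destroy_pending
     * is bypassed, the shutdown work is queued, and the connection ends
     * up on ib_nodev_conns for rds_ib_destroy_nodev_conns to reap. */
    rds_conn_path_drop(&conn->c_path[0], true);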

    From tests:
    256 rds connections
    [root@ca-dev14 ~]# time rmmod rds_rdma

    real 0m16.522s
    user 0m0.000s
    sys 0m8.152s
    512 rds connections
    [root@ca-dev14 ~]# time rmmod rds_rdma

    real 0m32.054s
    user 0m0.000s
    sys 0m15.568s

    To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed.
    And with 512 rds connections, about 32 seconds are needed.
    From ftrace, when one rds connection is destroyed:

    "
    19) | rds_conn_destroy [rds]() {
    19) 7.782 us | rds_conn_path_drop [rds]();
    15) | rds_shutdown_worker [rds]() {
    15) | rds_conn_shutdown [rds]() {
    15) 1.651 us | rds_send_path_reset [rds]();
    15) 7.195 us | }
    15) + 11.434 us | }
    19) 2.285 us | rds_cong_remove_conn [rds]();
    19) * 24062.76 us | }
    "
    So when many rds connections are destroyed, the function
    rds_ib_destroy_nodev_conns takes most of the time.

    Suggested-by: Håkon Bugge
    Signed-off-by: Zhu Yanjun
    Signed-off-by: David S. Miller

    Zhu Yanjun
     
  • The variable cache_allocs indicates how many frags (in KiB) are in one
    rds connection frag cache.
    The command "rds-info -Iv" outputs the rds connection cache
    statistics as below:
    "
    RDS IB Connections:
    LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
    1.1.1.14 1.1.1.14 58 255 fe80::2:c903:a:7a31 fe80::2:c903:a:7a31
    send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096,
    rdma_mr_size=257, cache_allocs=12
    "
    This means that there are about 12 KiB of frags in this rds connection
    frag cache.
    Since rds.h in rds-tools is not related to the kernel rds.h, the change
    in the kernel rds.h does not affect rds-tools.
    rds-info from rds-tools 2.0.5 and 2.0.6 has been tested with this commit;
    it works well.

    Signed-off-by: Zhu Yanjun
    Signed-off-by: David S. Miller

    Zhu Yanjun
     

22 May, 2019

1 commit