06 May, 2019

1 commit


29 Mar, 2019

1 commit

  • When a net namespace is cleaned up, rds_tcp_exit_net() calls
    rds_tcp_kill_sock(). If t_sock is NULL, rds_tcp_kill_sock() does not call
    rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
    the connection, and the worker cp_conn_w is not stopped. The net is then
    freed in net_drop_ns(), while cp_conn_w (rds_connect_worker()) still calls
    rds_tcp_conn_path_connect() and references the 'net' that has already been
    freed.

    In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() sets t_sock = sock
    before sock->ops->connect, but if connect() fails, it calls
    rds_tcp_restore_callbacks() and sets t_sock = NULL. If connect() always
    fails, rds_connect_worker() keeps trying to reconnect, so
    rds_tcp_kill_sock() never cancels the worker cp_conn_w or frees the
    connections.

    Therefore, the condition !tc->t_sock is not needed on the path
    cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock, because tc->t_sock is
    always NULL in this case and there is no other path that cancels cp_conn_w
    and frees the connection. This patch fixes this.

    rds_tcp_kill_sock():
    ...
    if (net != c_net || !tc->t_sock)
    ...
    Acked-by: Santosh Shilimkar

    ==================================================================
    BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
    net/ipv4/af_inet.c:340
    Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721

    CPU: 3 PID: 3721 Comm: kworker/u8:4 Not tainted 5.1.0 #11
    Hardware name: linux,dummy-virt (DT)
    Workqueue: krdsd rds_connect_worker
    Call trace:
    dump_backtrace+0x0/0x3c0 arch/arm64/kernel/time.c:53
    show_stack+0x28/0x38 arch/arm64/kernel/traps.c:152
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x120/0x188 lib/dump_stack.c:113
    print_address_description+0x68/0x278 mm/kasan/report.c:253
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x21c/0x348 mm/kasan/report.c:409
    __asan_report_load4_noabort+0x30/0x40 mm/kasan/report.c:429
    inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340
    __sock_create+0x4f8/0x770 net/socket.c:1276
    sock_create_kern+0x50/0x68 net/socket.c:1322
    rds_tcp_conn_path_connect+0x2b4/0x690 net/rds/tcp_connect.c:114
    rds_connect_worker+0x108/0x1d0 net/rds/threads.c:175
    process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
    worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
    kthread+0x2f0/0x378 kernel/kthread.c:255
    ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

    Allocated by task 687:
    save_stack mm/kasan/kasan.c:448 [inline]
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xd4/0x180 mm/kasan/kasan.c:553
    kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:490
    slab_post_alloc_hook mm/slab.h:444 [inline]
    slab_alloc_node mm/slub.c:2705 [inline]
    slab_alloc mm/slub.c:2713 [inline]
    kmem_cache_alloc+0x14c/0x388 mm/slub.c:2718
    kmem_cache_zalloc include/linux/slab.h:697 [inline]
    net_alloc net/core/net_namespace.c:384 [inline]
    copy_net_ns+0xc4/0x2d0 net/core/net_namespace.c:424
    create_new_namespaces+0x300/0x658 kernel/nsproxy.c:107
    unshare_nsproxy_namespaces+0xa0/0x198 kernel/nsproxy.c:206
    ksys_unshare+0x340/0x628 kernel/fork.c:2577
    __do_sys_unshare kernel/fork.c:2645 [inline]
    __se_sys_unshare kernel/fork.c:2643 [inline]
    __arm64_sys_unshare+0x38/0x58 kernel/fork.c:2643
    __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
    el0_svc_common+0x168/0x390 arch/arm64/kernel/syscall.c:83
    el0_svc_handler+0x60/0xd0 arch/arm64/kernel/syscall.c:129
    el0_svc+0x8/0xc arch/arm64/kernel/entry.S:960

    Freed by task 264:
    save_stack mm/kasan/kasan.c:448 [inline]
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:521
    kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:528
    slab_free_hook mm/slub.c:1370 [inline]
    slab_free_freelist_hook mm/slub.c:1397 [inline]
    slab_free mm/slub.c:2952 [inline]
    kmem_cache_free+0xb8/0x3a8 mm/slub.c:2968
    net_free net/core/net_namespace.c:400 [inline]
    net_drop_ns.part.6+0x78/0x90 net/core/net_namespace.c:407
    net_drop_ns net/core/net_namespace.c:406 [inline]
    cleanup_net+0x53c/0x6d8 net/core/net_namespace.c:569
    process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
    worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
    kthread+0x2f0/0x378 kernel/kthread.c:255
    ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

    The buggy address belongs to the object at ffff8003496a3f80
    which belongs to the cache net_namespace of size 7872
    The buggy address is located 1796 bytes inside of
    7872-byte region [ffff8003496a3f80, ffff8003496a5e40)
    The buggy address belongs to the page:
    page:ffff7e000d25a800 count:1 mapcount:0 mapping:ffff80036ce4b000
    index:0x0 compound_mapcount: 0
    flags: 0xffffe0000008100(slab|head)
    raw: 0ffffe0000008100 dead000000000100 dead000000000200 ffff80036ce4b000
    raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8003496a4580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8003496a4600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8003496a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8003496a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8003496a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
    Reported-by: Hulk Robot
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller

    Mao Wenan
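
The effect of dropping the !tc->t_sock test, as described in the commit above, can be illustrated with a small userspace model. This is only a sketch: the struct and function names below (model_net, model_tcp_conn, skip_before_fix, skip_after_fix, count_destroyed) are hypothetical stand-ins, not kernel code, and the model ignores locking and work cancellation entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel objects named above. */
struct model_net { int id; };

struct model_tcp_conn {
	const struct model_net *c_net; /* namespace owning the connection */
	const void *t_sock;            /* NULL while connect() keeps failing */
};

/* Old filter: skip other namespaces AND entries without a socket. */
bool skip_before_fix(const struct model_net *net,
		     const struct model_tcp_conn *tc)
{
	return net != tc->c_net || !tc->t_sock;
}

/* Filter without the t_sock test: only the namespace decides. */
bool skip_after_fix(const struct model_net *net,
		    const struct model_tcp_conn *tc)
{
	return net != tc->c_net;
}

/* Count how many connections a netns teardown would select for destruction. */
size_t count_destroyed(const struct model_net *net,
		       const struct model_tcp_conn *conns, size_t n,
		       bool (*skip)(const struct model_net *,
				    const struct model_tcp_conn *))
{
	size_t destroyed = 0;

	for (size_t i = 0; i < n; i++)
		if (!skip(net, &conns[i]))
			destroyed++;
	return destroyed;
}
```

With one connected and one socket-less connection in the dying namespace, the old filter selects only the first, which leaves the second connection's reconnect worker running against freed memory in the scenario the commit describes.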
     

05 Feb, 2019

2 commits

  • RDMA transport maps user tos to underlying virtual lanes (VL)
    for IB or DSCP values. RDMA CM transport abstracts that for
    RDS. TCP transport makes use of default priority 0 and maps
    all user tos values to it.

    Reviewed-by: Sowmini Varadhan
    Signed-off-by: Santosh Shilimkar
    [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
    Signed-off-by: Zhu Yanjun

    Santosh Shilimkar
     
  • RDS Service type (TOS) is user-defined and needs to be configured
    via the RDS IOCTL interface. It must be set before initiating any
    traffic, and once set the TOS cannot be changed. All outgoing
    traffic from the socket will be associated with its TOS.

    Reviewed-by: Sowmini Varadhan
    Signed-off-by: Santosh Shilimkar
    [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
    Signed-off-by: Zhu Yanjun

    Santosh Shilimkar
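
The two TOS rules in the commits above (set once before any traffic, and the TCP transport mapping every value to priority 0) might be modelled as follows. This is a sketch under assumptions: model_rds_sock, model_set_tos and model_tcp_tos_map are hypothetical names, the choice of -EINVAL as the error code is an assumption, and the model treats a TOS of 0 as "not configured".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a socket's TOS state; not the kernel's socket struct. */
struct model_rds_sock {
	uint8_t tos;          /* 0 means "not configured yet" in this model */
	bool traffic_started; /* set once the socket has sent anything */
};

/* Set the TOS once, before traffic; reject every later attempt. */
int model_set_tos(struct model_rds_sock *rs, uint8_t tos)
{
	if (rs->tos != 0 || rs->traffic_started)
		return -EINVAL; /* assumed error code */
	rs->tos = tos;
	return 0;
}

/* TCP transport as described above: every user TOS maps to priority 0. */
uint8_t model_tcp_tos_map(uint8_t tos)
{
	(void)tos;
	return 0;
}
```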
     

02 Jan, 2019

1 commit


22 Aug, 2018

1 commit


02 Aug, 2018

1 commit


25 Jul, 2018

1 commit


24 Jul, 2018

3 commits

  • There are many data structures (RDS socket options) used by RDS apps
    which use a 32 bit integer to store IP address. To support IPv6,
    struct in6_addr needs to be used. To ensure backward compatibility, a
    new data structure is introduced for each of those data structures
    which use a 32 bit integer to represent an IP address. And new socket
    options are introduced to use those new structures. This means that
    existing apps should work without a problem with the new RDS module.
    For apps which want to use IPv6, those new data structures and socket
    options can be used. IPv4 mapped address is used to represent IPv4
    address in the new data structures.

    v4: Revert changes to SO_RDS_TRANSPORT

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     
  • This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
    listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
    connection requests. RDS/RDMA/IB uses a private data (struct
    rds_ib_connect_private) exchange between endpoints at RDS connection
    establishment time to support RDMA. This private data exchange uses a
    32 bit integer to represent an IP address. This needs to be changed in
    order to support IPv6. A new private data struct
    rds6_ib_connect_private is introduced to handle this. To ensure
    backward compatibility, an IPv6 capable RDS stack uses another RDMA
    listener port (RDS_CM_PORT) to accept IPv6 connection. And it
    continues to use the original RDS_PORT for IPv4 RDS connections. When
    it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to
    send the connection set up request.

    v5: Fixed syntax problem (David Miller).

    v4: Changed port history comments in rds.h (Sowmini Varadhan).

    v3: Added support to set up IPv4 connection using mapped address
    (David Miller).
    Added support to set up connection between link local and non-link
    addresses.
    Various review comments from Santosh Shilimkar and Sowmini Varadhan.

    v2: Fixed bound and peer address scope mismatched issue.
    Added back rds_connect() IPv6 changes.

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     
  • This patch changes the internal representation of an IP address to use
    struct in6_addr. IPv4 address is stored as an IPv4 mapped address.
    All the functions which take an IP address as argument are also
    changed to use struct in6_addr. However, the RDS socket layer is not
    modified, so it still does not accept IPv6 addresses from an application,
    and the RDS layer neither accepts nor initiates IPv6 connections.

    v2: Fixed sparse warnings.

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
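
The IPv4-mapped representation used throughout the three IPv6 commits above stores an IPv4 address a.b.c.d as ::ffff:a.b.c.d inside a struct in6_addr. A minimal userspace sketch of that encoding, using only standard socket headers, could look like this. The helper names v4_to_mapped and mapped_to_v4 are made up for the example and are not the kernel's helpers.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Store an IPv4 address (network byte order) as ::ffff:a.b.c.d. */
void v4_to_mapped(struct in6_addr *dst, struct in_addr v4)
{
	memset(dst, 0, sizeof(*dst));
	dst->s6_addr[10] = 0xff;
	dst->s6_addr[11] = 0xff;
	memcpy(&dst->s6_addr[12], &v4.s_addr, sizeof(v4.s_addr));
}

/* Recover the IPv4 address; returns false if addr is not IPv4-mapped. */
bool mapped_to_v4(const struct in6_addr *addr, struct in_addr *v4)
{
	if (!IN6_IS_ADDR_V4MAPPED(addr))
		return false;
	memcpy(&v4->s_addr, &addr->s6_addr[12], sizeof(v4->s_addr));
	return true;
}
```

Keeping one 128-bit representation means comparison and hashing code needs only a single address type, while the mapped prefix still lets the IPv4 case be recognised when it matters.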
     

28 Mar, 2018

1 commit


22 Mar, 2018

1 commit

  • The netns deletion path does not need to wait for all net_devices
    to be unregistered before dismantling rds_tcp state for the netns
    (we are able to dismantle this state on module unload even when
    all net_devices are active so there is no dependency here).

    This patch removes code related to netdevice notifiers and
    refactors all the code needed to dismantle rds_tcp state
    into a ->exit callback for the pernet_operations used with
    register_pernet_device().

    Signed-off-by: Sowmini Varadhan
    Reviewed-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

18 Mar, 2018

1 commit

  • rds_tcp_connection allocation/free management has the potential to be
    called from __rds_conn_create after IRQs have been disabled, so
    spin_[un]lock_bh cannot be used with rds_tcp_conn_lock.

    Bottom-halves that need to synchronize for critical sections protected
    by rds_tcp_conn_lock should instead use rds_destroy_pending() correctly.

    Reported-by: syzbot+c68e51bb5e699d3f8d91@syzkaller.appspotmail.com
    Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize
    netns/module teardown and rds connection/workq management")
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

13 Mar, 2018

1 commit

  • These pernet_operations create and destroy a sysctl table
    and a listen socket. The exit method also flushes the global
    workqueue and work. Everything looks per-net safe,
    so we can mark them async.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

13 Feb, 2018

1 commit

  • Changes since v1:
    Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

    Before:
    All these functions either return a negative error indicator,
    or store length of sockaddr into "int *socklen" parameter
    and return zero on success.

    "int *socklen" parameter is awkward. For example, if caller does not
    care, it still needs to provide on-stack storage for the value
    it does not need.

    None of the many FOO_getname() functions of various protocols
    ever used old value of *socklen. They always just overwrite it.

    This change drops this parameter, and makes all these functions, on success,
    return length of sockaddr. It's always >= 0 and can be differentiated
    from an error.

    Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

    rpc_sockname() lost "int buflen" parameter, since its only use was
    to be passed to kernel_getsockname() as &buflen and subsequently
    not used in any way.

    Userspace API is not changed.

        text    data     bss      dec     hex filename
    30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
    30108109 2633612  873672 33615393 200ee21 vmlinux.o

    Signed-off-by: Denys Vlasenko
    CC: David S. Miller
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    CC: linux-bluetooth@vger.kernel.org
    CC: linux-decnet-user@lists.sourceforge.net
    CC: linux-wireless@vger.kernel.org
    CC: linux-rdma@vger.kernel.org
    CC: linux-sctp@vger.kernel.org
    CC: linux-nfs@vger.kernel.org
    CC: linux-x25@vger.kernel.org
    Signed-off-by: David S. Miller

    Denys Vlasenko
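
The calling convention introduced by the commit above (return the sockaddr length on success, a negative error otherwise) can be sketched in userspace C. The names model_sock, model_getname and model_peer_port are hypothetical and only illustrate the convention; they are not any protocol's real getname implementation, and -ENOTCONN is an assumed error code.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical endpoint; stands in for a protocol's socket state. */
struct model_sock {
	int bound;
	struct sockaddr_in addr;
};

/* New convention: sockaddr length (>= 0) on success, -errno on failure. */
int model_getname(const struct model_sock *sk, struct sockaddr_storage *out)
{
	if (!sk->bound)
		return -ENOTCONN; /* assumed error code */
	memcpy(out, &sk->addr, sizeof(sk->addr));
	return (int)sizeof(sk->addr);
}

/* Caller side: "if (err)" would misread a positive length as a failure. */
int model_peer_port(const struct model_sock *sk)
{
	struct sockaddr_storage ss;
	struct sockaddr_in sin;
	int len = model_getname(sk, &ss);

	if (len < 0)
		return len;
	memcpy(&sin, &ss, sizeof(sin));
	return ntohs(sin.sin_port);
}
```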
     

09 Feb, 2018

1 commit

  • … connection/workq management

    An rds_connection can get added during netns deletion between lines 528
    and 529 of

    506 static void rds_tcp_kill_sock(struct net *net)
          :
          /* code to pull out all the rds_connections that should be destroyed */
          :
    528         spin_unlock_irq(&rds_tcp_conn_lock);
    529         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
    530                 rds_conn_destroy(tc->t_cpath->cp_conn);

    Such an rds_connection would miss out the rds_conn_destroy()
    loop (that cancels all pending work) and (if it was scheduled
    after netns deletion) could trigger the use-after-free.

    A similar race-window exists for the module unload path
    in rds_tcp_exit -> rds_tcp_destroy_conns

    Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled
    by checking check_net() before enqueuing new work or adding new
    connections.

    Concurrency with module-unload is handled by maintaining a module
    specific flag that is set at the start of the module exit function,
    and must be checked before enqueuing new work or adding new connections.

    This commit refactors existing RDS_DESTROY_PENDING checks added by
    commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with
    connection teardown") and consolidates all the concurrency checks
    listed above into the function rds_destroy_pending().

    Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
    Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Sowmini Varadhan
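
The consolidation described in the commit above, with one predicate that covers both netns deletion and module unload and is consulted before any new work is enqueued, might look like this in a simplified userspace model. All names here (model_net, model_transport, model_conn, model_destroy_pending, model_queue_work) are hypothetical, and the RCU synchronisation that the kernel relies on is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; field names are illustrative only. */
struct model_net { bool alive; };           /* false once netns teardown began */
struct model_transport { bool unloading; }; /* set at start of module exit */

struct model_conn {
	const struct model_net *net;
	const struct model_transport *trans;
};

/* One predicate for both teardown races described above. */
bool model_destroy_pending(const struct model_conn *conn)
{
	return !conn->net->alive || conn->trans->unloading;
}

/* Work is enqueued only while neither teardown path is active. */
bool model_queue_work(const struct model_conn *conn, int *queued)
{
	if (model_destroy_pending(conn))
		return false;
	(*queued)++;
	return true;
}
```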
     

24 Jan, 2018

1 commit


23 Jan, 2018

1 commit

  • rds-tcp uses m_ack_seq to track the tcp ack# that indicates
    that the peer has received a rds_message. The m_ack_seq is
    used in rds_tcp_is_acked() to figure out when it is safe to
    drop the rds_message from the RDS retransmit queue.

    The m_ack_seq must be calculated as an offset from the right
    edge of the in-flight tcp buffer, i.e., it should be based on
    the ->write_seq, not the ->snd_nxt.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
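
Why the ack sequence number has to be derived from ->write_seq can be seen in a small model. This is a sketch only: model_ack_seq and model_is_acked are hypothetical helpers, and the model defines the ack sequence as the number just past the message's last byte, which may differ by a constant from the kernel's own bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sequence number just past a message of msg_len bytes queued at write_seq.
 * write_seq is the right edge of the data handed to TCP, sent or not.
 */
uint32_t model_ack_seq(uint32_t write_seq, uint32_t msg_len)
{
	return write_seq + msg_len;
}

/* Wrap-safe test: has the peer acknowledged everything before ack_seq? */
bool model_is_acked(uint32_t snd_una, uint32_t ack_seq)
{
	return (int32_t)(snd_una - ack_seq) >= 0;
}
```

If 500 bytes are still queued but unsent (snd_nxt = 1000, write_seq = 1500), a 200-byte message ends at 1700. Computing the end from snd_nxt would give 1200, so the message would be treated as acknowledged while its bytes had not even been transmitted.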
     

28 Dec, 2017

2 commits


02 Dec, 2017

3 commits

  • The rds_tcp_kill_sock() function parses the rds_tcp_conn_list
    to find the rds_connection entries marked for deletion as part
    of the netns deletion under the protection of the rds_tcp_conn_lock.
    Since the rds_tcp_conn_list tracks rds_tcp_connections (which
    have a 1:1 mapping with rds_conn_path), multiple tc entries in
    the rds_tcp_conn_list will map to a single rds_connection, and will
    be deleted as part of the rds_conn_destroy() operation that is
    done outside the rds_tcp_conn_lock.

    The rds_tcp_conn_list traversal done under the protection of
    rds_tcp_conn_lock should not leave any doomed tc entries in
    the list after the rds_tcp_conn_lock is released, else another
    concurrently executing netns delete thread (for a different netns)
    may trip on these entries.

    Reported-by: syzbot
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
    introduces a regression in rds-tcp netns cleanup. cleanup_net()
    (and thus the rds_tcp_dev_event notification) is only called from put_net()
    when all netns refcounts go to 0, but this cannot happen if the
    rds_connection itself is holding a c_net ref that it expects to
    release in rds_tcp_kill_sock.

    Instead, the rds_tcp_kill_sock callback should make sure to
    tear down state carefully, ensuring that the socket teardown
    is only done after all data-structures and workqs that depend
    on it are quiesced.

    The original motivation for commit 8edc3affc077 ("rds: tcp: Take explicit
    refcounts on struct net") was to resolve a race condition reported by
    syzkaller where workqs for tx/rx/connect were triggered after the
    namespace was deleted. Those worker threads should have been
    cancelled/flushed before socket tear-down and indeed,
    rds_conn_path_destroy() does try to sequence this by doing
    /* cancel cp_send_w */
    /* cancel cp_recv_w */
    /* flush cp_down_w */
    /* free data structures */
    Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
    invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
    we ought to have satisfied the requirement that "socket-close is
    done after all other dependent state is quiesced". However,
    rds_conn_shutdown has a bug in that it *always* triggers the reconnect
    workq (and if connection is successful, we always restart tx/rx
    workqs so with the right timing, we risk the race conditions reported
    by syzkaller).

    Netns deletion is like module teardown: there is no need to restart a
    reconnect in this case. We can use the c_destroy_in_prog bit
    to avoid restarting the reconnect.

    Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • A side-effect of Commit c14b0366813a ("rds: tcp: set linger to 1
    when unloading a rds-tcp") is that we always send a RST on the tcp
    connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy()
    is not needed any more and is removed in this patch.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

17 Jul, 2017

1 commit

  • We could end up executing rds_conn_shutdown before the rds_recv_worker
    thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
    sock_release and set sock->sk to null, which may interleave in bad
    ways with rds_recv_worker, e.g., it could result in:

    "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
    [ffff881769f6fd70] release_sock at ffffffff815f337b
    [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
    [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
    [ffff881769f6fde0] process_one_work at ffffffff810a14c1
    [ffff881769f6fe40] worker_thread at ffffffff810a1940
    [ffff881769f6fec0] kthread at ffffffff810a6b1e

    Also, do not enqueue any new shutdown workq items when the connection is
    shutting down (this may happen for rds-tcp in softirq mode, if a FIN
    or CLOSE is received while the module is in the middle of an unload).

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

27 Apr, 2017

1 commit

  • …uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess

    Al Viro
     

06 Apr, 2017

1 commit


08 Mar, 2017

3 commits

  • Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in
    rds_tcp_listen_data_ready") added the function
    rds_tcp_listen_sock_def_readable() to handle the case when a
    partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
    However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
    through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
    null and we would hit a panic of the form
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: (null)
    :
    ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
    tcp_data_queue+0x39d/0x5b0
    tcp_rcv_established+0x2e5/0x660
    tcp_v4_do_rcv+0x122/0x220
    tcp_v4_rcv+0x8b7/0x980
    :
    In the above case, it is not fatal to encounter a NULL value for
    ready- we should just drop the packet and let the flush of the
    acceptor thread finish gracefully.

    In general, the tear-down sequence for listen() and accept() socket
    that is ensured by this commit is:
      rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
      In rds_tcp_listen_stop():
          serialize with, and prevent, further callbacks using lock_sock()
          flush rds_wq
          flush acceptor workq
          sock_release(listen socket)

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Initialization in rds_tcp_init must be ordered so that resources
    are set up and destroyed in the correct sequence with respect to
    both the data path and the netns create/destroy path.
    Specifically,

    - we must call register_pernet_subsys and get the rds_tcp_netid
    before calling register_netdevice_notifier, otherwise we risk
    the sequence
    1. register_netdevice_notifier sets up netdev notifier callback
    2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
    the wrong rtn, resulting in a panic with string that is of the form:

    BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
    IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
    :

    - the rds_tcp_incoming_slab kmem_cache must be initialized before the
    datapath starts up. The latter can happen any time after the
    pernet_subsys registration of rds_tcp_net_ops, whose -> init
    function sets up the listen socket. If the rds_tcp_incoming_slab has
    not been set up at that time, a panic of the form below may be
    encountered

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
    IP: kmem_cache_alloc+0x90/0x1c0
    :
    rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
    tcp_read_sock+0x96/0x1c0
    rds_tcp_recv_path+0x65/0x80 [rds_tcp]
    :

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • It is incorrect for the rds_connection to piggyback on the
    sock_net() refcount for the netns because this gives rise to
    a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
    take a ref on the net, and hold the netns down till the connection
    tear-down is complete.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

04 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • When the function register_netdevice_notifier fails, the memory
    allocated by kmem_cache_create should be freed by the function
    kmem_cache_destroy.

    Cc: Joe Jin
    Cc: Junxiao Bi
    Signed-off-by: Zhu Yanjun
    Acked-by: Santosh Shilimkar
    Acked-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Zhu Yanjun
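
The error-path rule in the commit above (a failure in a later init step must release what earlier steps allocated) is usually written with goto-based unwinding. Below is a userspace sketch of that pattern. malloc/free stand in for kmem_cache_create()/kmem_cache_destroy(), and model_register_notifier, model_notifier_should_fail and the other model_* names are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the slab cache created during init. */
static void *model_cache;
int model_notifier_should_fail; /* test knob, not part of any real API */

/* Hypothetical stand-in for register_netdevice_notifier(). */
static int model_register_notifier(void)
{
	return model_notifier_should_fail ? -ENOMEM : 0;
}

/* Init with goto unwinding: a later failure releases earlier resources. */
int model_init(void)
{
	int ret;

	model_cache = malloc(64);
	if (!model_cache)
		return -ENOMEM;

	ret = model_register_notifier();
	if (ret)
		goto out_cache;
	return 0;

out_cache:
	free(model_cache);
	model_cache = NULL;
	return ret;
}

/* Release the cache on normal shutdown. */
void model_exit(void)
{
	free(model_cache);
	model_cache = NULL;
}

int model_cache_is_live(void)
{
	return model_cache != NULL;
}
```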
     

04 Dec, 2016

1 commit

  • Couple conflicts resolved here:

    1) In the MACB driver, a bug fix to properly initialize the
    RX tail pointer overlapped with some changes
    to support variable sized rings.

    2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
    overlapping with a reorganization of the driver to support
    ACPI, OF, as well as PCI variants of the chip.

    3) In 'net' we had several probe error path bug fixes to the
    stmmac driver, meanwhile a lot of this code was cleaned up
    and reorganized in 'net-next'.

    4) The cls_flower classifier obtained a helper function in
    'net-next' called __fl_delete() and this overlapped with
    Daniel Borkmann's bug fix to use RCU for object destruction
    in 'net'. It also overlapped with Jiri's change to guard
    the rhashtable_remove_fast() call with a check against
    tc_skip_sw().

    5) In mlx4, a revert bug fix in 'net' overlapped with some
    unrelated changes in 'net-next'.

    6) In geneve, a stale header pointer after pskb_expand_head()
    bug fix in 'net' overlapped with a large reorganization of
    the same code in 'net-next'. Since the 'net-next' code no
    longer had the bug in question, there was nothing to do
    other than to simply take the 'net-next' hunks.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Dec, 2016

1 commit


18 Nov, 2016

1 commit

  • Make struct pernet_operations::id unsigned.

    There are 2 reasons to do so:

    1)
    This field is really an index into a zero-based array and
    is thus an unsigned entity. Using a negative value is an
    out-of-bounds access by definition.

    2)
    On x86_64, unsigned 32-bit data which are mixed with pointers
    via array indexing or offsets added to or subtracted from pointers
    are preferred to signed 32-bit data.

    "int" being used as an array index needs to be sign-extended
    to 64-bit before being used.

    void f(long *p, int i)
    {
            g(p[i]);
    }

    roughly translates to

    movsx rsi, esi
    mov rdi, [rsi+...]
    call g

    MOVSX is a 3-byte instruction which isn't necessary if the variable is
    unsigned, because x86_64 zero-extends by default.

    Now, there is net_generic() function which, you guessed it right, uses
    "int" as an array index:

    static inline void *net_generic(const struct net *net, int id)
    {
            ...
            ptr = ng->ptr[id - 1];
            ...
    }

    And this function is used a lot, so those sign extensions add up.

    Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
    messing with code generation):

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.
    This is a seemingly random artefact of code generation, with the register
    allocator being used differently. gcc decides that some variable
    needs to live in the new r8+ registers, so every access now requires a REX
    prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
    used, which is longer than [r8].

    However, overall balance is in negative direction:

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
    function                                     old     new   delta
    nfsd4_lock                                  3886    3959     +73
    tipc_link_build_proto_msg                   1096    1140     +44
    mac80211_hwsim_new_radio                    2776    2808     +32
    tipc_mon_rcv                                1032    1058     +26
    svcauth_gss_legacy_init                     1413    1429     +16
    tipc_bcbase_select_primary                   379     392     +13
    nfsd4_exchange_id                           1247    1260     +13
    nfsd4_setclientid_confirm                    782     793     +11
    ...
    put_client_renew_locked                      494     480     -14
    ip_set_sockfn_get                            730     716     -14
    geneve_sock_add                              829     813     -16
    nfsd4_sequence_done                          721     703     -18
    nlmclnt_lookup_host                          708     686     -22
    nfsd4_lockt                                 1085    1063     -22
    nfs_get_client                              1077    1050     -27
    tcf_bpf_init                                1106    1076     -30
    nfsd4_encode_fattr                          5997    5930     -67
    Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
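
The indexing pattern discussed in the commit above can be reproduced in a few lines of userspace C. This is a simplified sketch, not the kernel's net_generic(): model_net_generic, struct model_net_generic and MODEL_MAX_IDS are hypothetical, and the bounds assertion is added here only for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_IDS 8

/* Hypothetical stand-in for the per-namespace pointer array. */
struct model_net_generic {
	void *ptr[MODEL_MAX_IDS];
};

/* ids start at 1; an unsigned id needs no sign extension when used as index. */
void *model_net_generic(const struct model_net_generic *ng, unsigned int id)
{
	assert(id >= 1 && id <= MODEL_MAX_IDS);
	return ng->ptr[id - 1];
}
```

With an unsigned parameter a negative id cannot be expressed at all, so the only invalid inputs left to guard against are 0 and values past the end of the array.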
     

10 Nov, 2016

1 commit


16 Jul, 2016

3 commits

  • Use RDS probe-ping to compute how many paths may be used with
    the peer, and to synchronously start the multiple paths. When
    multipath RDS (mprds) is supported by the transport, hash outgoing
    traffic to one of the multiple paths in rds_sendmsg().

    CC: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Some code duplication in rds_tcp_reset_callbacks() can be avoided
    by having the function call rds_tcp_restore_callbacks() and
    rds_tcp_set_callbacks().

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • As the existing comments in rds_tcp_listen_data_ready() indicate,
    it is possible under some race-windows to get to this function with the
    accept() socket. If that happens, we could run into a sequence whereby

    thread 1                                thread 2

    rds_tcp_accept_one() thread
    sets up new_sock via ->accept().
    The sk_user_data is now
    sock_def_readable
                                            data comes in for new_sock,
                                            ->sk_data_ready is called, and
                                            we land in rds_tcp_listen_data_ready
    rds_tcp_set_callbacks()
    takes the sk_callback_lock and
    sets up sk_user_data to be the cp
                                            read_lock sk_callback_lock
                                            ready = cp
                                            unlock sk_callback_lock
                                            page fault on ready

    In the above sequence, we end up with a panic on a bad page reference
    when trying to execute (*ready)(). Instead we need to call
    sock_def_readable() safely, which is what this patch achieves.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
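
The path selection mentioned in the multipath commit of this group (hashing outgoing traffic to one of several paths) can be sketched as below. The hash function, the use of the sender's bound port as the key, and the names model_hash32 and model_pick_path are all assumptions made for illustration; the commit text does not say which key or hash the kernel uses.

```c
#include <stdint.h>

/* Hypothetical integer mixer; not the hash used by the kernel. */
uint32_t model_hash32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x45d9f3bu;
	x ^= x >> 16;
	return x;
}

/* Pick a path index in [0, npaths); npaths <= 1 means single-path. */
unsigned int model_pick_path(uint16_t bound_port, unsigned int npaths)
{
	if (npaths <= 1)
		return 0;
	return model_hash32(bound_port) % npaths;
}
```

Because the index depends only on the key and the path count, all traffic from one socket stays on one path, which keeps ordering per socket while spreading different sockets across paths.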
     

07 Jul, 2016

1 commit