29 May, 2020

1 commit


20 Dec, 2018

2 commits

  • The commit adds the new trace_events for TIPC socket object:

    trace_tipc_sk_create()
    trace_tipc_sk_poll()
    trace_tipc_sk_sendmsg()
    trace_tipc_sk_sendmcast()
    trace_tipc_sk_sendstream()
    trace_tipc_sk_filter_rcv()
    trace_tipc_sk_advance_rx()
    trace_tipc_sk_rej_msg()
    trace_tipc_sk_drop_msg()
    trace_tipc_sk_release()
    trace_tipc_sk_shutdown()
    trace_tipc_sk_overlimit1()
    trace_tipc_sk_overlimit2()

    It also enables the traces for the following cases:
    - When user creates a TIPC socket;
    - When user calls poll() on TIPC socket;
    - When user sends a dgram/mcast/stream message;
    - When a message is put into the socket 'sk_receive_queue';
    - When a message is released from the socket 'sk_receive_queue';
    - When a message is rejected (e.g. due to no port, invalid, etc.);
    - When a message is dropped (e.g. due to wrong message type);
    - When socket is released;
    - When socket is shut down;
    - When socket rcvq's allocation is overlimit (> 90%);
    - When socket rcvq + bklq's allocation is overlimit (> 90%);
    - When the 'TIPC_ERR_OVERLOAD/2' issue happens;

    Note:
    a) All the socket traces are designed to be able to trace on a specific
    socket by either using the 'event filtering' feature on a known socket
    'portid' value or the sysctl file:

    /proc/sys/net/tipc/sk_filter

    The file determines a 'tuple' for what socket should be traced:

    (portid, sock type, name type, name lower, name upper)

    where:
    + 'portid' is the socket portid generated at socket creation; it can be
    found in the trace outputs or the 'tipc socket list' command printouts;
    + 'sock type' is the socket type (1 = SOCK_STREAM, ...);
    + 'name type', 'name lower' and 'name upper' are the service name being
    connected to or published by the socket.

    Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
    the traces happen for every socket with no filter.
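
    As an illustration of the tuple semantics in note a), the following is a
    minimal user-space sketch, not the kernel implementation. The names
    SkTuple and sk_filter_matches are hypothetical, and comparing every field
    by plain equality is an assumption (the kernel may well treat
    'name lower'/'name upper' as a range).

```python
from typing import NamedTuple

ANY = 0  # per the commit message, a field value of 0 means 'ANY'


class SkTuple(NamedTuple):
    """(portid, sock type, name type, name lower, name upper)."""
    portid: int
    sock_type: int
    name_type: int
    name_lower: int
    name_upper: int


def sk_filter_matches(flt: SkTuple, sk: SkTuple) -> bool:
    """Return True if socket `sk` passes filter `flt`.

    Assumption: each non-zero filter field must equal the socket's field.
    """
    return all(f == ANY or f == s for f, s in zip(flt, sk))
```

    With the default tuple (0, 0, 0, 0, 0) every socket matches, which
    corresponds to tracing with no filter.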

    b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
    which happens when the socket receive queue (and backlog queue) is
    about to be overloaded, when the queue allocation is > 90%. Then, when
    the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
    issue can be traced.

    The trace event is designed as an 'upper watermark' notification that
    the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
    actions can be triggered in the meanwhile to see what is going on with
    the socket queue.

    In addition, the 'trace_tipc_sk_dump()' is also placed at the
    'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
    for post-analysis.
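
    The '> 90%' upper watermark in note b) can be sketched as a simple
    integer comparison. This is only an illustration; the function name
    is_overlimit and its parameters are hypothetical, and the real check
    lives in the kernel socket code.

```python
def is_overlimit(allocated: int, limit: int, percent: int = 90) -> bool:
    """True when `allocated` exceeds `percent`% of `limit`.

    Integer arithmetic is used so that no floating point is needed,
    which is the usual style for this kind of check.
    """
    return allocated * 100 > limit * percent
```

    For the 'overlimit2' case, `allocated` would be the sum of the receive
    queue and backlog queue allocations.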

    Acked-by: Ying Xue
    Tested-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • For the sake of debugging/tracing, the commit enables tracepoints in
    TIPC along with some general trace_events as shown below. It also
    defines some 'tipc_*_dump()' functions that allow dumping TIPC object
    data whenever needed, that is, for general debug purposes, i.e. not just
    for the trace_events.

    The following trace_events are now available:

    - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
    e.g. message type, user, droppable, skb truesize, cloned skb, etc.

    - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
    queues, e.g. TIPC link transmq, socket receive queue, etc.

    - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
    sk state, sk type, connection type, rmem_alloc, socket queues, etc.

    - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
    link state, silent_intv_cnt, gap, bc_gap, link queues, etc.

    - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
    node state, active links, capabilities, link entries, etc.

    How to use:
    Put the trace functions at any place where we want to dump TIPC data
    or events.

    Note:
    a) The dump functions generate raw data only, in order to offload the
    trace event's processing. A tool or script may be required to parse
    the data, but this should be simple.

    b) The trace_tipc_*_dump() should be reserved for failure cases only
    (e.g. the retransmission failure case) or for cases that we do not
    expect to happen too often. We can then consider enabling these events
    by default, since they will have almost no effect under normal
    conditions, but once the rare condition or failure occurs, we get the
    dumped data in full for post-analysis.

    For other trace purposes, we can reuse these trace classes as templates
    for different events.

    c) A trace_event is only effective when we enable it. To enable the
    TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
    directory in the 'debugfs' file system. Normally, they are located at:

    /sys/kernel/debug/tracing/events/tipc/

    For example:

    To enable the tipc_link_dump event:

    echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable

    To enable all the TIPC trace_events:

    echo 1 > /sys/kernel/debug/tracing/events/tipc/enable

    To collect the trace data:

    cat trace

    or

    cat trace_pipe > /trace.out &

    To disable all the TIPC trace_events:

    echo 0 > /sys/kernel/debug/tracing/events/tipc/enable

    To clear the trace buffer:

    echo > trace

    d) As with other trace_events, features like 'filter' or 'trigger'
    are also usable for the tipc trace_events.
    For more details, have a look at:

    Documentation/trace/ftrace.txt
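
    The echo commands in note c) can also be scripted. The following is a
    small sketch under the assumption that the tracing directory is at the
    location given above; the helper names enable_path and set_tipc_events
    are hypothetical, and writing to these files normally requires root.

```python
from pathlib import Path
from typing import Optional

TRACING_ROOT = Path("/sys/kernel/debug/tracing")  # location given in note c)


def enable_path(event: Optional[str] = None, root: Path = TRACING_ROOT) -> Path:
    """Path of the 'enable' file for one TIPC event, or for all of them."""
    base = root / "events" / "tipc"
    return (base / event if event else base) / "enable"


def set_tipc_events(on: bool, event: Optional[str] = None,
                    root: Path = TRACING_ROOT) -> None:
    """Equivalent of `echo 1 > .../enable` (or `echo 0 > .../enable`)."""
    enable_path(event, root).write_text("1\n" if on else "0\n")
```

    For example, set_tipc_events(True, "tipc_link_dump") corresponds to the
    first echo command above, and set_tipc_events(False) disables all the
    TIPC trace_events.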

    MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc

    Acked-by: Ying Xue
    Tested-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

07 Sep, 2018

1 commit

  • __tipc_nl_compat_dumpit() uses a netlink_callback on stack,
    so the only way to align it with other ->dumpit() call path
    is calling tipc_dump_start() and tipc_dump_done() directly
    inside it. Otherwise ->dumpit() would always get NULL from
    cb->args[].

    But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve
    net pointer, the cb->skb here doesn't set skb->sk, the net pointer
    is saved in msg->net instead, so introduce a helper function
    __tipc_dump_start() to pass in msg->net.

    Ying pointed out cb->args[0...3] are already used by other
    callbacks on this call path, so we can't use cb->args[0] any
    more, use cb->args[4] instead.

    Fixes: 9a07efa9aea2 ("tipc: switch to rhashtable iterator")
    Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com
    Cc: Jon Maloy
    Cc: Ying Xue
    Signed-off-by: Cong Wang
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Cong Wang
     

30 Aug, 2018

1 commit

  • syzbot reported a use-after-free in tipc_group_fill_sock_diag(),
    where tipc_group_fill_sock_diag() still reads tsk->group while
    tipc_group_delete() just deletes it in tipc_release().

    tipc_nl_sk_walk() aims to lock this sock when walking each sock
    in the hash table to close race conditions with sock changes like
    this one, by acquiring tsk->sk.sk_lock.slock spinlock, unfortunately
    this doesn't work at all. All non-BH call paths should take
    lock_sock() instead to make it work.

    tipc_nl_sk_walk() brutally iterates with raw rht_for_each_entry_rcu()
    where RCU read lock is required, this is the reason why lock_sock()
    can't be taken on this path. This could be resolved by switching to
    rhashtable iterator API's, where taking a sleepable lock is possible.
    Also, the iterator API's are friendly for restartable calls like
    diag dump, the last position is remembered behind the scenes,
    all we need to do here is saving the iterator into cb->args[].
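
    The idea of a restartable dump that keeps its iterator in the callback
    state can be illustrated in user space as follows. This is only an
    analogy, not the rhashtable API: the dict cb_args stands in for
    cb->args[], and ITER_SLOT and dump_chunk are hypothetical names.

```python
from itertools import islice
from typing import Dict, Iterator, List

ITER_SLOT = 0  # hypothetical slot index used to store the iterator


def dump_chunk(sockets: List[int], cb_args: Dict[int, Iterator[int]],
               budget: int) -> List[int]:
    """Return up to `budget` entries, resuming where the last call stopped."""
    it = cb_args.get(ITER_SLOT)
    if it is None:
        # First call: create the iterator and save it in the callback state.
        it = iter(sockets)
        cb_args[ITER_SLOT] = it
    # The iterator itself remembers the position between calls.
    return list(islice(it, budget))
```

    Each call continues from the saved position, and an empty result
    signals that the dump is complete.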

    I tested this with parallel tipc diag dump and thousands of tipc
    socket creation and release, no crash or memory leak.

    Reported-by: syzbot+b9c8f3ab2994b7cd1625@syzkaller.appspotmail.com
    Cc: Jon Maloy
    Cc: Ying Xue
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

09 Apr, 2018

1 commit

  • Commit 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
    tried to fix the crash but failed, the crash is still 100% reproducible
    with it.

    In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
    correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
    we should use NETLINK_CB(cb->skb).sk here.

    Reported-by:
    Fixes: 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
    Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC)
    Cc: GhantaKrishnamurthy MohanKrishna
    Cc: Jon Maloy
    Cc: Ying Xue
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

23 Mar, 2018

1 commit

  • This commit adds socket diagnostics capability for AF_TIPC in netlink
    family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).

    The following are key design considerations:
    - config TIPC_DIAG has default y, like INET_DIAG.
    - only requests with flag NLM_F_DUMP are supported (dump all).
    - tipc_sock_diag_req message is introduced to send filter parameters.
    - the response attributes are TLVs, some nested.

    To avoid exposing data structures between diag and tipc modules and
    avoid code duplication, the following additions are required:
    - export tipc_nl_sk_walk function to reuse socket iterator.
    - export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
    - create a sock_diag response message in __tipc_add_sock_diag defined
    in diag.c and use the above exported tipc_sk_fill_sock_diag
    to fill response.

    Acked-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: GhantaKrishnamurthy MohanKrishna
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    GhantaKrishnamurthy MohanKrishna
     

04 May, 2016

1 commit

  • There are two flow control mechanisms in TIPC; one at link level that
    handles network congestion, burst control, and retransmission, and one
    at connection level whose only remaining task is to prevent overflow
    in the receiving socket buffer. In TIPC, the latter task has to be
    solved end-to-end because messages can not be thrown away once they
    have been accepted and delivered upwards from the link layer, i.e., we
    can never permit the receive buffer to overflow.

    Currently, this algorithm is message based. A counter in the receiving
    socket keeps track of number of consumed messages, and sends a dedicated
    acknowledge message back to the sender for every 256 consumed messages.
    A counter at the sending end keeps track of the sent, not yet
    acknowledged messages, and blocks the sender if this number ever reaches
    512 unacknowledged messages. When the missing acknowledge arrives, the
    socket is then woken up for renewed transmission. This works well for
    keeping the message flow running, as it almost never happens that a
    sender socket is blocked this way.
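
    The message-based scheme described above can be sketched as two
    counters. This is a simplified user-space model, not the kernel code;
    the class names Sender and Receiver are hypothetical, while the
    constants 256 and 512 come from the text.

```python
ACK_INTERVAL = 256   # receiver acknowledges every 256 consumed messages
WINDOW = 512         # sender blocks at 512 unacknowledged messages


class Receiver:
    def __init__(self) -> None:
        self.consumed = 0

    def consume(self) -> int:
        """Consume one message; return how many messages get acked (0 or 256)."""
        self.consumed += 1
        if self.consumed == ACK_INTERVAL:
            self.consumed = 0
            return ACK_INTERVAL
        return 0


class Sender:
    def __init__(self) -> None:
        self.unacked = 0

    def can_send(self) -> bool:
        return self.unacked < WINDOW

    def send(self) -> None:
        if not self.can_send():
            raise BlockingIOError("window full, wait for acknowledge")
        self.unacked += 1

    def on_ack(self, count: int) -> None:
        """An acknowledge arrived; the sender may be woken up again."""
        self.unacked -= count
```

    Because an acknowledge is sent at half the window size, a receiver
    that keeps consuming will normally reopen the window before the
    sender reaches the limit.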

    A problem with the current mechanism is that it potentially is very
    memory consuming. Since we don't distinguish between small and large
    messages, we have to dimension the socket receive buffer according
    to a worst-case of both. I.e., the window size must be chosen large
    enough to sustain a reasonable throughput even for the smallest
    messages, while we must still consider a scenario where all messages
    are of maximum size. Hence, the current fixed window size of 512 messages
    and a maximum message size of 66k results in a receive buffer of 66 MB
    when truesize(66k) = 131k is taken into account. It is possible to do
    much better.

    This commit introduces an algorithm where we instead use 1024-byte
    blocks as base unit. This unit, always rounded upwards from the
    actual message size, is used when we advertise windows as well as when
    we count and acknowledge transmitted data. The advertised window is
    based on the configured receive buffer size in such a way that even
    the worst-case truesize/msgsize ratio always is covered. Since the
    smallest possible message size (from a flow control viewpoint) now is
    1024 bytes, we can safely assume this ratio to be less than four, which
    is the value we are now using.

    This way, we have been able to reduce the default receive buffer size
    from 66 MB to 2 MB with maintained performance.
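
    The arithmetic behind these figures can be checked with a short
    sketch. The function names are hypothetical, and deriving the
    advertised window as buffer size / block size / 4 is an assumption
    based on the ratio stated above, not a quote of the kernel code.

```python
BLOCK_SIZE = 1024        # base unit introduced by the commit, in bytes
WORST_CASE_RATIO = 4     # assumed upper bound for truesize/msgsize


def msg_blocks(msg_size: int) -> int:
    """Number of 1024-byte blocks a message counts as, rounded upwards."""
    return (msg_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def advertised_blocks(rcvbuf_bytes: int) -> int:
    """Window in blocks, sized so the worst-case ratio still fits the buffer."""
    return rcvbuf_bytes // BLOCK_SIZE // WORST_CASE_RATIO


# Old scheme: 512 messages, each with truesize(66k) = 131k.
old_buffer_kb = 512 * 131   # 67072 kB, i.e. roughly 66 MB

# New scheme: a 2 MB receive buffer advertises this many blocks.
new_window = advertised_blocks(2 * 1024 * 1024)
```

    Under these assumptions a 2 MB buffer advertises 512 blocks, so even
    if every block carried four times its size in truesize the buffer
    would not overflow.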

    In order to keep this solution backwards compatible, we introduce a
    new capability bit in the discovery protocol, and use this throughout
    the message sending/reception path to always select the right unit.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

27 Jul, 2015

1 commit

  • When a message is received in a socket, one of the call chains
    tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
    or
    tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
    is followed. At each of these levels we may encounter situations
    where the message may need to be rejected, or a new message
    produced for transfer back to the sender. Despite recent
    improvements, the current code for doing this is perceived
    as awkward and hard to follow.

    Leveraging the two previous commits in this series, we now
    introduce a more uniform handling of such situations. We
    let each of the functions in the chain itself produce/reverse
    the message to be returned to the sender, but also perform the
    actual forwarding. This simplifies the necessary logics within
    each function.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

18 Mar, 2015

1 commit

  • When the TIPC module is loaded, we launch a topology server in kernel
    space, which in its turn is creating TIPC sockets for communication
    with topology server users. Because both the socket's creator and
    provider reside in the same module, it is necessary that the TIPC
    module's reference count remains zero after the server is started and
    the socket created; otherwise it becomes impossible to perform "rmmod"
    even on an idle module.

    Currently, we achieve this by defining a separate "tipc_proto_kern"
    protocol struct, that is used only for kernel space socket allocations.
    This structure has the "owner" field set to NULL, which prevents the
    module reference count from being bumped when sk_alloc() for local
    sockets is called. Furthermore, we have defined three kernel-specific
    functions, tipc_sock_create_local(), tipc_sock_release_local() and
    tipc_sock_accept_local(), to avoid the module counter being modified
    when module local sockets are created or deleted. This has worked well
    until we introduced name space support.

    However, after name space support was introduced, we have observed that
    a reference count leak occurs, because the netns counter is not
    decremented in tipc_sock_delete_local().

    This commit remedies this problem. But instead of just modifying
    tipc_sock_delete_local(), we eliminate the whole parallel socket
    handling infrastructure, and start using the regular sk_create_kern(),
    kernel_accept() and sk_release_kernel() calls. Since those functions
    manipulate the module counter, we must now compensate for that by
    explicitly decrementing the counter after module local sockets are
    created, and increment it just before calling sk_release_kernel().

    Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace")
    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reported-by: Cong Wang
    Tested-by: Erik Hugne
    Signed-off-by: David S. Miller

    Ying Xue
     

10 Feb, 2015

1 commit

  • Convert socket (port) listing to compat dumpit call. If a socket
    (port) has publications a second dumpit call is issued to collect them
    and format them into the legacy buffer before continuing to process
    the sockets (ports).

    Command converted in this patch:
    TIPC_CMD_SHOW_PORTS

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     

06 Feb, 2015

3 commits

  • In a previous commit in this series we resolved a race problem during
    unicast message reception.

    Here, we resolve the same problem at multicast reception. We apply the
    same technique: an input queue serializing the delivery of arriving
    buffers. The main difference is that here we do it in two steps.
    First, the broadcast link feeds arriving buffers into the tail of an
    arrival queue, whose head is consumed at the socket level, and where
    destination lookup is performed. Second, if the lookup is successful,
    the resulting buffer clones are fed into a second queue, the input
    queue. This queue is consumed at reception in the socket just like
    in the unicast case. Both queues are protected by the same lock, the
    one of the input queue.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The structure 'tipc_port_list' is used to collect port numbers
    representing multicast destination socket on a receiving node.
    The list is not based on a standard linked list, and is in reality
    optimized for the uncommon case that there are more than one
    multicast destinations per node. This makes the list handling
    unecessarily complex, and as a consequence, even the socket
    multicast reception becomes more complex.

    In this commit, we replace 'tipc_port_list' with a new 'struct
    tipc_plist', which is based on a standard list. We give the new
    list stack (push/pop) semantics, something that simplifies
    the implementation of the function tipc_sk_mcast_rcv().
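
    The stack semantics mentioned above can be illustrated with a small
    user-space sketch. PortList and deliver_to_all are hypothetical names,
    and using 0 as the 'empty' return value is an assumption made for the
    example, not something stated in the commit message.

```python
from typing import List


class PortList:
    """Stack-like list of destination port numbers (illustrative stand-in)."""

    def __init__(self) -> None:
        self._ports: List[int] = []

    def push(self, port: int) -> None:
        self._ports.append(port)

    def pop(self) -> int:
        """Remove and return the most recently pushed port, or 0 if empty."""
        return self._ports.pop() if self._ports else 0


def deliver_to_all(plist: PortList) -> List[int]:
    """Drain the list the way a multicast receive loop might: pop until empty."""
    delivered = []
    port = plist.pop()
    while port != 0:
        delivered.append(port)
        port = plist.pop()
    return delivered
```

    The receive loop needs no index or cursor; it simply pops until the
    list is empty, which is the simplification the commit refers to.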

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC handles message cardinality and sequencing at the link layer,
    before passing messages upwards to the destination sockets. During the
    upcall from link to socket no locks are held. It is therefore possible,
    and we see it happen occasionally, that messages arriving in different
    threads and delivered in sequence still bypass each other before they
    reach the destination socket. This must not happen, since it violates
    the sequentiality guarantee.

    We solve this by adding a new input buffer queue to the link structure.
    Arriving messages are added safely to the tail of that queue by the
    link, while the head of the queue is consumed, also safely, by the
    receiving socket. Sequentiality is secured per socket by only allowing
    buffers to be dequeued inside the socket lock. Since there may be multiple
    simultaneous readers of the queue, we use a 'filter' parameter to reduce
    the risk that they peek the same buffer from the queue, hence also
    reducing the risk of contention on the receiving socket locks.

    This solves the sequentiality problem, and seems to cause no measurable
    performance degradation.

    A nice side effect of this change is that lock handling in the functions
    tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
    will enable future simplifications of those functions.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

13 Jan, 2015

4 commits

  • TIPC establishes one subscriber server which allows users to subscribe
    to the name service status they are interested in. Now that tipc
    supports namespaces, one dedicated tipc stack instance is created for
    each namespace, and each instance can be regarded as one independent
    TIPC node. As a result, a subscriber server must be built for each
    namespace.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Now tipc socket table is statically allocated as a global variable.
    Through it, we can look up one socket instance with port ID, insert
    a new socket instance to the table, and delete a socket from the
    table. But when tipc supports net namespace, each namespace must own
    its specific socket table. So the global variable of socket table
    must be redefined in tipc_net structure. As a consequence, a new
    socket table will be allocated when a new namespace is created, and
    a socket table will be deallocated when namespace is destroyed.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Global variables associated with node table are below:
    - node table list (node_htable)
    - node hash table list (tipc_node_list)
    - node table lock (node_list_lock)
    - node number counter (tipc_num_nodes)
    - node link number counter (tipc_num_links)

    To make node table support namespace, above global variables must be
    moved to tipc_net structure in order to keep them isolated between
    namespaces. As a consequence, these variables are allocated and
    initialized when namespace is created, and deallocated when namespace
    is destroyed. After the change, functions associated with these
    variables have to utilize a namespace pointer to access them. So
    adding namespace pointer as a parameter of these functions is the
    major change made in the commit.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Only the work of initializing and shutting down the tipc module is
    done in the core.h and core.c files, so everything that is not closely
    associated with these two tasks should be moved to appropriate places.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     

09 Jan, 2015

1 commit

  • As the tipc reference table is statically allocated, the memory it
    requests at stack initialization is quite big even though the
    maximum port number is currently restricted to 8191; moreover, that
    number has already become insufficient in practice. But if the
    maximum number of ports were raised to its theoretical value of 2^32,
    the memory consumed would reach a ridiculously unacceptable value.
    Apart from this, heavy tipc users spend a considerable amount of time
    in tipc_sk_get() due to the read-lock on ref_table_lock.

    If the tipc reference table is converted to the generic rhashtable,
    both of the above disadvantages are resolved: the new resizable hash
    table avoids locking on lookup, and less memory is required at the
    initial stage, for example, 256 hash bucket slots are requested at
    the beginning instead of allocating the entire 8191 slots as in the
    old mode. The hash table will grow if entries exceed 75% of the table
    size, up to a total table size of 1M, and it will automatically
    shrink if usage falls below 30%, but never below the minimum table
    size of 256.
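
    The grow/shrink thresholds above can be sketched as a small policy
    function. This is an illustration only: next_table_size is a
    hypothetical name, and resizing by doubling or halving is an
    assumption not stated in the commit message.

```python
MIN_SIZE = 256          # initial and minimum number of buckets
MAX_SIZE = 1 << 20      # upper bound of 1M buckets


def next_table_size(size: int, entries: int) -> int:
    """Return the table size after applying the thresholds once."""
    if entries * 100 > size * 75 and size < MAX_SIZE:
        return size * 2      # grow: usage above 75%
    if entries * 100 < size * 30 and size > MIN_SIZE:
        return size // 2     # shrink: usage below 30%
    return size
```

    Starting from 256 buckets, the table stays small for lightly loaded
    systems and only grows as sockets are actually created.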

    Also converts ref_table_lock to a separate mutex to protect hash table
    mutations on write side. Lastly defers the release of the socket
    reference using call_rcu() to allow using an RCU read-side protected
    call to rhashtable_lookup().

    Signed-off-by: Ying Xue
    Acked-by: Jon Maloy
    Acked-by: Erik Hugne
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Ying Xue
     

22 Nov, 2014

2 commits

  • Add TIPC_NL_PUBL_GET command to the new tipc netlink API.

    This command supports dumping of all publications for a specific
    socket.

    Netlink logical layout of request message:
    -> socket
    -> reference

    Netlink logical layout of response message:
    -> publication
    -> type
    -> lower
    -> upper

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Add TIPC_NL_SOCK_GET command to the new tipc netlink API.

    This command supports dumping of all available sockets with their
    associated connection or publication(s). It could be extended to reply
    with a single socket if the NLM_F_DUMP isn't set.

    The information about a socket includes reference, address, connection
    information / publication information.

    Netlink logical layout of response message:
    -> socket
    -> reference
    -> address
    [
    -> connection
    -> node
    -> socket
    [
    -> connected flag
    -> type
    -> instance
    ]
    ]
    [
    -> publication flag
    ]

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     

24 Aug, 2014

5 commits

  • We complete the merging of the port and socket layer by aggregating
    the fields of struct tipc_port directly into struct tipc_sock, and
    moving the combined structure into socket.c.

    We also move all functions and macros that are not any longer
    exposed to the rest of the stack into socket.c, and rename them
    accordingly.

    Despite the size of this commit, there are no functional changes.
    We have only made such changes that are necessary due to the removal
    of struct tipc_port.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The reference table is now 'socket aware' instead of being generic,
    and has in reality become a socket internal table. In order to be
    able to minimize the API exposed by the socket layer towards the rest
    of the stack, we now move the reference table definitions and functions
    into the file socket.c, and rename the functions accordingly.

    There are no functional changes in this commit.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We move the inline functions in the file port.h to socket.c, and modify
    their names accordingly.

    We move struct tipc_port and some macros to socket.h.

    Finally, we remove the file port.h.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The functions tipc_port_get_ports() and tipc_port_reinit() scan over
    all sockets/ports to access each of them. This is done by using a
    dedicated linked list, 'tipc_socks' where all sockets are members. The
    list is in turn protected by a spinlock, 'port_list_lock', while each
    socket is locked by using port_lock at the moment of access.

    In order to reduce complexity and risk of deadlock, we want to get
    rid of the linked list and the accompanying spinlock.

    This is what we do in this commit. Instead of the linked list, we use
    the port registry to scan across the sockets. We also add usage of
    bh_lock_sock() inside the scope of port_lock in both functions, as a
    preparation for the complete removal of port_lock.

    Finally, we move the functions from port.c to socket.c, and rename them
    to tipc_sk_sock_show() and tipc_sk_reinit() respectively.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The current link implementation keeps a linked list of blocked ports/
    sockets that is populated when there is link congestion. The purpose
    of this is to let the link know which users to wake up when the
    congestion abates.

    This adds unnecessary complexity to the data structure and the code,
    since it forces us to involve the link each time we want to delete
    a socket. It also forces us to grab the spinlock port_lock within
    the scope of node_lock. We want to get rid of this direct dependence,
    as well as the deadlock hazard resulting from the usage of port_lock.

    In this commit, we instead let the link keep a list of "wakeup" pseudo
    messages for use in such situations. Those messages are sent to the
    pending sockets via the ordinary message reception path, and wake up
    the socket's owner when they are received.

    This enables us to get rid of the 'waiting_ports' linked lists in struct
    tipc_port that manifest this direct reference. As a consequence, we can
    eliminate another BH entry into the socket, and hence the need to grab
    port_lock. This is a further step in our effort to remove port_lock
    altogether.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

17 Jul, 2014

1 commit

  • We add a new broadcast link transmit function in bclink.c and a new
    receive function in socket.c. The purpose is to move the branching
    between external and internal destination down to the link layer,
    just as we have done with unicast in earlier commits. We also make
    use of the new link-independent fragmentation support that was
    introduced in an earlier commit series.

    This gives a shorter and simpler code path, and makes it possible
    to obtain copy-free buffer delivery to all node local destination
    sockets.

    The new transmission code is added in parallel with the existing one,
    and will be used by the socket multicast send function in the next
    commit in this series.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

28 Jun, 2014

2 commits

  • As a consequence of the recently introduced serialized access
    to the socket in commit 8d94168a761819d10252bab1f8de6d7b202c3baa
    ("tipc: same receive code path for connection protocol and data
    messages") we can make a number of simplifications in the
    detection and handling of connection congestion situations.

    - We don't need to keep two counters, one for sent messages and one
    for acked messages. There is no longer any risk for races between
    acknowledge messages arriving in BH and data message sending
    running in user context. So we merge this into one counter,
    'sent_unacked', which is incremented at sending and subtracted
    from at acknowledge reception.

    - We don't need to set the 'congested' field in tipc_port to
    true before we send the message, and clear it when sending
    is successful. (As a matter of fact, it was never necessary;
    the field was set in link_schedule_port() before any wakeup
    could arrive anyway.)

    - We keep the conditions for link congestion and connection
    congestion separated. There would otherwise be a risk that an arriving
    acknowledge message may wake up a user sleeping because of link
    congestion.

    - We can simplify reception of acknowledge messages.

    We also make some cosmetic/structural changes:

    - We rename the 'congested' field to the more correct 'link_cong'.

    - We rename 'conn_unacked' to 'rcv_unacked'

    - We move the above mentioned fields from struct tipc_port to
    struct tipc_sock.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We simplify the code for receiving connection probes, leveraging the
    recently introduced tipc_msg_reverse() function. We also stick to
    the principle of sending a possible response message directly from
    the calling (tipc_sk_rcv or backlog_rcv) functions, hence making
    the call chain shallower and easier to follow.

    We make one small protocol change here, allowed according to
    the spec. If a protocol message arrives from a remote socket that
    is not the one we are connected to, we currently generate a
    connection abort message and send it to the source. This behavior
    is unnecessary, and might even be a security risk, so instead we
    now choose to simply ignore the message. The consequence for the sender
    is that it will take longer to discover the mistake (until the
    next timeout), but this is an extreme corner case, and may happen
    anyway under other circumstances, so we deem this change acceptable.
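
    A rough user-space sketch of the new behavior might look like this.
    All type, enum and function names below are hypothetical, and
    sketch_msg_reverse() is only a stand-in for the role that
    tipc_msg_reverse() plays; its real signature and behavior are not
    reproduced here.

```c
#include <stdint.h>

enum sketch_msg_type { SK_CONN_PROBE, SK_CONN_PROBE_REPLY, SK_CONN_ACK };
enum sketch_verdict { SK_DROP, SK_CONSUMED, SK_SEND_REPLY };

struct sketch_msg {
    enum sketch_msg_type type;
    uint32_t orig_node, orig_port;
    uint32_t dest_node, dest_port;
};

struct sketch_conn {
    uint32_t peer_node, peer_port; /* the socket we are connected to */
};

/* Swap originator and destination so the buffer can be sent back. */
static void sketch_msg_reverse(struct sketch_msg *m)
{
    uint32_t node = m->orig_node, port = m->orig_port;

    m->orig_node = m->dest_node;
    m->orig_port = m->dest_port;
    m->dest_node = node;
    m->dest_port = port;
}

/*
 * Handle one connection protocol message. The caller (the receive or
 * backlog function in the text) sends 'm' itself when SK_SEND_REPLY
 * is returned, which keeps the call chain shallow.
 */
static enum sketch_verdict sketch_proto_rcv(const struct sketch_conn *c,
                                            struct sketch_msg *m)
{
    /* Not from our peer: ignore silently, no abort message is built. */
    if (m->orig_node != c->peer_node || m->orig_port != c->peer_port)
        return SK_DROP;

    if (m->type == SK_CONN_PROBE) {
        sketch_msg_reverse(m);
        m->type = SK_CONN_PROBE_REPLY;
        return SK_SEND_REPLY;
    }
    return SK_CONSUMED;
}
```

    Returning a verdict instead of transmitting from inside the handler
    is one way to let the calling function send the response directly.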

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

15 May, 2014

2 commits

  • In order to reduce complexity and save a call level during message
    reception at port/socket level, we remove the function tipc_port_rcv()
    and merge its functionality into tipc_sk_rcv().

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The function net/core/sock.c::__release_sock() runs a tight loop
    to move buffers from the socket backlog queue to the receive queue.

    As a security measure, sk_backlog.len of the receiving socket
    is not set to zero until after the loop is finished, i.e., until
    the whole backlog queue has been transferred to the receive queue.
    During this transfer, the data that has already been moved is counted
    both in the backlog queue and the receive queue, hence giving an
    incorrect picture of the available queue space for new arriving buffers.

    This leads to unnecessary rejection of buffers by sk_add_backlog(),
    which in TIPC leads to unnecessarily broken connections.

    In this commit, we compensate for this double accounting by adding
    a counter that keeps track of it. The function socket.c::backlog_rcv()
    receives buffers one by one from __release_sock(), and adds them to the
    socket receive queue. If the transfer is successful, it increases a new
    atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
    transferred buffer. If a new buffer arrives during this transfer and
    finds the socket busy (owned), we attempt to add it to the backlog.
    However, when sk_add_backlog() is called, we adjust the 'limit'
    parameter with the value of the new counter, so that the risk of
    inadvertent rejection is eliminated.

    It should be noted that this change does not invalidate the original
    purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
    upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
    doesn't respect the send window) keeps pumping in buffers to
    sk_add_backlog(), he will eventually reach an upper limit
    (2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
    to the backlog, and the connection will be broken. Ordinary, well-
    behaved senders will never reach this buffer limit at all.
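
    The accounting described above can be modeled in user space roughly
    as follows. This is a simplified sketch, not the kernel code: the
    struct, the helper names and the value of SKETCH_OVERLOAD_LIMIT are
    assumptions, plain integers replace the atomic counter, and resetting
    the counter in transfer_done() is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical value standing in for TIPC_CONN_OVERLOAD_LIMIT. */
#define SKETCH_OVERLOAD_LIMIT 1000u

struct rcv_sketch {
    unsigned int rcvq_len;    /* bytes in the receive queue */
    unsigned int backlog_len; /* bytes in the backlog; zeroed only after the loop */
    unsigned int dupl_rcvcnt; /* bytes currently counted in both queues */
    unsigned int rcvbuf;      /* nominal limit for both queues together */
};

/* Queue space check: both queues together are compared with the limit. */
static bool queues_full(const struct rcv_sketch *s, unsigned int limit)
{
    return s->rcvq_len + s->backlog_len > limit;
}

/* Move one buffer from the backlog to the receive queue. */
static void transfer_one(struct rcv_sketch *s, unsigned int truesize)
{
    s->rcvq_len += truesize; /* backlog_len is deliberately left unchanged */
    if (s->dupl_rcvcnt < 2 * SKETCH_OVERLOAD_LIMIT)
        s->dupl_rcvcnt += truesize; /* record the double accounting */
}

/* A buffer arrives while the socket is owned: raise the limit by the counter. */
static bool add_backlog(struct rcv_sketch *s, unsigned int truesize)
{
    if (queues_full(s, s->rcvbuf + s->dupl_rcvcnt))
        return false;
    s->backlog_len += truesize;
    return true;
}

/* End of the transfer loop: the backlog is empty, nothing is double counted. */
static void transfer_done(struct rcv_sketch *s)
{
    s->backlog_len = 0;
    s->dupl_rcvcnt = 0;
}
```

    In this model the counter stops growing once it has reached
    2 x SKETCH_OVERLOAD_LIMIT, so the compensated limit is bounded even
    if a sender keeps pumping in buffers during the transfer.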

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

13 Mar, 2014

3 commits

  • The practice of naming variables in TIPC is inconsistent, sometimes
    even within the same file.

    In this commit we align variable names and declarations within
    socket.c, and function and macro names within socket.h. We also
    reduce the number of conversion macros to two, in order to make
    usage less obscure.

    These changes are purely cosmetic.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Due to the original one-to-many relation between port and user API
    layers, upcalls to the API have been performed via function pointers,
    installed in struct tipc_port at creation. Since this relation is now
    always one-to-one, we can instead use ordinary function calls.

    We remove the function pointers 'dispatcher' and 'wakeup' from
    struct tipc_port, and replace them with calls to the renamed
    functions tipc_sk_rcv() and tipc_sk_wakeup().

    At the same time we change the name and signature of the functions
    tipc_createport() and tipc_deleteport() to reflect their new role
    as mere initialization/destruction functions.
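
    The difference between the two calling styles can be illustrated
    with a small user-space sketch. The structs and functions below are
    hypothetical and only mirror the idea; they do not reproduce the
    real tipc_port layout or the signatures of tipc_sk_rcv() and
    tipc_sk_wakeup().

```c
#include <stddef.h>

/* Before: upcalls go through pointers installed at port creation. */
struct old_port_sketch {
    int (*dispatcher)(struct old_port_sketch *p, int msg);
    void (*wakeup)(struct old_port_sketch *p);
    int rcv_count;
    int wakeups;
};

/* One possible upper layer plugged in through the pointer. */
static int old_dispatch(struct old_port_sketch *p, int msg)
{
    (void)msg;
    return ++p->rcv_count;
}

static int deliver_old(struct old_port_sketch *p, int msg)
{
    /* Indirect call: the lower layer cannot know who receives. */
    return p->dispatcher ? p->dispatcher(p, msg) : -1;
}

/* After: the socket layer is the only user, so plain functions suffice. */
struct new_sock_sketch {
    int rcv_count;
    int wakeups;
};

static int sketch_sk_rcv(struct new_sock_sketch *tsk, int msg)
{
    (void)msg;
    return ++tsk->rcv_count;
}

static void sketch_sk_wakeup(struct new_sock_sketch *tsk)
{
    tsk->wakeups++;
}

static int deliver_new(struct new_sock_sketch *tsk, int msg)
{
    /* Direct call: no pointer to install, check or dereference. */
    return sketch_sk_rcv(tsk, msg);
}
```

    With a one-to-one relation the indirect version adds a pointer per
    object and a NULL check per call without buying any flexibility.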

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • After the removal of the TIPC native API the relation between
    a tipc_port and its API types is strictly one-to-one, i.e., the
    latter can now only be a socket API. There is therefore no need
    to allocate struct tipc_port and struct sock independently.

    In this commit, we aggregate struct tipc_port into struct tipc_sock,
    hence saving both CPU cycles and structure complexity.

    There are no functional changes in this commit, except for the
    elimination of the separate allocation/freeing of tipc_port.
    All other changes are just adaptations to the new data structure.

    This commit also opens up for further code simplifications and
    code volume reduction, something we will do in later commits.
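
    The aggregation can be pictured with the following user-space
    sketch. The type and function names are made up for illustration and
    the structs are reduced to one field each; the point is only that one
    allocation covers both objects and that either embedded member can be
    converted back to the enclosing structure.

```c
#include <stddef.h>
#include <stdlib.h>

struct sketch_sock { int state; };        /* stands in for struct sock */
struct sketch_port { unsigned int ref; }; /* stands in for struct tipc_port */

struct sketch_tipc_sock {
    struct sketch_sock sk;   /* generic socket part */
    struct sketch_port port; /* embedded, no separate allocation */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct sketch_tipc_sock *sketch_sk_to_tsk(struct sketch_sock *sk)
{
    return sketch_container_of(sk, struct sketch_tipc_sock, sk);
}

static struct sketch_tipc_sock *sketch_port_to_tsk(struct sketch_port *port)
{
    return sketch_container_of(port, struct sketch_tipc_sock, port);
}

/* One allocation and one free now cover both the socket and the port. */
static struct sketch_tipc_sock *sketch_create(void)
{
    return calloc(1, sizeof(struct sketch_tipc_sock));
}

static void sketch_destroy(struct sketch_tipc_sock *tsk)
{
    free(tsk);
}
```

    Since the port lives inside the socket structure, no back pointer
    between the two objects has to be stored or kept consistent.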

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy