04 May, 2016

1 commit

  • There are two flow control mechanisms in TIPC; one at link level that
    handles network congestion, burst control, and retransmission, and one
    at connection level which' only remaining task is to prevent overflow
    in the receiving socket buffer. In TIPC, the latter task has to be
    solved end-to-end because messages can not be thrown away once they
    have been accepted and delivered upwards from the link layer, i.e, we
    can never permit the receive buffer to overflow.

    Currently, this algorithm is message based. A counter in the receiving
    socket keeps track of number of consumed messages, and sends a dedicated
    acknowledge message back to the sender for each 256 consumed message.
    A counter at the sending end keeps track of the sent, not yet
    acknowledged messages, and blocks the sender if this number ever reaches
    512 unacknowledged messages. When the missing acknowledge arrives, the
    socket is then woken up for renewed transmission. This works well for
    keeping the message flow running, as it almost never happens that a
    sender socket is blocked this way.

    A problem with the current mechanism is that it potentially is very
    memory consuming. Since we don't distinguish between small and large
    messages, we have to dimension the socket receive buffer according
    to a worst-case of both. I.e., the window size must be chosen large
    enough to sustain a reasonable throughput even for the smallest
    messages, while we must still consider a scenario where all messages
    are of maximum size. Hence, the current fix window size of 512 messages
    and a maximum message size of 66k results in a receive buffer of 66 MB
    when truesize(66k) = 131k is taken into account. It is possible to do
    much better.

    This commit introduces an algorithm where we instead use 1024-byte
    blocks as base unit. This unit, always rounded upwards from the
    actual message size, is used when we advertise windows as well as when
    we count and acknowledge transmitted data. The advertised window is
    based on the configured receive buffer size in such a way that even
    the worst-case truesize/msgsize ratio always is covered. Since the
    smallest possible message size (from a flow control viewpoint) now is
    1024 bytes, we can safely assume this ratio to be less than four, which
    is the value we are now using.

    This way, we have been able to reduce the default receive buffer size
    from 66 MB to 2 MB with maintained performance.

    In order to keep this solution backwards compatible, we introduce a
    new capability bit in the discovery protocol, and use this throughout
    the message sending/reception path to always select the right unit.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

27 Jul, 2015

1 commit

  • When a message is received in a socket, one of the call chains
    tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
    or
    tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
    are followed. At each of these levels we may encounter situations
    where the message may need to be rejected, or a new message
    produced for transfer back to the sender. Despite recent
    improvements, the current code for doing this is perceived
    as awkward and hard to follow.

    Leveraging the two previous commits in this series, we now
    introduce a more uniform handling of such situations. We
    let each of the functions in the chain itself produce/reverse
    the message to be returned to the sender, but also perform the
    actual forwarding. This simplifies the necessary logics within
    each function.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

18 Mar, 2015

1 commit

  • When the TIPC module is loaded, we launch a topology server in kernel
    space, which in its turn is creating TIPC sockets for communication
    with topology server users. Because both the socket's creator and
    provider reside in the same module, it is necessary that the TIPC
    module's reference count remains zero after the server is started and
    the socket created; otherwise it becomes impossible to perform "rmmod"
    even on an idle module.

    Currently, we achieve this by defining a separate "tipc_proto_kern"
    protocol struct, that is used only for kernel space socket allocations.
    This structure has the "owner" field set to NULL, which restricts the
    module reference count from being be bumped when sk_alloc() for local
    sockets is called. Furthermore, we have defined three kernel-specific
    functions, tipc_sock_create_local(), tipc_sock_release_local() and
    tipc_sock_accept_local(), to avoid the module counter being modified
    when module local sockets are created or deleted. This has worked well
    until we introduced name space support.

    However, after name space support was introduced, we have observed that
    a reference count leak occurs, because the netns counter is not
    decremented in tipc_sock_delete_local().

    This commit remedies this problem. But instead of just modifying
    tipc_sock_delete_local(), we eliminate the whole parallel socket
    handling infrastructure, and start using the regular sk_create_kern(),
    kernel_accept() and sk_release_kernel() calls. Since those functions
    manipulate the module counter, we must now compensate for that by
    explicitly decrementing the counter after module local sockets are
    created, and increment it just before calling sk_release_kernel().

    Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace")
    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reported-by: Cong Wang
    Tested-by: Erik Hugne
    Signed-off-by: David S. Miller

    Ying Xue
     

10 Feb, 2015

1 commit

  • Convert socket (port) listing to compat dumpit call. If a socket
    (port) has publications a second dumpit call is issued to collect them
    and format then into the legacy buffer before continuing to process
    the sockets (ports).

    Command converted in this patch:
    TIPC_CMD_SHOW_PORTS

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     

06 Feb, 2015

3 commits

  • In a previous commit in this series we resolved a race problem during
    unicast message reception.

    Here, we resolve the same problem at multicast reception. We apply the
    same technique: an input queue serializing the delivery of arriving
    buffers. The main difference is that here we do it in two steps.
    First, the broadcast link feeds arriving buffers into the tail of an
    arrival queue, which head is consumed at the socket level, and where
    destination lookup is performed. Second, if the lookup is successful,
    the resulting buffer clones are fed into a second queue, the input
    queue. This queue is consumed at reception in the socket just like
    in the unicast case. Both queues are protected by the same lock, -the
    one of the input queue.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The structure 'tipc_port_list' is used to collect port numbers
    representing multicast destination socket on a receiving node.
    The list is not based on a standard linked list, and is in reality
    optimized for the uncommon case that there are more than one
    multicast destinations per node. This makes the list handling
    unecessarily complex, and as a consequence, even the socket
    multicast reception becomes more complex.

    In this commit, we replace 'tipc_port_list' with a new 'struct
    tipc_plist', which is based on a standard list. We give the new
    list stack (push/pop) semantics, someting that simplifies
    the implementation of the function tipc_sk_mcast_rcv().

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC handles message cardinality and sequencing at the link layer,
    before passing messages upwards to the destination sockets. During the
    upcall from link to socket no locks are held. It is therefore possible,
    and we see it happen occasionally, that messages arriving in different
    threads and delivered in sequence still bypass each other before they
    reach the destination socket. This must not happen, since it violates
    the sequentiality guarantee.

    We solve this by adding a new input buffer queue to the link structure.
    Arriving messages are added safely to the tail of that queue by the
    link, while the head of the queue is consumed, also safely, by the
    receiving socket. Sequentiality is secured per socket by only allowing
    buffers to be dequeued inside the socket lock. Since there may be multiple
    simultaneous readers of the queue, we use a 'filter' parameter to reduce
    the risk that they peek the same buffer from the queue, hence also
    reducing the risk of contention on the receiving socket locks.

    This solves the sequentiality problem, and seems to cause no measurable
    performance degradation.

    A nice side effect of this change is that lock handling in the functions
    tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
    will enable future simplifications of those functions.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

13 Jan, 2015

4 commits

  • TIPC establishes one subscriber server which allows users to subscribe
    their interesting name service status. After tipc supports namespace,
    one dedicated tipc stack instance is created for each namespace, and
    each instance can be deemed as one independent TIPC node. As a result,
    subscriber server must be built for each namespace.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Now tipc socket table is statically allocated as a global variable.
    Through it, we can look up one socket instance with port ID, insert
    a new socket instance to the table, and delete a socket from the
    table. But when tipc supports net namespace, each namespace must own
    its specific socket table. So the global variable of socket table
    must be redefined in tipc_net structure. As a concequence, a new
    socket table will be allocated when a new namespace is created, and
    a socket table will be deallocated when namespace is destroyed.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Global variables associated with node table are below:
    - node table list (node_htable)
    - node hash table list (tipc_node_list)
    - node table lock (node_list_lock)
    - node number counter (tipc_num_nodes)
    - node link number counter (tipc_num_links)

    To make node table support namespace, above global variables must be
    moved to tipc_net structure in order to keep secret for different
    namespaces. As a consequence, these variables are allocated and
    initialized when namespace is created, and deallocated when namespace
    is destroyed. After the change, functions associated with these
    variables have to utilize a namespace pointer to access them. So
    adding namespace pointer as a parameter of these functions is the
    major change made in the commit.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Only the works of initializing and shutting down tipc module are done
    in core.h and core.c files, so all stuffs which are not closely
    associated with the two tasks should be moved to appropriate places.

    Signed-off-by: Ying Xue
    Tested-by: Tero Aho
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     

09 Jan, 2015

1 commit

  • As tipc reference table is statically allocated, its memory size
    requested on stack initialization stage is quite big even if the
    maximum port number is just restricted to 8191 currently, however,
    the number already becomes insufficient in practice. But if the
    maximum ports is allowed to its theory value - 2^32, its consumed
    memory size will reach a ridiculously unacceptable value. Apart from
    this, heavy tipc users spend a considerable amount of time in
    tipc_sk_get() due to the read-lock on ref_table_lock.

    If tipc reference table is converted with generic rhashtable, above
    mentioned both disadvantages would be resolved respectively: making
    use of the new resizable hash table can avoid locking on the lookup;
    smaller memory size is required at initial stage, for example, 256
    hash bucket slots are requested at the beginning phase instead of
    allocating the entire 8191 slots in old mode. The hash table will
    grow if entries exceeds 75% of table size up to a total table size
    of 1M, and it will automatically shrink if usage falls below 30%,
    but the minimum table size is allowed down to 256.

    Also converts ref_table_lock to a separate mutex to protect hash table
    mutations on write side. Lastly defers the release of the socket
    reference using call_rcu() to allow using an RCU read-side protected
    call to rhashtable_lookup().

    Signed-off-by: Ying Xue
    Acked-by: Jon Maloy
    Acked-by: Erik Hugne
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Ying Xue
     

22 Nov, 2014

2 commits

  • Add TIPC_NL_PUBL_GET command to the new tipc netlink API.

    This command supports dumping of all publications for a specific
    socket.

    Netlink logical layout of request message:
    -> socket
    -> reference

    Netlink logical layout of response message:
    -> publication
    -> type
    -> lower
    -> upper

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Add TIPC_NL_SOCK_GET command to the new tipc netlink API.

    This command supports dumping of all available sockets with their
    associated connection or publication(s). It could be extended to reply
    with a single socket if the NLM_F_DUMP isn't set.

    The information about a socket includes reference, address, connection
    information / publication information.

    Netlink logical layout of response message:
    -> socket
    -> reference
    -> address
    [
    -> connection
    -> node
    -> socket
    [
    -> connected flag
    -> type
    -> instance
    ]
    ]
    [
    -> publication flag
    ]

    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     

24 Aug, 2014

5 commits

  • We complete the merging of the port and socket layer by aggregating
    the fields of struct tipc_port directly into struct tipc_sock, and
    moving the combined structure into socket.c.

    We also move all functions and macros that are not any longer
    exposed to the rest of the stack into socket.c, and rename them
    accordingly.

    Despite the size of this commit, there are no functional changes.
    We have only made such changes that are necessary due of the removal
    of struct tipc_port.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The reference table is now 'socket aware' instead of being generic,
    and has in reality become a socket internal table. In order to be
    able to minimize the API exposed by the socket layer towards the rest
    of the stack, we now move the reference table definitions and functions
    into the file socket.c, and rename the functions accordingly.

    There are no functional changes in this commit.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We move the inline functions in the file port.h to socket.c, and modify
    their names accordingly.

    We move struct tipc_port and some macros to socket.h.

    Finally, we remove the file port.h.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The functions tipc_port_get_ports() and tipc_port_reinit() scan over
    all sockets/ports to access each of them. This is done by using a
    dedicated linked list, 'tipc_socks' where all sockets are members. The
    list is in turn protected by a spinlock, 'port_list_lock', while each
    socket is locked by using port_lock at the moment of access.

    In order to reduce complexity and risk of deadlock, we want to get
    rid of the linked list and the accompanying spinlock.

    This is what we do in this commit. Instead of the linked list, we use
    the port registry to scan across the sockets. We also add usage of
    bh_lock_sock() inside the scope of port_lock in both functions, as a
    preparation for the complete removal of port_lock.

    Finally, we move the functions from port.c to socket.c, and rename them
    to tipc_sk_sock_show() and tipc_sk_reinit() repectively.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The current link implementation keeps a linked list of blocked ports/
    sockets that is populated when there is link congestion. The purpose
    of this is to let the link know which users to wake up when the
    congestion abates.

    This adds unnecessary complexity to the data structure and the code,
    since it forces us to involve the link each time we want to delete
    a socket. It also forces us to grab the spinlock port_lock within
    the scope of node_lock. We want to get rid of this direct dependence,
    as well as the deadlock hazard resulting from the usage of port_lock.

    In this commit, we instead let the link keep list of a "wakeup" pseudo
    messages for use in such situations. Those messages are sent to the
    pending sockets via the ordinary message reception path, and wake up
    the socket's owner when they are received.

    This enables us to get rid of the 'waiting_ports' linked lists in struct
    tipc_port that manifest this direct reference. As a consequence, we can
    eliminate another BH entry into the socket, and hence the need to grab
    port_lock. This is a further step in our effort to remove port_lock
    altogether.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

17 Jul, 2014

1 commit

  • We add a new broadcast link transmit function in bclink.c and a new
    receive function in socket.c. The purpose is to move the branching
    between external and internal destination down to the link layer,
    just as we have done with unicast in earlier commits. We also make
    use of the new link-independent fragmentation support that was
    introduced in an earlier commit series.

    This gives a shorter and simpler code path, and makes it possible
    to obtain copy-free buffer delivery to all node local destination
    sockets.

    The new transmission code is added in parallel with the existing one,
    and will be used by the socket multicast send function in the next
    commit in this series.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

28 Jun, 2014

2 commits

  • As a consequence of the recently introduced serialized access
    to the socket in commit 8d94168a761819d10252bab1f8de6d7b202c3baa
    ("tipc: same receive code path for connection protocol and data
    messages") we can make a number of simplifications in the
    detection and handling of connection congestion situations.

    - We don't need to keep two counters, one for sent messages and one
    for acked messages. There is no longer any risk for races between
    acknowledge messages arriving in BH and data message sending
    running in user context. So we merge this into one counter,
    'sent_unacked', which is incremented at sending and subtracted
    from at acknowledge reception.

    - We don't need to set the 'congested' field in tipc_port to
    true before we sent the message, and clear it when sending
    is successful. (As a matter of fact, it was never necessary;
    the field was set in link_schedule_port() before any wakeup
    could arrive anyway.)

    - We keep the conditions for link congestion and connection connection
    congestion separated. There would otherwise be a risk that an arriving
    acknowledge message may wake up a user sleeping because of link
    congestion.

    - We can simplify reception of acknowledge messages.

    We also make some cosmetic/structural changes:

    - We rename the 'congested' field to the more correct 'link_cong´.

    - We rename 'conn_unacked' to 'rcv_unacked'

    - We move the above mentioned fields from struct tipc_port to
    struct tipc_sock.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We simplify the code for receiving connection probes, leveraging the
    recently introduced tipc_msg_reverse() function. We also stick to
    the principle of sending a possible response message directly from
    the calling (tipc_sk_rcv or backlog_rcv) functions, hence making
    the call chain shallower and easier to follow.

    We make one small protocol change here, allowed according to
    the spec. If a protocol message arrives from a remote socket that
    is not the one we are connected to, we are currently generating a
    connection abort message and send it to the source. This behavior
    is unnecessary, and might even be a security risk, so instead we
    now choose to only ignore the message. The consequnce for the sender
    is that he will need longer time to discover his mistake (until the
    next timeout), but this is an extreme corner case, and may happen
    anyway under other circumstances, so we deem this change acceptable.

    Signed-off-by: Jon Maloy
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

15 May, 2014

2 commits

  • In order to reduce complexity and save a call level during message
    reception at port/socket level, we remove the function tipc_port_rcv()
    and merge its functionality into tipc_sk_rcv().

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The function net/core/sock.c::__release_sock() runs a tight loop
    to move buffers from the socket backlog queue to the receive queue.

    As a security measure, sk_backlog.len of the receiving socket
    is not set to zero until after the loop is finished, i.e., until
    the whole backlog queue has been transferred to the receive queue.
    During this transfer, the data that has already been moved is counted
    both in the backlog queue and the receive queue, hence giving an
    incorrect picture of the available queue space for new arriving buffers.

    This leads to unnecessary rejection of buffers by sk_add_backlog(),
    which in TIPC leads to unnecessarily broken connections.

    In this commit, we compensate for this double accounting by adding
    a counter that keeps track of it. The function socket.c::backlog_rcv()
    receives buffers one by one from __release_sock(), and adds them to the
    socket receive queue. If the transfer is successful, it increases a new
    atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
    transferred buffer. If a new buffer arrives during this transfer and
    finds the socket busy (owned), we attempt to add it to the backlog.
    However, when sk_add_backlog() is called, we adjust the 'limit'
    parameter with the value of the new counter, so that the risk of
    inadvertent rejection is eliminated.

    It should be noted that this change does not invalidate the original
    purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
    upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
    doesn't respect the send window) keeps pumping in buffers to
    sk_add_backlog(), he will eventually reach an upper limit,
    (2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
    to the backlog, and the connection will be broken. Ordinary, well-
    behaved senders will never reach this buffer limit at all.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

13 Mar, 2014

3 commits

  • The practice of naming variables in TIPC is inconistent, sometimes
    even within the same file.

    In this commit we align variable names and declarations within
    socket.c, and function and macro names within socket.h. We also
    reduce the number of conversion macros to two, in order to make
    usage less obsure.

    These changes are purely cosmetic.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Due to the original one-to-many relation between port and user API
    layers, upcalls to the API have been performed via function pointers,
    installed in struct tipc_port at creation. Since this relation now
    always is one-to-one, we can instead use ordinary function calls.

    We remove the function pointers 'dispatcher' and ´wakeup' from
    struct tipc_port, and replace them with calls to the renamed
    functions tipc_sk_rcv() and tipc_sk_wakeup().

    At the same time we change the name and signature of the functions
    tipc_createport() and tipc_deleteport() to reflect their new role
    as mere initialization/destruction functions.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • After the removal of the tipc native API the relation between
    a tipc_port and its API types is strictly one-to-one, i.e, the
    latter can now only be a socket API. There is therefore no need
    to allocate struct tipc_port and struct sock independently.

    In this commit, we aggregate struct tipc_port into struct tipc_sock,
    hence saving both CPU cycles and structure complexity.

    There are no functional changes in this commit, except for the
    elimination of the separate allocation/freeing of tipc_port.
    All other changes are just adaptatons to the new data structure.

    This commit also opens up for further code simplifications and
    code volume reduction, something we will do in later commits.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: David S. Miller

    Jon Paul Maloy