02 Oct, 2019

1 commit

  • We have identified a problem with the "oversubscription" policy in the
    link transmission code.

    When small messages are transmitted, and the sending link has reached
    the transmit window limit, those messages will be bundled and put into
    the link backlog queue. However, bundles of data messages are counted
    at the 'CRITICAL' level, so that the counter for that level, instead of
    the counter for the real, bundled message's level is the one being
    increased.
    Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
    to be tested against the unchanged counter for their own level, while
    contributing to an unrestrained increase at the CRITICAL backlog level.

    This leaves a gap in congestion control algorithm for small messages
    that can result in starvation for other users or a "real" CRITICAL
    user. Even that eventually can lead to buffer exhaustion & link reset.

    We fix this by keeping a 'target_bskb' buffer pointer at each levels,
    then when bundling, we only bundle messages at the same importance
    level only. This way, we know exactly how many slots a certain level
    have occupied in the queue, so can manage level congestion accurately.

    By bundling messages at the same level, we even have more benefits. Let
    consider this:
    - One socket sends 64-byte messages at the 'CRITICAL' level;
    - Another sends 4096-byte messages at the 'LOW' level;

    When a 64-byte message comes and is bundled the first time, we put the
    overhead of message bundle to it (+ 40-byte header, data copy, etc.)
    for later use, but the next message can be a 4096-byte one that cannot
    be bundled to the previous one. This means the last bundle carries only
    one payload message which is totally inefficient, as for the receiver
    also! Later on, another 64-byte message comes, now we make a new bundle
    and the same story repeats...

    With the new bundling algorithm, this will not happen, the 64-byte
    messages will be bundled together even when the 4096-byte message(s)
    comes in between. However, if the 4096-byte messages are sent at the
    same level i.e. 'CRITICAL', the bundling algorithm will again cause the
    same overhead.

    Also, the same will happen even with only one socket sending small
    messages at a rate close to the link transmit's one, so that, when one
    message is bundled, it's transmitted shortly. Then, another message
    comes, a new bundle is created and so on...

    We will solve this issue radically by another patch.

    Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Reported-by: Hoang Le
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

26 Jul, 2019

1 commit

  • In conjunction with changing the interfaces' MTU (e.g. especially in
    the case of a bonding) where the TIPC links are brought up and down
    in a short time, a couple of issues were detected with the current link
    changeover mechanism:

    1) When one link is up but immediately forced down again, the failover
    procedure will be carried out in order to failover all the messages in
    the link's transmq queue onto the other working link. The link and node
    state is also set to FAILINGOVER as part of the process. The message
    will be transmited in form of a FAILOVER_MSG, so its size is plus of 40
    bytes (= the message header size). There is no problem if the original
    message size is not larger than the link's MTU - 40, and indeed this is
    the max size of a normal payload messages. However, in the situation
    above, because the link has just been up, the messages in the link's
    transmq are almost SYNCH_MSGs which had been generated by the link
    synching procedure, then their size might reach the max value already!
    When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size
    will exceed the link's MTU. As a result, the messages are dropped
    silently and the failover procedure will never end up, the link will
    not be able to exit the FAILINGOVER state, so cannot be re-established.

    2) The same scenario above can happen more easily in case the MTU of
    the links is set differently or when changing. In that case, as long as
    a large message in the failure link's transmq queue was built and
    fragmented with its link's MTU > the other link's one, the issue will
    happen (there is no need of a link synching in advance).

    3) The link synching procedure also faces with the same issue but since
    the link synching is only started upon receipt of a SYNCH_MSG, dropping
    the message will not result in a state deadlock, but it is not expected
    as design.

    The 1) & 3) issues are resolved by the last commit that only a dummy
    SYNCH_MSG (i.e. without data) is generated at the link synching, so the
    size of a FAILOVER_MSG if any then will never exceed the link's MTU.

    For the 2) issue, the only solution is trying to fragment the messages
    in the failure link's transmq queue according to the working link's MTU
    so they can be failovered then. A new function is made to accomplish
    this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the
    original message size is too large, it will be fragmented & reassembled
    at the receiving side.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

30 Sep, 2018

2 commits

  • Default socket receive buffer size for a listener socket is 2Mb. For
    each arriving empty SYN, the linux kernel allocates a 768 bytes buffer.
    This means that a listener socket can serve maximum 2700 simultaneous
    empty connection setup requests before it hits a receive buffer
    overflow, and much fewer if the SYN is carrying any significant
    amount of data.

    When this happens the setup request is rejected, and the client
    receives an ECONNREFUSED error.

    This commit mitigates this problem by letting the client socket try to
    retransmit the SYN message multiple times when it sees it rejected with
    the code TIPC_ERR_OVERLOAD. Retransmission is done at random intervals
    in the range of [100 ms, setup_timeout / 4], as many times as there is
    room for within the setup timeout limit.

    Signed-off-by: Tung Nguyen
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     
  • The function tipc_msg_reverse() is reversing the header of a message
    while reusing the original buffer. We have seen at several occasions
    that this may have unfortunate side effects when the buffer to be
    reversed is a clone.

    In one of the following commits we will again need to reverse cloned
    buffers, so this is the right time to permanently eliminate this
    problem. In this commit we let the said function always consume the
    original buffer and replace it with a new one when applicable.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

30 Jun, 2018

1 commit

  • The function tipc_msg_extract() is using skb_clone() to clone inner
    messages from a message bundle buffer. Although this method is safe,
    it has an undesired effect that each buffer clone inherits the
    true-size of the bundling buffer. As a result, the buffer clone
    almost always ends up with being copied anyway by the message
    validation function. This makes the cloning into a sub-optimization.

    In this commit we take the consequence of this realization, and copy
    each inner message to a separately allocated buffer up front in the
    extraction function.

    As a bonus we can now eliminate the two cases where we had to copy
    re-routed packets that may potentially go out on the wire again.

    Signed-off-by: Tung Nguyen
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     

18 Mar, 2018

1 commit

  • Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
    aspects handled the same way, both on the publishing node and on the
    receiving nodes.

    Despite previous ambitions to the contrary, this is never going to change,
    so we take the conseqeunce of this and obsolete TIPC_ZONE_SCOPE and related
    macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
    using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
    remain compatible with users and remote nodes still using ZONE_SCOPE.

    Furthermore, the non-formalized scope value 0 has always been permitted
    for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
    We now permit it even as binding scope, but for compatibility reasons we
    choose to not change the value of TIPC_CLUSTER_SCOPE.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Feb, 2018

1 commit

  • In commit d618d09a68e4 ("tipc: enforce valid ratio between skb truesize
    and contents") we introduced a test for ensuring that the condition
    truesize/datasize truesize / buf_roundup_len(skb) > 4) will miss all
    ratios [4 < ratio < 5], which was not the intention.
    - The buffer returned by skb_copy() inherits skb->truesize of the
    original buffer, which doesn't help the situation at all.

    In this commit, we change the ratio condition and replace skb_copy()
    with a call to skb_copy_expand() to finally get this right.

    Acked-by: Jon Maloy
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Le
     

02 Dec, 2017

1 commit

  • When sending node local messages the code is using an 'mtu' of 66060
    bytes to avoid unnecessary fragmentation. During situations of low
    memory tipc_msg_build() may sometimes fail to allocate such large
    buffers, resulting in unnecessary send failures. This can easily be
    remedied by falling back to a smaller MTU, and then reassemble the
    buffer chain as if the message were arriving from a remote node.

    At the same time, we change the initial MTU setting of the broadcast
    link to a lower value, so that large messages always are fragmented
    into smaller buffers even when we run in single node mode. Apart from
    obtaining the same advantage as for the 'fallback' solution above, this
    turns out to give a significant performance improvement. This can
    probably be explained with the __pskb_copy() operation performed on the
    buffer for each recipient during reception. We found the optimal value
    for this, considering the most relevant skb pool, to be 3744 bytes.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

16 Nov, 2017

1 commit


13 Oct, 2017

1 commit

  • In the following commits we will need to handle multiple incoming and
    rejected/returned buffers in the function socket.c::filter_rcv().
    As a preparation for this, we generalize the function by handling
    buffer queues instead of individual buffers. We also introduce a
    help function tipc_skb_reject(), and rename filter_rcv() to
    tipc_sk_filter_rcv() in line with other functions in socket.c.

    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Oct, 2017

1 commit

  • When a bundling message is received, the function tipc_link_input()
    calls function tipc_msg_extract() to unbundle all inner messages of
    the bundling message before adding them to input queue.

    The function tipc_msg_extract() just clones all inner skb for all
    inner messagges from the bundling skb. This means that the skb
    headroom of an inner message overlaps with the data part of the
    preceding message in the bundle.

    If the message in question is a name addressed message, it may be
    subject to a secondary destination lookup, and eventually be sent out
    on one of the interfaces again. But, since what is perceived as headroom
    by the device driver in reality is the last bytes of the preceding
    message in the bundle, the latter will be overwritten by the MAC
    addresses of the L2 header. If the preceding message has not yet been
    consumed by the user, it will evenually be delivered with corrupted
    contents.

    This commit fixes this by uncloning all messages passing through the
    function tipc_msg_lookup_dest(), hence ensuring that the headroom
    is always valid when the message is passed on.

    Signed-off-by: Tung Nguyen
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

01 Oct, 2017

1 commit

  • In commit e3a77561e7d32 ("tipc: split up function tipc_msg_eval()"),
    we have updated the function tipc_msg_lookup_dest() to set the error
    codes to negative values at destination lookup failures. Thus when
    the function sets the error code to -TIPC_ERR_NO_NAME, its inserted
    into the 4 bit error field of the message header as 0xf instead of
    TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

    In this commit, we set only positive error code.

    Fixes: e3a77561e7d32 ("tipc: split up function tipc_msg_eval()")
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

25 Aug, 2017

1 commit

  • In tipc_msg_reverse(), we assign skb attributes to local pointers
    in stack at startup. This is followed by skb_linearize() and for
    cloned buffers we perform skb relocation using pskb_expand_head().
    Both these methods may update the skb attributes and thus making
    the pointers incorrect.

    In this commit, we fix this error by ensuring that the pointers
    are re-assigned after any of these skb operations.

    Fixes: 29042e19f2c60 ("tipc: let function tipc_msg_reverse() expand header
    when needed")
    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

15 Aug, 2017

1 commit

  • In the function msg_reverse(), we reverse the header while trying to
    reuse the original buffer whenever possible. Those rejected/returned
    messages are always transmitted as unicast, but the msg_non_seq field
    is not explicitly set to zero as it should be.

    We have seen cases where multicast senders set the message type to
    "NOT dest_droppable", meaning that a multicast message shorter than
    one MTU will be returned, e.g., during receive buffer overflow, by
    reusing the original buffer. This has the effect that even the
    'msg_non_seq' field is inadvertently inherited by the rejected message,
    although it is now sent as a unicast message. This again leads the
    receiving unicast link endpoint to steer the packet toward the broadcast
    link receive function, where it is dropped. The affected unicast link is
    thereafter (after 100 failed retransmissions) declared 'stale' and
    reset.

    We fix this by unconditionally setting the 'msg_non_seq' flag to zero
    for all rejected/returned messages.

    Reported-by: Canh Duc Luu
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

11 Jun, 2017

1 commit

  • The kernel may sleep under a rcu read lock in tipc_msg_reverse, and the
    function call path is:
    tipc_l2_rcv_msg (acquire the lock by rcu_read_lock)
    tipc_rcv
    tipc_sk_rcv
    tipc_msg_reverse
    pskb_expand_head(GFP_KERNEL) --> may sleep
    tipc_node_broadcast
    tipc_node_xmit_skb
    tipc_node_xmit
    tipc_sk_rcv
    tipc_msg_reverse
    pskb_expand_head(GFP_KERNEL) --> may sleep

    To fix it, "GFP_KERNEL" is replaced with "GFP_ATOMIC".

    Signed-off-by: Jia-Ju Bai
    Signed-off-by: David S. Miller

    Jia-Ju Bai
     

21 Jan, 2017

1 commit

  • TIPC multicast messages are currently carried over a reliable
    'broadcast link', making use of the underlying media's ability to
    transport packets as L2 broadcast or IP multicast to all nodes in
    the cluster.

    When the used bearer is lacking that ability, we can instead emulate
    the broadcast service by replicating and sending the packets over as
    many unicast links as needed to reach all identified destinations.
    We now introduce a new TIPC link-level 'replicast' service that does
    this.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

17 Jan, 2017

1 commit

  • Until now, we allocate memory always with GFP_ATOMIC flag.
    When the system is under memory pressure and a user tries to send,
    the send fails due to low memory. However, the user application
    can wait for free memory if we allocate it using GFP_KERNEL flag.

    In this commit, we use allocate memory with GFP_KERNEL for all user
    allocation.

    Reported-by: Rune Torgersen
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

06 Dec, 2016

1 commit

  • copy_from_iter_full(), copy_from_iter_full_nocache() and
    csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
    et.al., advancing iterator only in case of successful full copy
    and returning whether it had been successful or not.

    Convert some obvious users. *NOTE* - do not blindly assume that
    something is a good candidate for those unless you are sure that
    not advancing iov_iter in failure case is the right thing in
    this case. Anything that does short read/short write kind of
    stuff (or is in a loop, etc.) is unlikely to be a good one.

    Signed-off-by: Al Viro

    Al Viro
     

23 Jun, 2016

1 commit

  • When extracting an individual message from a received "bundle" buffer,
    we just create a clone of the base buffer, and adjust it to point into
    the right position of the linearized data area of the latter. This works
    well for regular message reception, but during periods of extremely high
    load it may happen that an extracted buffer, e.g, a connection probe, is
    reversed and forwarded through an external interface while the preceding
    extracted message is still unhandled. When this happens, the header or
    data area of the preceding message will be partially overwritten by a
    MAC header, leading to unpredicatable consequences, such as a link
    reset.

    We now fix this by ensuring that the msg_reverse() function never
    returns a cloned buffer, and that the returned buffer always contains
    sufficient valid head and tail room to be forwarded.

    Reported-by: Erik Hugne
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

24 Oct, 2015

3 commits

  • The code path for receiving broadcast packets is currently distinct
    from the unicast path. This leads to unnecessary code and data
    duplication, something that can be avoided with some effort.

    We now introduce separate per-peer tipc_link instances for handling
    broadcast packet reception. Each receive link keeps a pointer to the
    common, single, broadcast link instance, and can hence handle release
    and retransmission of send buffers as if they belonged to the own
    instance.

    Furthermore, we let each unicast link instance keep a reference to both
    the pertaining broadcast receive link, and to the common send link.
    This makes it possible for the unicast links to easily access data for
    broadcast link synchronization, as well as for carrying acknowledges for
    received broadcast packets.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • This commit simplifies the broadcast link transmission function, by
    leveraging previous changes to the link transmission function and the
    broadcast transmission link life cycle.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Conflicts:
    net/ipv6/xfrm6_output.c
    net/openvswitch/flow_netlink.c
    net/openvswitch/vport-gre.c
    net/openvswitch/vport-vxlan.c
    net/openvswitch/vport.c
    net/openvswitch/vport.h

    The openvswitch conflicts were overlapping changes. One was
    the egress tunnel info fix in 'net' and the other was the
    vport ->send() op simplification in 'net-next'.

    The xfrm6_output.c conflicts was also a simplification
    overlapping a bug fix.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Oct, 2015

1 commit

  • The current code for message reassembly is erroneously assuming that
    the the first arriving fragment buffer always is linear, and then goes
    ahead resetting the fragment list of that buffer in anticipation of
    more arriving fragments.

    However, if the buffer already happens to be non-linear, we will
    inadvertently drop the already attached fragment list, and later
    on trig a BUG() in __pskb_pull_tail().

    We see this happen when running fragmented TIPC multicast across UDP,
    something made possible since
    commit d0f91938bede ("tipc: add ip/udp media type")

    We fix this by not resetting the fragment list when the buffer is non-
    linear, and by initiatlizing our private fragment list tail pointer to
    the tail of the existing fragment list.

    Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

16 Oct, 2015

1 commit

  • After the previous commits, we are guaranteed that no packets
    of type LINK_PROTOCOL or with illegal sequence numbers will be
    attempted added to the link deferred queue. This makes it possible to
    make some simplifications to the sorting algorithm in the function
    tipc_skb_queue_sorted().

    We also alter the function so that it will drop packets if one with
    the same seqeunce number is already present in the queue. This is
    necessary because we have identified weird packet sequences, involving
    duplicate packets, where a legitimate in-sequence packet may advance to
    the head of the queue without being detected and de-queued.

    Finally, we make this function outline, since it will now be called only
    in exceptional cases.

    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

21 Sep, 2015

1 commit

  • The msg pointer into header may change after skb linearization.
    We must reinitialize it after calling skb_linearize to prevent
    operating on a freed or invalid pointer.

    Signed-off-by: Erik Hugne
    Reported-by: Tamás Végh
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Erik Hugne
     

27 Jul, 2015

3 commits

  • When a message is received in a socket, one of the call chains
    tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
    or
    tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
    are followed. At each of these levels we may encounter situations
    where the message may need to be rejected, or a new message
    produced for transfer back to the sender. Despite recent
    improvements, the current code for doing this is perceived
    as awkward and hard to follow.

    Leveraging the two previous commits in this series, we now
    introduce a more uniform handling of such situations. We
    let each of the functions in the chain itself produce/reverse
    the message to be returned to the sender, but also perform the
    actual forwarding. This simplifies the necessary logics within
    each function.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Currently, we use the code sequence

    if (msg_reverse())
    tipc_link_xmit_skb()

    at numerous locations in socket.c. The preparation of arguments
    for these calls, as well as the sequence itself, makes the code
    unecessarily complex.

    In this commit, we introduce a new function, tipc_sk_respond(),
    that performs this call combination. We also replace some, but not
    yet all, of these explicit call sequences with calls to the new
    function. Notably, we let the function tipc_sk_proto_rcv() use
    the new function to directly send out PROBE_REPLY messages,
    instead of deferring this to the calling tipc_sk_rcv() function,
    as we do now.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The shortest TIPC message header, for cluster local CONNECTED messages,
    is 24 bytes long. With this format, the fields "dest_node" and
    "orig_node" are optimized away, since they in reality are redundant
    in this particular case.

    However, the absence of these fields leads to code inconsistencies
    that are difficult to handle in some cases, especially when we need
    to reverse or reject messages at the socket layer.

    In this commit, we concentrate the handling of the absent fields
    to one place, by letting the function tipc_msg_reverse() reallocate
    the buffer and expand the header to 32 bytes when necessary. This
    means that the socket code now can assume that the two previously
    absent fields are present in the header when a message needs to be
    rejected. This opens up for some further simplifications of the
    socket code.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

15 May, 2015

2 commits

  • Currently, the packet sequence number is updated and added to each
    packet at the moment a packet is added to the link backlog queue.
    This is wasteful, since it forces the code to traverse the send
    packet list packet by packet when adding them to the backlog queue.
    It would be better to just splice the whole packet list into the
    backlog queue when that is the right action to do.

    In this commit, we do this change. Also, since the sequence numbers
    cannot now be assigned to the packets at the moment they are added
    the backlog queue, we do instead calculate and add them at the moment
    of transmission, when the backlog queue has to be traversed anyway.
    We do this in the function tipc_link_push_packet().

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The link congestion algorithm used until now implies two problems.

    - It is too generous towards lower-level messages in situations of high
    load by giving "absolute" bandwidth guarantees to the different
    priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
    20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
    available bandwidth. But, in the absence of higher level traffic, the
    ratio between two distinct levels becomes unreasonable. E.g. if there
    is only LOW and MEDIUM traffic on a system, the former is guaranteed
    1/3 of the bandwidth, and the latter 2/3. This again means that if
    there is e.g. one LOW user and 10 MEDIUM users, the former will have
    33.3% of the bandwidth, and the others will have to compete for the
    remainder, i.e. each will end up with 6.7% of the capacity.

    - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
    but only after the packets bundled into it have passed the congestion
    test for their own respective levels. Since bundled packets don't
    result in incrementing the level counter for their own importance,
    only occasionally for the SYSTEM level counter, they do in practice
    obtain SYSTEM level importance. Hence, the current implementation
    provides a gap in the congestion algorithm that in the worst case
    may lead to a link reset.

    We now refine the congestion algorithm as follows:

    - A message is accepted to the link backlog only if its own level
    counter, and all superior level counters, permit it.

    - The importance of a created bundle packet is set according to its
    contents. A bundle packet created from messges at levels LOW to
    CRITICAL is given importance level CRITICAL, while a bundle created
    from a SYSTEM level message is given importance SYSTEM. In the latter
    case only subsequent SYSTEM level messages are allowed to be bundled
    into it.

    This solves the first problem described above, by making the bandwidth
    guarantee relative to the total number of users at all levels; only
    the upper limit for each level remains absolute. In the example
    described above, the single LOW user would use 1/11th of the bandwidth,
    the same as each of the ten MEDIUM users, but he still has the same
    guarantee against starvation as the latter ones.

    The fix also solves the second problem. If the CRITICAL level is filled
    up by bundle packets of that level, no lower level packets will be
    accepted any more.

    Suggested-by: Gergely Kiss
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

03 Apr, 2015

1 commit

  • When a bearer is disabled manually, all its links have to be reset
    and deleted. However, if there is a remaining, parallel link ready
    to take over a deleted link's traffic, we currently delay the delete
    of the removed link until the failover procedure is finished. This
    is because the remaining link needs to access state from the reset
    link, such as the last received packet number, and any partially
    reassembled buffer, in order to perform a successful failover.

    In this commit, we do instead move the state data over to the new
    link, so that it can fulfill the procedure autonomously, without
    accessing any data on the old link. This means that we can now
    proceed and delete all pertaining links immediately when a bearer
    is disabled. This saves us from some unnecessary complexity in such
    situations.

    We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
    ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
    FAILOVER_MSG and SYNCH_MSG respectively.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

30 Mar, 2015

1 commit

  • A message sent to a node after a successful name table lookup may still
    find that the destination socket has disappeared, because distribution
    of name table updates is non-atomic. If so, the message will be rejected
    back to the sender with error code TIPC_ERR_NO_PORT. If the source
    socket of the message has disappeared in the meantime, the message
    should be dropped.

    However, in the currrent code, the message will instead be subject to an
    unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
    doesn't check if there is an error code present in the message before
    performing the lookup. In the worst case, the message may now find the
    old destination again, and be redirected once more, instead of being
    dropped directly as it should be.

    A second bug in this function is that the "prev_node" field in the message
    is not updated after successful lookup, something that may have
    unpredictable consequences.

    The problems arising from those bugs occur very infrequently.

    The third change in this function; the test on msg_reroute_msg_cnt() is
    purely cosmetic, reflecting that the returned value never can be negative.

    This commit corrects the two bugs described above.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

15 Mar, 2015

5 commits

  • Messages transferred by TIPC are assigned an "importance priority", -an
    integer value indicating how to treat the message when there is link or
    destination socket congestion.

    There is no separate header field for this value. Instead, the message
    user values have been chosen in ascending order according to perceived
    importance, so that the message user field can be used for this.

    This is not a good solution. First, we have many more users than the
    needed priority levels, so we end up with treating more priority
    levels than necessary. Second, the user field cannot always
    accurately reflect the priority of the message. E.g., a message
    fragment packet should really have the priority of the enveloped
    user data message, and not the priority of the MSG_FRAGMENTER user.
    Until now, we have been working around this problem in different ways,
    but it is now time to implement a consistent way of handling such
    priorities, although still within the constraint that we cannot
    allocate any more bits in the regular data message header for this.

    In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
    that will be the only one used apart from the four (lower) user data
    levels. All non-data messages map down to this priority. Furthermore,
    we take some free bits from the MSG_FRAGMENTER header and allocate
    them to store the priority of the enveloped message. We then adjust
    the functions msg_importance()/msg_set_importance() so that they
    read/set the correct header fields depending on user type.

    This small protocol change is fully compatible, because the code at
    the receiving end of a link currently reads the importance level
    only from user data messages, where there is no change.

    Reviewed-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • struct tipc_link contains one single queue for outgoing packets,
    where both transmitted and waiting packets are queued.

    This infrastructure is hard to maintain, because we need
    to keep a number of fields to keep track of which packets are
    sent or unsent, and the number of packets in each category.

    A lot of code becomes simpler if we split this queue into a transmission
    queue, where sent/unacknowledged packets are kept, and a backlog queue,
    where we keep the not yet sent packets.

    In this commit we do this separation.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When we currently extract a bundled buffer from a message bundle in
    the function tipc_msg_extract(), we allocate a new buffer and explicitly
    copy the linear data area.

    This is unnecessary, since we can just clone the buffer and do
    skb_pull() on the clone to move the data pointer to the correct
    position.

    This is what we do in this commit.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Currently, TIPC linearizes all incoming buffers directly at reception
    before passing them upwards in the stack. This is clearly a waste of
    CPU resources, and must be avoided.

    In this commit, we eliminate this unnecessary linearization. We still
    ensure that at least the message header is linear, and that the buffer
    is linearized where this is still needed, i.e. when unbundling and when
    reversing messages.

    In addition, we ensure that fragmented messages are validated after
    reassembly before delivering them upwards in the stack.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The function link_buf_validate() is in reality re-entrant and context
    independent, and will in later commits be called from several locations.
    Therefore, we move it to msg.c, make it outline and rename the it to
    tipc_msg_validate().

    We also redesign the function to make proper use of pskb_may_pull()

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

06 Feb, 2015

3 commits

  • TIPC handles message cardinality and sequencing at the link layer,
    before passing messages upwards to the destination sockets. During the
    upcall from link to socket no locks are held. It is therefore possible,
    and we see it happen occasionally, that messages arriving in different
    threads and delivered in sequence still bypass each other before they
    reach the destination socket. This must not happen, since it violates
    the sequentiality guarantee.

    We solve this by adding a new input buffer queue to the link structure.
    Arriving messages are added safely to the tail of that queue by the
    link, while the head of the queue is consumed, also safely, by the
    receiving socket. Sequentiality is secured per socket by only allowing
    buffers to be dequeued inside the socket lock. Since there may be multiple
    simultaneous readers of the queue, we use a 'filter' parameter to reduce
    the risk that they peek the same buffer from the queue, hence also
    reducing the risk of contention on the receiving socket locks.

    This solves the sequentiality problem, and seems to cause no measurable
    performance degradation.

    A nice side effect of this change is that lock handling in the functions
    tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
    will enable future simplifications of those functions.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The function tipc_msg_eval() is in reality doing two related, but
    different tasks. First it tries to find a new destination for named
    messages, in case there was no first lookup, or if the first lookup
    failed. Second, it does what its name suggests, evaluating the validity
    of the message and its destination, and returning an appropriate error
    code depending on the result.

    This is confusing, and in this commit we choose to break it up into two
    functions. A new function, tipc_msg_lookup_dest(), first attempts to find
    a new destination, if the message is of the right type. If this lookup
    fails, or if the message should not be subject to a second lookup, the
    already existing tipc_msg_reverse() is called. This function performs
    prepares the message for rejection, if applicable.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The most common usage of namespace information is when we fetch the
    own node addess from the net structure. This leads to a lot of
    passing around of a parameter of type 'struct net *' between
    functions just to make them able to obtain this address.

    However, in many cases this is unnecessary. The own node address
    is readily available as a member of both struct tipc_sock and
    tipc_link, and can be fetched from there instead.
    The fact that the vast majority of functions in socket.c and link.c
    anyway are maintaining a pointer to their respective base structures
    makes this option even more compelling.

    In this commit, we introduce the inline functions tsk_own_node()
    and link_own_node() to make it easy for functions to fetch the node
    address from those structs instead of having to pass along and
    dereference the namespace struct.

    In particular, we make calls to the msg_xx() functions in msg.{h,c}
    context independent by directly passing them the own node address
    as parameter when needed. Those functions should be regarded as
    leaves in the code dependency tree, and it is hence desirable to
    keep them namspace unaware.

    Apart from a potential positive effect on cache behavior, these
    changes make it easier to introduce the changes that will follow
    later in this series.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy