22 Aug, 2017

1 commit

  • When the broadcast send link after 100 attempts has failed to
    transfer a packet to all peers, we consider it stale, and reset
    it. Thereafter it needs to re-synchronize with the peers, something
    currently done by just resetting and re-establishing all links to
    all peers. This has turned out to be overkill, with potentially
    unwanted consequences for the remaining cluster.

    A closer analysis reveals that this can be done much more simply. When
    this kind of failure happens, for reasons that may lie outside the
    TIPC protocol, it is typically only one peer which is failing to
    receive and acknowledge packets. It is hence sufficient to identify
    and reset the links only to that peer to resolve the situation, without
    having to reset the broadcast link at all. This solution entails a much
    lower risk of negative consequences for the own node as well as for
    the overall cluster.

    We implement this change in this commit.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
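
The peer selection described above can be sketched as follows. This is a minimal model, not the kernel code: the function names, the dictionary of per-peer acknowledge values and the 16-bit wraparound comparison are assumptions made for illustration.

```python
def less_u16(a, b):
    """True if 16-bit sequence number a precedes b, allowing for wraparound."""
    return 0 < ((b - a) & 0xFFFF) < 0x8000


def peers_to_reset(stale_seqno, acked_by_peer):
    """Return the peers whose last broadcast acknowledge is behind stale_seqno.

    acked_by_peer maps a (hypothetical) peer name to the last broadcast
    sequence number that peer has acknowledged.
    """
    return sorted(peer for peer, acked in acked_by_peer.items()
                  if less_u16(acked, stale_seqno))


# Only peer "B" has failed to acknowledge packet 100, so only the links
# to "B" would need a reset; the broadcast link itself is left alone.
failing = peers_to_reset(100, {"A": 100, "B": 99, "C": 120})
```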
     

14 Apr, 2017

1 commit


21 Jan, 2017

2 commits

  • If the bearer carrying multicast messages supports broadcast, those
    messages will be sent to all cluster nodes, irrespective of whether
    these nodes host any actual destination sockets or not. This is clearly
    wasteful if the cluster is large and there are only a few real
    destinations for the message being sent.

    In this commit we extend the eligibility of the newly introduced
    "replicast" transmit option. We now make it possible for a user to
    select which method he wants to be used, either as a mandatory setting
    via setsockopt(), or as a relative setting where we let the broadcast
    layer decide which method to use based on the ratio between cluster
    size and the message's actual number of destination nodes.

    In the latter case, a sending socket must stick to a previously
    selected method until it enters an idle period of at least 5 seconds.
    This eliminates the risk of message reordering caused by method change,
    i.e., when changes to cluster size or number of destinations would
    otherwise mandate a new method to be used.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
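
The method selection described in the commit above can be modelled roughly as below. This is only a sketch under stated assumptions: the class name, the method names and the ratio threshold of 0.25 are hypothetical; only the 5 second idle period comes from the commit text.

```python
IDLE_PERIOD_S = 5.0  # idle time after which the method may change (from the text)


class McastMethod:
    """Toy model of per-socket selection between broadcast and replicast."""

    def __init__(self, mandatory=None, ratio_threshold=0.25):
        self.mandatory = mandatory              # "broadcast", "replicast" or None
        self.ratio_threshold = ratio_threshold  # hypothetical threshold value
        self.current = None                     # method used for the last send
        self.last_send = None                   # time of the last send, in seconds

    def select(self, now, cluster_size, dest_nodes):
        if self.mandatory is not None:
            # A mandatory setting always wins.
            self.current = self.mandatory
        else:
            idle = (self.last_send is None
                    or now - self.last_send >= IDLE_PERIOD_S)
            # Re-evaluate only after an idle period, to avoid reordering.
            if self.current is None or idle:
                ratio = dest_nodes / cluster_size
                self.current = ("replicast" if ratio < self.ratio_threshold
                                else "broadcast")
        self.last_send = now
        return self.current


m = McastMethod()
first = m.select(0.0, cluster_size=100, dest_nodes=5)    # few destinations
second = m.select(1.0, cluster_size=100, dest_nodes=90)  # not idle: method kept
third = m.select(7.0, cluster_size=100, dest_nodes=90)   # idle for 6 s: re-evaluated
```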
     
  • TIPC multicast messages are currently carried over a reliable
    'broadcast link', making use of the underlying media's ability to
    transport packets as L2 broadcast or IP multicast to all nodes in
    the cluster.

    When the used bearer is lacking that ability, we can instead emulate
    the broadcast service by replicating and sending the packets over as
    many unicast links as needed to reach all identified destinations.
    We now introduce a new TIPC link-level 'replicast' service that does
    this.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

18 Jan, 2017

1 commit


17 Jan, 2017

1 commit

  • Until now, we have always allocated memory with the GFP_ATOMIC flag.
    When the system is under memory pressure and a user tries to send,
    the send fails due to low memory. However, the user application
    can wait for free memory if we allocate it using the GFP_KERNEL flag.

    In this commit, we allocate memory with GFP_KERNEL for all user
    allocations.

    Reported-by: Rune Torgersen
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

04 Jan, 2017

1 commit

  • The socket code currently handles link congestion by either blocking
    and trying to send again when the congestion has abated, or just
    returning to the user with -EAGAIN and letting him retry later.

    This mechanism is prone to starvation, because the wakeup algorithm is
    non-atomic. During the time the link issues a wakeup signal, until the
    socket wakes up and re-attempts sending, other senders may have come
    in between and occupied the free buffer space in the link. This in turn
    may lead to a socket having to make many send attempts before it is
    successful. In extremely loaded systems we have observed latency times
    of several seconds before a low-priority socket is able to send out a
    message.

    In this commit, we simplify this mechanism and reduce the risk of the
    described scenario happening. When an attempt is made to send a message via a
    congested link, we now let it be added to the link's backlog queue
    anyway, thus permitting an oversubscription of one message per source
    socket. We still create a wakeup item and return an error code, hence
    instructing the sender to block or stop sending. Only when enough space
    has been freed up in the link's backlog queue do we issue a wakeup event
    that allows the sender to continue with the next message, if any.

    The fact that a socket now can consider a message sent even when the
    link returns a congestion code means that the sending socket code can
    be simplified. Also, since this is a good opportunity to get rid of the
    obsolete 'mtu change' condition in the three socket send functions, we
    now choose to refactor those functions completely.

    Signed-off-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
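
The oversubscription mechanism described above can be illustrated with a toy model. All names and return codes here are assumptions for illustration; the sketch also assumes that a sender which receives the congestion code blocks until woken, so that each source socket adds at most one extra message.

```python
class Link:
    """Toy model of a link backlog queue with one-message oversubscription."""

    def __init__(self, backlog_limit):
        self.backlog_limit = backlog_limit
        self.backlog = []   # queued messages
        self.wakeups = []   # sockets waiting for a wakeup event

    def xmit(self, sock, msg):
        congested = len(self.backlog) >= self.backlog_limit
        if congested:
            # Register a wakeup item, but still accept the message below.
            self.wakeups.append(sock)
        self.backlog.append(msg)
        return "CONGESTION" if congested else "OK"

    def release(self, count):
        """Free 'count' messages; return the sockets that may now continue."""
        del self.backlog[:count]
        if len(self.backlog) >= self.backlog_limit:
            return []
        woken, self.wakeups = self.wakeups, []
        return woken


link = Link(backlog_limit=2)
r1 = link.xmit("s0", "a")
r2 = link.xmit("s0", "b")
r3 = link.xmit("s1", "c")       # queued anyway, sender told to block
queued = len(link.backlog)
woken_early = link.release(1)   # backlog still at the limit: nobody woken
woken_late = link.release(1)    # space available: s1 is woken
```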
     

28 Nov, 2016

1 commit

  • In commit e4bf4f76962b ("tipc: simplify packet sequence number
    handling") we changed the internal representation of the packet
    sequence number counters from u32 to u16, reflecting what is really
    sent over the wire.

    Since then some link statistics counters have been displaying incorrect
    values, partially because the counters meant to be used as sequence
    number snapshots are now used as direct counters, stored as u32, and
    partially because some counter updates are just missing in the code.

    In this commit we correct this in two ways. First, we base the
    displayed packet sent/received values on direct counters instead
    of as previously a calculated difference between current sequence
    number and a snapshot. Second, we add the missing updates of the
    counters.

    This change is compatible with the current netlink API, and requires
    no changes to the user space tools.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
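
As an illustration of why a direct counter is more robust than a difference between 16-bit sequence numbers, consider the sketch below. It does not reproduce the exact bug fixed by the commit above; the class and field names are hypothetical, and it only shows that a u16 difference cannot represent more than 65535 packets.

```python
def sent_by_snapshot(snd_nxt, snapshot):
    """Packets sent, derived from two 16-bit sequence numbers."""
    return (snd_nxt - snapshot) & 0xFFFF


class LinkStats:
    """Toy link that keeps a 16-bit sequence number and a direct counter."""

    def __init__(self):
        self.snd_nxt = 1     # next sequence number, wraps at 16 bits
        self.sent_pkts = 0   # direct counter, never wraps in this model

    def send(self, n=1):
        self.sent_pkts += n
        self.snd_nxt = (self.snd_nxt + n) & 0xFFFF


stats = LinkStats()
snapshot = stats.snd_nxt
stats.send(70000)
by_difference = sent_by_snapshot(stats.snd_nxt, snapshot)  # wrapped value
by_counter = stats.sent_pkts                               # true value
```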
     

26 Nov, 2016

1 commit

  • Commit 817298102b0b ("tipc: fix link priority propagation") introduced a
    compatibility problem between TIPC versions newer than Linux 4.6 and
    those older than Linux 4.4. In versions later than 4.4, link STATE
    messages only contain a non-zero link priority value when the sender
    wants the receiver to change its priority. This has the effect that the
    receiver resets itself in order to apply the new priority. This works
    well, and is consistent with the said commit.

    However, in versions older than 4.4 a valid link priority is present in
    all sent link STATE messages, leading to cyclic link establishment and
    reset on the 4.6+ node.

    We fix this by adding a test that the received value should not only
    be valid, but also differ from the current value in order to cause the
    receiving link endpoint to reset.

    Reported-by: Amar Nv
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
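
The added test can be sketched as a small predicate. The function name is hypothetical, and the valid priority range of 1 to 31 is an assumption made for this illustration; the only rule taken from the commit text is that the received value must be both valid and different from the current one.

```python
MAX_LINK_PRI = 31  # assumed upper bound of a valid link priority


def should_reset_for_priority(received_prio, current_prio):
    """True only if the received priority is valid and actually differs."""
    valid = 1 <= received_prio <= MAX_LINK_PRI
    return valid and received_prio != current_prio


# An older peer repeating the current priority in every STATE message
# no longer triggers a reset; a genuine change request still does.
repeat = should_reset_for_priority(10, 10)
change = should_reset_for_priority(20, 10)
```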
     

30 Oct, 2016

1 commit

  • In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
    criteria") we tried to fix a problem with the initial synchronization
    of broadcast link acknowledge values. Unfortunately that solution is
    not sufficient to solve the issue.

    We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
    non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
    initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
    broadcast acknowledge numbers, leading to premature opening of the
    broadcast link. When the bypassed packets finally arrive, they are
    inadvertently accepted, and the already correctly initialized
    acknowledge number in the broadcast receive link is overwritten by
    the invalid (zero) value of the said packets. After this the broadcast
    link goes stale.

    We now fix this by marking the packets where we know the acknowledge
    value is or may be invalid, and then ignoring the acks from those.

    To this purpose, we claim an unused bit in the header to indicate that
    the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
    synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
    packets, plus those LINK_PROTOCOL packets sent out before the broadcast
    links are fully synchronized.

    This minor protocol update is fully backwards compatible.

    Reported-by: John Thompson
    Tested-by: John Thompson
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
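
The effect of the new header bit on the receiving side can be sketched as below. The dictionary keys "bc_ack" and "bc_ack_invalid" are hypothetical stand-ins for the header fields, chosen only for this illustration.

```python
def apply_bc_ack(current_ack, hdr):
    """Return the broadcast acknowledge value to keep after receiving hdr.

    Packets marked as carrying an invalid acknowledge value are ignored,
    so a late-arriving initial packet cannot overwrite a good value.
    """
    if hdr.get("bc_ack_invalid", False):
        return current_ack
    return hdr["bc_ack"]


ack = 57
ack = apply_bc_ack(ack, {"bc_ack": 0, "bc_ack_invalid": True})   # ignored
after_bypassed_packet = ack
ack = apply_bc_ack(ack, {"bc_ack": 60})                          # accepted
after_valid_packet = ack
```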
     

03 Sep, 2016

3 commits

  • Because of the risk of an excessive number of NACK messages and
    retransmissions, receivers have until now abstained from sending
    broadcast NACKS directly upon detection of a packet sequence number
    gap. We have instead relied on such gaps being detected by link
    protocol STATE message exchange, something that by necessity delays
    such detection and subsequent retransmissions.

    With the introduction of unicast NACK transmission and rate control
    of retransmissions we can now remove this limitation. We now allow
    receiving nodes to send NACKS immediately, while coordinating the
    permission to do so among the nodes in order to avoid NACK storms.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As cluster sizes grow, so does the amount of identical or overlapping
    broadcast NACKs generated by the packet receivers. This often leads to
    'NACK crunches' resulting in huge numbers of redundant retransmissions
    of the same packet ranges.

    In this commit, we introduce rate control of broadcast retransmissions,
    so that a retransmitted range cannot be retransmitted again until after
    at least 10 ms. This reduces the frequency of duplicate, redundant
    retransmissions by an order of magnitude, while having a significant
    positive impact on overall throughput and scalability.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
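
A minimal sketch of the rate control described above follows. The class and method names are hypothetical, and keying the limiter on the exact (first, last) range is a simplification; only the 10 ms minimum interval is taken from the commit text.

```python
MIN_RETRANS_INTERVAL_MS = 10  # from the commit text


class RetransLimiter:
    """Suppress repeated retransmission of the same packet range."""

    def __init__(self):
        self.last_sent = {}  # (first, last) -> time of last retransmission

    def allow(self, first, last, now_ms):
        key = (first, last)
        prev = self.last_sent.get(key)
        if prev is not None and now_ms - prev < MIN_RETRANS_INTERVAL_MS:
            return False  # same range was retransmitted too recently
        self.last_sent[key] = now_ms
        return True


limiter = RetransLimiter()
decisions = [
    limiter.allow(5, 9, now_ms=0),     # first NACK for the range
    limiter.allow(5, 9, now_ms=4),     # duplicate NACK 4 ms later
    limiter.allow(10, 12, now_ms=4),   # a different range
    limiter.allow(5, 9, now_ms=10),    # 10 ms have passed
]
```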
     
  • When we send broadcasts in clusters of more than 70-80 nodes, we sometimes
    see the broadcast link resetting because of an excessive number of
    retransmissions. This is caused by a combination of two factors:

    1) A 'NACK crunch', where loss of broadcast packets is discovered
    and NACK'ed by several nodes simultaneously, leading to multiple
    redundant broadcast retransmissions.

    2) The fact that the NACKS as such also are sent as broadcast, leading
    to excessive load and packet loss on the transmitting switch/bridge.

    This commit deals with the latter problem, by moving sending of
    broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
    to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
    bits in word 8 of the said message for this purpose, and introduce a
    new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
    backwards compatible.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

19 Aug, 2016

1 commit

  • When an attempt is made to wake up a link after congestion, it uses
    different, more generous criteria than when it was originally declared
    congested. This has the effect that the link, and the sending process,
    sometimes will be woken up unnecessarily, just to immediately return to
    congestion when it turns out there is not enough space in its send queue
    to host the pending message. This is a waste of CPU cycles.

    We now change the function link_prepare_wakeup() to use exactly the same
    criteria as tipc_link_xmit(). However, since we are now excluding the
    window limit from the wakeup calculation, and the current backlog limit
    for the lowest level is too small to house even a single maximum-size
    message, we have to expand this limit. We do this by evaluating an
    alternative, minimum value during the setting of the importance limits.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

24 Jul, 2016

1 commit


12 Jul, 2016

2 commits

  • After a new receiver peer has been added to the broadcast transmission
    link, we allow immediate transmission of new broadcast packets, trusting
    that the new peer will not accept the packets until it has received the
    previously sent unicast broadcast initialization message. In the same
    way, the sender must not accept any acknowledges until it has itself
    received the broadcast initialization from the peer, as well as
    confirmation of the reception of its own initialization message.

    Furthermore, when a receiver peer goes down, the sender has to produce
    the missing acknowledges from the lost peer locally, in order to ensure
    correct release of the buffers that were expected to be acknowledged by
    the said peer.

    In a highly stressed system we have observed that contact with a peer
    may come up and be lost before the above mentioned broadcast
    initialization and confirmation have been received. This leads to the locally
    produced acknowledges being rejected, and the non-acknowledged buffers
    to linger in the broadcast link transmission queue until it fills up
    and the link goes into permanent congestion.

    In this commit, we remedy this by temporarily setting the corresponding
    broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
    state to true before we issue the local acknowledges. This ensures that
    those acknowledges will always be accepted. The mentioned state values
    are restored immediately afterwards when the link is reset.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • At first contact between two nodes, an endpoint might sometimes have
    time to send out a LINK_PROTOCOL/STATE packet before it has received
    the broadcast initialization packet from the peer, i.e., before it has
    received a valid broadcast packet number to add to the 'bc_ack' field
    of the protocol message.

    This means that the peer endpoint will receive a protocol packet with an
    invalid broadcast acknowledge value of 0. Under unlucky circumstances
    this may lead to the original, already received acknowledge value being
    overwritten, so that the whole broadcast link goes stale after a while.

    We fix this by delaying the setting of the link field 'bc_peer_is_up'
    until we know that the peer really has received our own broadcast
    initialization message. The latter is always sent out as the first
    unicast message on a link, and always with sequence number 1. Because
    of this, we only need to look for a non-zero unicast acknowledge value
    in the arriving STATE messages, and once that is confirmed we know we
    are safe and can set the mentioned field. Before this moment, we must
    ignore all broadcast acknowledges from the peer.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

30 Jun, 2016

1 commit


16 Jun, 2016

2 commits

  • net/tipc/link.c: In function ‘tipc_link_timeout’:
    net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]

    Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()")
    Acked-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     
  • TIPC based clusters are by default set up with full-mesh link
    connectivity between all nodes. Those links are expected to provide
    a short failure detection time, by default set to 1500 ms. Because
    of this, the background load for neighbor monitoring in an N-node
    cluster increases with a factor N on each node, while the overall
    monitoring traffic through the network infrastructure increases at
    a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
    scale well beyond ~100 nodes unless we significantly increase failure
    discovery tolerance.

    This commit introduces a framework and an algorithm that drastically
    reduces this background load, while basically maintaining the original
    failure detection times across the whole cluster. Using this algorithm,
    background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
    at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
    now have to actively monitor 38 neighbors in a 400-node cluster, instead
    of as before 399.

    This "Overlapping Ring Supervision Algorithm" is completely distributed
    and employs no centralized or coordinated state. It goes as follows:

    - Each node makes up a linearly ascending, circular list of all its N
    known neighbors, based on their TIPC node identity. This algorithm
    must be the same on all nodes.

    - The node then selects the next M = sqrt(N) - 1 nodes downstream from
    itself in the list, and chooses to actively monitor those. This is
    called its "local monitoring domain".

    - It creates a domain record describing the monitoring domain, and
    piggy-backs this in the data area of all neighbor monitoring messages
    (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
    the cluster eventually (default within 400 ms) will learn about
    its monitoring domain.

    - Whenever a node discovers a change in its local domain, e.g., a node
    has been added or has gone down, it creates and sends out a new
    version of its node record to inform all neighbors about the change.

    - A node receiving a domain record from anybody outside its local domain
    matches this against its own list (which may not look the same), and
    chooses to not actively monitor those members of the received domain
    record that are also present in its own list. Instead, it relies on
    indications from the direct monitoring nodes if an indirectly
    monitored node has gone up or down. If a node is indicated lost, the
    receiving node temporarily activates its own direct monitoring towards
    that node in order to confirm, or not, that it is actually gone.

    - Since each node is actively monitoring sqrt(N) downstream neighbors,
    each node is also actively monitored by the same number of upstream
    neighbors. This means that all non-direct monitoring nodes normally
    will receive sqrt(N) indications that a node is gone.

    - A major drawback with ring monitoring is how it handles failures that
    cause massive network partitionings. If both a lost node and all its
    direct monitoring neighbors are inside the lost partition, the nodes in
    the remaining partition will never receive indications about the loss.
    To overcome this, each node also chooses to actively monitor some
    nodes outside its local domain. Those nodes are called remote domain
    "heads", and are selected in such a way that no node in the cluster
    will be more than two direct monitoring hops away. Because of this,
    each node, apart from monitoring the members of its local domain, will
    also typically monitor sqrt(N) remote head nodes.

    - As an optimization, local list status, domain status and domain
    records are marked with a generation number. This saves senders from
    unnecessarily conveying unaltered domain records, and receivers from
    performing unneeded re-adaptations of their node monitoring list, such
    as re-assigning domain heads.

    - As a measure of caution we have added the possibility to disable the
    new algorithm through configuration. We do this by keeping a threshold
    value for the cluster size; a cluster that grows beyond this value
    will switch from full-mesh to ring monitoring, and vice versa when
    it shrinks below the value. This means that if the threshold is set to
    a value larger than any anticipated cluster size (default size is 32)
    the new algorithm is effectively disabled. A patch set for altering the
    threshold value and for listing the table contents will follow shortly.

    - This change is fully backwards compatible.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
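
The arithmetic and the local domain selection described above can be sketched as follows. This is an illustration under assumptions: N is taken as the cluster size including the own node, the number of remote heads is taken to equal the local domain size (which reproduces the 38-neighbor figure for 400 nodes), and all function names are hypothetical.

```python
import math


def full_mesh_monitored(n):
    """Neighbors actively monitored per node with full-mesh supervision."""
    return n - 1


def ring_monitored(n):
    """Approximate neighbors actively monitored per node with ring supervision."""
    local_members = math.isqrt(n) - 1   # M = sqrt(N) - 1, from the text
    remote_heads = local_members        # assumption made for this sketch
    return local_members + remote_heads


def local_domain(self_id, node_ids):
    """Select the next M nodes downstream in the sorted, circular node list."""
    ring = sorted(set(node_ids) | {self_id})
    m = math.isqrt(len(ring)) - 1
    i = ring.index(self_id)
    return [ring[(i + k) % len(ring)] for k in range(1, m + 1)]


before = full_mesh_monitored(400)
after = ring_monitored(400)
# Node 15 in a 16-node cluster wraps around the end of the list.
domain = local_domain(15, range(1, 17))
```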
     

09 Jun, 2016

1 commit

  • The node keepalive interval is recalculated at each timer expiration
    to catch any changes in the link tolerance, and stored in a field in
    struct tipc_node. We use jiffies as unit for the stored value.

    This is suboptimal, because it makes the calculation unnecessarily
    complex, including two unit conversions. The conversions also lead to
    a rounding error that causes the link "abort limit" to be 3 in the
    normal case, instead of 4, as intended. This again leads to unnecessary
    link resets when the network is pushed close to its limit, e.g., in an
    environment with hundreds of nodes or namespaces.

    In this commit, we do instead let the keepalive value be calculated and
    stored in milliseconds, so that there is only one conversion and the
    rounding error is eliminated.

    We also remove a redundant "keepalive" field in struct tipc_link. This
    is a remnant from the previous implementation.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
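
The rounding error can be reproduced with a small worked example. The sketch assumes a tick rate of HZ = 250, a keepalive interval of one quarter of the 1500 ms default tolerance, and a millisecond-to-jiffies conversion that rounds up; the helper functions are simplified models written for this illustration, not the kernel implementations.

```python
HZ = 250             # assumed timer tick rate
TOLERANCE_MS = 1500  # default link tolerance


def msecs_to_jiffies(ms):
    """Model of a conversion that rounds up to whole jiffies."""
    return (ms * HZ + 999) // 1000


def jiffies_to_msecs(jiffies):
    return jiffies * 1000 // HZ


def abort_limit_via_jiffies(tolerance_ms):
    """Old scheme: store the interval in jiffies, convert back for the limit."""
    keepalive_jiffies = msecs_to_jiffies(tolerance_ms // 4)
    return tolerance_ms // jiffies_to_msecs(keepalive_jiffies)


def abort_limit_via_msecs(tolerance_ms):
    """New scheme: keep the interval in milliseconds throughout."""
    keepalive_ms = tolerance_ms // 4
    return tolerance_ms // keepalive_ms


old_limit = abort_limit_via_jiffies(TOLERANCE_MS)  # 375 ms -> 94 jiffies -> 376 ms
new_limit = abort_limit_via_msecs(TOLERANCE_MS)
```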
     

25 Apr, 2016

1 commit

  • Commit 42b18f605fea ("tipc: refactor function tipc_link_timeout()"),
    introduced a bug which prevents sending of probe messages during
    link synchronization phase. This leads to hanging links, if the
    bearer is disabled/enabled after links are up.

    In this commit, we send the probe messages correctly.

    Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()")
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

16 Apr, 2016

4 commits

  • According to the link FSM, a received traffic packet can take a link
    from state ESTABLISHING to ESTABLISHED, but the link can still not be
    fully set up in one atomic operation. This means that even if the
    very first packet on the link is a traffic packet with sequence number
    1 (one), it has to be dropped and retransmitted.

    This can be avoided if we let the mentioned packet be preceded by a
    LINK_PROTOCOL/STATE message, which takes up the endpoint before the
    arrival of the traffic.

    We add this small feature in this commit.

    This is a fully compatible change.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The function tipc_link_timeout() is unnecessarily complex, and can
    easily be made more readable.

    We do that with this commit. The only functional change is that we
    remove a redundant test for whether the broadcast link is up or not.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When a link is down, it will continuously try to re-establish contact
    with the peer by sending out a RESET or an ACTIVATE message at each
    timeout interval. The default value for this interval is currently
    375 ms. This is wasteful, and may become a problem in very large
    clusters with dozens or hundreds of nodes being down simultaneously.

    We now introduce a simple backoff algorithm for these cases. The
    first five messages are sent at default rate; thereafter a message
    is sent only each 16th timer interval.

    This will cover the vast majority of link recycling cases, since the
    endpoint starting last will transmit at the higher speed, and the link
    should normally be established well before the rate needs to be
    reduced.

    The only case where we will see a degradation of link re-establishment
    times is when the endpoints remain intact, and a glitch in the
    transmission media is causing the link reset. We will then experience
    a worst-case re-establishing time of 6 seconds, something we deem
    acceptable.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
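
The backoff rule and the 6 second worst case can be checked with a short sketch. The function name and the convention that the tick counter starts at 1 when the link goes down are assumptions made for this illustration.

```python
TIMER_INTERVAL_MS = 375  # default timeout interval from the text


def should_send_setup_msg(tick):
    """tick counts timer expirations since the link went down, starting at 1."""
    return tick <= 5 or tick % 16 == 0


# Timer expirations at which a RESET or ACTIVATE message is sent.
send_ticks = [t for t in range(1, 49) if should_send_setup_msg(t)]

# Once the backoff is active, messages are 16 intervals apart.
worst_case_gap_ms = 16 * TIMER_INTERVAL_MS
```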
     
  • When a link endpoint is going down locally, e.g., because its interface
    is being stopped, it will spontaneously send out a RESET message to
    its peer, informing it about this fact. This saves the peer from
    detecting the failure via probing, and hence gives both speedier and
    less resource consuming failure detection on the peer side.

    According to the link FSM, a receiver of a RESET message, ignoring the
    reason for it, must now consider the sender ready to come back up, and
    starts periodically sending out ACTIVATE messages to the peer in order
    to re-establish the link. Also, according to the FSM, the receiver of
    an ACTIVATE message can now go directly to state ESTABLISHED and start
    sending regular traffic packets. This is a well-proven and robust FSM.

    However, in the case of a reboot, there is a small possibility that the
    link endpoint on the rebooted node may have been re-created with a new bearer
    identity between the moment it sent its (pre-boot) RESET and the moment
    it receives the ACTIVATE from the peer. The new bearer identity cannot
    be known by the peer according to this scenario, since traffic headers
    don't convey such information. This is a problem, because both endpoints
    need to know the correct value of the peer's bearer id at any moment in
    time in order to be able to produce correct link events for their users.

    The only way to guarantee this is to enforce a full setup message
    exchange (RESET + ACTIVATE) even after the reboot, since those messages
    carry the bearer identity in their header.

    In this commit we do this by introducing and setting a "stopping" bit in
    the header of the spontaneously generated RESET messages, informing the
    peer that the sender will not be immediately ready to re-establish the
    link. A receiver seeing this bit must act as if this were a locally
    detected connectivity failure, and hence has to go through a full two-
    way setup message exchange before any link can be re-established.

    Although never reported, this problem seems to have always been around.

    This protocol addition is fully backwards compatible.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

08 Mar, 2016

1 commit


07 Mar, 2016

1 commit

  • Until now, we have kept a pre-allocated protocol message header
    aggregated into struct tipc_link. Apart from adding unnecessary
    footprint to the link instances, this requires extra code both to
    initialize and re-initialize it.

    We now remove this sub-optimization. This change also makes it
    possible to clean up the function tipc_build_proto_msg() and remove
    a couple of small functions that were accessing the mentioned header.
    In particular, we can replace all occurrences of the local function
    call link_own_addr(link) with the generic tipc_own_addr(net).

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

23 Feb, 2016

1 commit


20 Feb, 2016

1 commit


17 Feb, 2016

1 commit


06 Feb, 2016

2 commits

  • Currently link priority changes aren't handled for active links. In
    this patch we resolve this by changing our priority if the peer passes
    a valid priority in a state message.

    Reviewed-by: Jon Maloy
    Signed-off-by: Richard Alpe
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Changing certain link attributes (link tolerance and link priority)
    from the TIPC management tool is supposed to automatically take
    effect at both endpoints of the affected link.

    Currently the media address is not instantiated for the link and is
    used uninstantiated when crafting protocol messages designated for the
    peer endpoint. This means that changing a link property currently
    results in the property being changed on the local machine but the
    protocol message designated for the peer gets lost, resulting in a
    property discrepancy between the endpoints.

    In this patch we resolve this by using the media address from the
    link entry and using the bearer transmit function to send it. Hence,
    we can now eliminate the redundant function tipc_link_prot_xmit() and
    the redundant field tipc_link::media_addr.

    Fixes: 2af5ae372a4b (tipc: clean up unused code and structures)
    Reviewed-by: Jon Maloy
    Reported-by: Jason Hu
    Signed-off-by: Richard Alpe
    Signed-off-by: David S. Miller

    Richard Alpe
     

04 Dec, 2015

1 commit


21 Nov, 2015

6 commits

  • Since commit 5266698661401afc5e ("tipc: let broadcast packet
    reception use new link receive function") the broadcast send
    link state was meant to always be set to LINK_ESTABLISHED, since
    we don't need this link to follow the regular link FSM rules. It
    was also the intention that this state anyway shouldn't impact
    the run-time working state of the link, since the latter in
    reality is controlled by the number of registered peers.

    We have now discovered that this assumption is not quite correct.
    If the broadcast link is reset because of too many retransmissions,
    its state will inadvertently go to LINK_RESETTING, and never go
    back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
    anticipated. This will work well once, but if it happens a second
    time, the reset on a link in LINK_RESETTING has no effect, and
    neither the broadcast link nor the unicast links will go down as
    they should.

    Furthermore, it is confusing that the management tool shows that
    this link is in UP state when that obviously isn't the case.

    We now ensure that this state strictly follows the true working
    state of the link. The state is set to LINK_ESTABLISHED when
    the number of peers is non-zero, and to LINK_RESET otherwise.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The number of variables with Hungarian notation (l_ptr, n_ptr etc.)
    has been significantly reduced over the last couple of years.

    We now root out the last traces of this practice.
    There are no functional changes in this commit.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We move the definition of struct tipc_link from link.h to link.c in
    order to minimize its exposure to the rest of the code.

    When needed, we define new functions to make it possible for external
    entities to access and set data in the link.

    Apart from the above, there are no functional changes.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • In our effort to have less code and include dependencies between
    entities such as node, link and bearer, we try to narrow down
    the exposed interface towards the node as much as possible.

    In this commit, we move the definition of struct tipc_node, along
    with many of its associated function declarations, from node.h to
    node.c. We also move some function definitions from link.c and
    name_distr.c to node.c, since they access fields in struct tipc_node
    that should not be externally visible. The moved functions are renamed
    according to new location, and made static whenever possible.

    There are no functional changes in this commit.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • According to the node FSM a node in state SELF_UP_PEER_UP cannot
    change state inside a lock context, except when a TUNNEL_PROTOCOL
    (SYNCH or FAILOVER) packet arrives. However, the node's individual
    links may still change state.

    Since each link now is protected by its own spinlock, we finally have
    the conditions in place to convert the node spinlock to an rwlock_t.
    If the node state and arriving packet type are right, we can let the
    link directly receive the packet under protection of its own spinlock
    and the node lock in read mode. In all other cases we use the node
    lock in write mode. This enables full concurrent execution between
    parallel links during steady-state traffic situations, i.e., 99+ %
    of the time.

    This commit implements this change.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a preparation to allow parallel links to work more independently
    from each other we introduce a per-link spinlock, to be stored in the
    node struct's link entry area. Since the node lock still is a regular
    spinlock there is no increase in parallelism at this stage.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy