23 Jan, 2020

1 commit

  • commit dca4a17d24ee9d878836ce5eb8dc25be1ffa5729 upstream.

    In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast"), we allow instant switching between replicast and broadcast
    by sending a dummy 'SYN' packet on the last used link to synchronize
    packets on the links. The 'SYN' message is an object of link congestion
    also, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be sent
    back to the socket...
    However, in that commit, we simply use the same socket 'cong_link_cnt'
    counter for both the 'SYN' & normal payload message sending. Therefore,
    if both the replicast & broadcast links are congested, the counter will
    be not updated correctly but overwritten by the latter congestion.
    Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is
    reduced one by one and eventually overflowed. Consequently, further
    activities on the socket will only wait for the false congestion signal
    to disappear but never been met.

    Because sending the 'SYN' message is vital for the mechanism, it should
    be done anyway. This commit fixes the issue by marking the message with
    an error code e.g. 'TIPC_ERR_NO_PORT', so its sending should not face a
    link congestion, there is no need to touch the socket 'cong_link_cnt'
    either. In addition, in the event of any error (e.g. -ENOBUFS), we will
    purge the entire payload message queue and make a return immediately.

    Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tuong Lien
     

19 Aug, 2019

1 commit

  • The policy for handling the skb list locks on the send and receive paths
    is simple.

    - On the send path we never need to grab the lock on the 'xmitq' list
    when the destination is an exernal node.

    - On the receive path we always need to grab the lock on the 'inputq'
    list, irrespective of source node.

    However, when transmitting node local messages those will eventually
    end up on the receive path of a local socket, meaning that the argument
    'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the
    function tipc_sk_rcv(). This has been handled by always initializing
    the spinlock of the 'xmitq' list at message creation, just in case it
    may end up on the receive path later, and despite knowing that the lock
    in most cases never will be used.

    This approach is inaccurate and confusing, and has also concealed the
    fact that the stated 'no lock grabbing' policy for the send path is
    violated in some cases.

    We now clean up this by never initializing the lock at message creation,
    instead doing this at the moment we find that the message actually will
    enter the receive path. At the same time we fix the four locations
    where we incorrectly access the spinlock on the send/error path.

    This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
    is initialised") which has now become redundant.

    CC: Eric Dumazet
    Reported-by: Chris Packham
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Reviewed-by: Xin Long
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Aug, 2019

1 commit

  • Since node internal messages are passed directly to the socket, it is not
    possible to observe those messages via tcpdump or wireshark.

    We now remedy this by making it possible to clone such messages and send
    the clones to the loopback interface. The clones are dropped at reception
    and have no functional role except making the traffic visible.

    The feature is enabled if network taps are active for the loopback device.
    pcap filtering restrictions require the messages to be presented to the
    receiving side of the loopback device.

    v3 - Function dev_nit_active used to check for network taps.
    - Procedure netif_rx_ni used to send cloned messages to loopback device.

    Signed-off-by: John Rutherford
    Acked-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    John Rutherford
     

26 Jun, 2019

1 commit


05 Apr, 2019

1 commit

  • skb somehow dequeued out of inputq before processing, it causes to
    NULL pointer and kernel crashed.

    Add checking skb valid before using.

    Fixes: c55c8edafa9 ("tipc: smooth change between replicast and broadcast")
    Reported-by: Tuong Lien Tong
    Acked-by: Ying Xue
    Signed-off-by: Hoang Le
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Le
     

27 Mar, 2019

1 commit


22 Mar, 2019

1 commit

  • In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast") we introduced new method to eliminate the risk of message
    reordering that happen in between different nodes.
    Unfortunately, we forgot checking at receiving side to ignore intra node.

    We fix this by checking and returning if arrived message from intra node.

    syzbot report:

    ==================================================================
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
    Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
    24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
    3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
    RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
    RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
    RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
    R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
    R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
    FS: 000000000106a880(0000) GS:ffff8880ae800000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
    tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
    tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
    tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
    tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
    tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
    __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
    tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
    sock_sendmsg_nosec net/socket.c:651 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:661
    ___sys_sendmsg+0x806/0x930 net/socket.c:2260
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
    __do_sys_sendmsg net/socket.c:2307 [inline]
    __se_sys_sendmsg net/socket.c:2305 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4401c9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
    48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
    3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
    RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
    R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace ba79875754e1708f ]---

    Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
    Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

20 Mar, 2019

2 commits

  • Currently, a multicast stream may start out using replicast, because
    there are few destinations, and then it should ideally switch to
    L2/broadcast IGMP/multicast when the number of destinations grows beyond
    a certain limit. The opposite should happen when the number decreases
    below the limit.

    To eliminate the risk of message reordering caused by method change,
    a sending socket must stick to a previously selected method until it
    enters an idle period of 5 seconds. Means there is a 5 seconds pause
    in the traffic from the sender socket.

    If the sender never makes such a pause, the method will never change,
    and transmission may become very inefficient as the cluster grows.

    With this commit, we allow such a switch between replicast and
    broadcast without any need for a traffic pause.

    Solution is to send a dummy message with only the header, also with
    the SYN bit set, via broadcast or replicast. For the data message,
    the SYN bit is set and sending via replicast or broadcast (inverse
    method with dummy).

    Then, at receiving side any messages follow first SYN bit message
    (data or dummy message), they will be held in deferred queue until
    another pair (dummy or data message) arrived in other link.

    v2: reverse christmas tree declaration

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     
  • Currently, a multicast stream uses either broadcast or replicast as
    transmission method, based on the ratio between number of actual
    destinations nodes and cluster size.

    However, when an L2 interface (e.g., VXLAN) provides pseudo
    broadcast support, this becomes very inefficient, as it blindly
    replicates multicast packets to all cluster/subnet nodes,
    irrespective of whether they host actual target sockets or not.

    The TIPC multicast algorithm is able to distinguish real destination
    nodes from other nodes, and hence provides a smarter and more
    efficient method for transferring multicast messages than
    pseudo broadcast can do.

    Because of this, we now make it possible for users to force
    the broadcast link to permanently switch to using replicast,
    irrespective of which capabilities the bearer provides,
    or pretend to provide.
    Conversely, we also make it possible to force the broadcast link
    to always use true broadcast. While maybe less useful in
    deployed systems, this may at least be useful for testing the
    broadcast algorithm in small clusters.

    We retain the current AUTOSELECT ability, i.e., to let the broadcast link
    automatically select which algorithm to use, and to switch back and forth
    between broadcast and replicast as the ratio between destination
    node number and cluster size changes. This remains the default method.

    Furthermore, we make it possible to configure the threshold ratio for
    such switches. The default ratio is now set to 10%, down from 25% in the
    earlier implementation.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

04 Sep, 2018

1 commit


28 Jul, 2018

1 commit


08 Mar, 2018

1 commit


02 Dec, 2017

1 commit

  • When sending node local messages the code is using an 'mtu' of 66060
    bytes to avoid unnecessary fragmentation. During situations of low
    memory tipc_msg_build() may sometimes fail to allocate such large
    buffers, resulting in unnecessary send failures. This can easily be
    remedied by falling back to a smaller MTU, and then reassemble the
    buffer chain as if the message were arriving from a remote node.

    At the same time, we change the initial MTU setting of the broadcast
    link to a lower value, so that large messages always are fragmented
    into smaller buffers even when we run in single node mode. Apart from
    obtaining the same advantage as for the 'fallback' solution above, this
    turns out to give a significant performance improvement. This can
    probably be explained with the __pskb_copy() operation performed on the
    buffer for each recipient during reception. We found the optimal value
    for this, considering the most relevant skb pool, to be 3744 bytes.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

13 Oct, 2017

1 commit

  • We often see a need for a linked list of destination identities,
    sometimes containing a port number, sometimes a node identity, and
    sometimes both. The currently defined struct u32_list is not generic
    enough to cover all cases, so we extend it to contain two u32 integers
    and rename it to struct tipc_dest_list.

    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Oct, 2017

1 commit

  • We change the initialization of the skb transmit buffer queues
    in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
    initialize their spinlocks. This is needed because we may, during
    error conditions, need to call skb_queue_purge() on those queues
    further down the stack.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

21 Jan, 2017

4 commits

  • If the bearer carrying multicast messages supports broadcast, those
    messages will be sent to all cluster nodes, irrespective of whether
    these nodes host any actual destinations socket or not. This is clearly
    wasteful if the cluster is large and there are only a few real
    destinations for the message being sent.

    In this commit we extend the eligibility of the newly introduced
    "replicast" transmit option. We now make it possible for a user to
    select which method he wants to be used, either as a mandatory setting
    via setsockopt(), or as a relative setting where we let the broadcast
    layer decide which method to use based on the ratio between cluster
    size and the message's actual number of destination nodes.

    In the latter case, a sending socket must stick to a previously
    selected method until it enters an idle period of at least 5 seconds.
    This eliminates the risk of message reordering caused by method change,
    i.e., when changes to cluster size or number of destinations would
    otherwise mandate a new method to be used.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC multicast messages are currently carried over a reliable
    'broadcast link', making use of the underlying media's ability to
    transport packets as L2 broadcast or IP multicast to all nodes in
    the cluster.

    When the used bearer is lacking that ability, we can instead emulate
    the broadcast service by replicating and sending the packets over as
    many unicast links as needed to reach all identified destinations.
    We now introduce a new TIPC link-level 'replicast' service that does
    this.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a further preparation for the upcoming 'replicast' functionality,
    we add some necessary structs and functions for looking up and returning
    a list of all nodes that host destinations for a given multicast message.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a preparation for the 'replicast' functionality we are going to
    introduce in the next commits, we need the broadcast base structure to
    store whether bearer broadcast is available at all from the currently
    used bearer or bearers.

    We do this by adding a new function tipc_bearer_bcast_support() to
    the bearer layer, and letting the bearer selection function in
    bcast.c use this to give a new boolean field, 'bcast_support' the
    appropriate value.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

04 Jan, 2017

1 commit

  • The socket code currently handles link congestion by either blocking
    and trying to send again when the congestion has abated, or just
    returning to the user with -EAGAIN and let him re-try later.

    This mechanism is prone to starvation, because the wakeup algorithm is
    non-atomic. During the time the link issues a wakeup signal, until the
    socket wakes up and re-attempts sending, other senders may have come
    in between and occupied the free buffer space in the link. This in turn
    may lead to a socket having to make many send attempts before it is
    successful. In extremely loaded systems we have observed latency times
    of several seconds before a low-priority socket is able to send out a
    message.

    In this commit, we simplify this mechanism and reduce the risk of the
    described scenario happening. When a message is attempted sent via a
    congested link, we now let it be added to the link's backlog queue
    anyway, thus permitting an oversubscription of one message per source
    socket. We still create a wakeup item and return an error code, hence
    instructing the sender to block or stop sending. Only when enough space
    has been freed up in the link's backlog queue do we issue a wakeup event
    that allows the sender to continue with the next message, if any.

    The fact that a socket now can consider a message sent even when the
    link returns a congestion code means that the sending socket code can
    be simplified. Also, since this is a good opportunity to get rid of the
    obsolete 'mtu change' condition in the three socket send functions, we
    now choose to refactor those functions completely.

    Signed-off-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

30 Oct, 2016

1 commit

  • In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
    criteria") we tried to fix a problem with the initial synchronization
    of broadcast link acknowledge values. Unfortunately that solution is
    not sufficient to solve the issue.

    We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
    non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
    initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
    broadcast acknowledge numbers, leading to premature opening of the
    broadcast link. When the bypassed packets finally arrive, they are
    inadvertently accepted, and the already correctly initialized
    acknowledge number in the broadcast receive link is overwritten by
    the invalid (zero) value of the said packets. After this the broadcast
    link goes stale.

    We now fix this by marking the packets where we know the acknowledge
    value is or may be invalid, and then ignoring the acks from those.

    To this purpose, we claim an unused bit in the header to indicate that
    the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
    synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
    packets, plus those LINK_PROTOCOL packets sent out before the broadcast
    links are fully synchronized.

    This minor protocol update is fully backwards compatible.

    Reported-by: John Thompson
    Tested-by: John Thompson
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

03 Sep, 2016

1 commit

  • When we send broadcasts in clusters of more 70-80 nodes, we sometimes
    see the broadcast link resetting because of an excessive number of
    retransmissions. This is caused by a combination of two factors:

    1) A 'NACK crunch", where loss of broadcast packets is discovered
    and NACK'ed by several nodes simultaneously, leading to multiple
    redundant broadcast retransmissions.

    2) The fact that the NACKS as such also are sent as broadcast, leading
    to excessive load and packet loss on the transmitting switch/bridge.

    This commit deals with the latter problem, by moving sending of
    broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
    to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
    bits in word 8 of the said message for this purpose, and introduce a
    new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
    backwards compatible.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

07 Mar, 2016

1 commit

  • Until now, we have kept a pre-allocated protocol message header
    aggregated into struct tipc_link. Apart from adding unnecessary
    footprint to the link instances, this requires extra code both to
    initialize and re-initialize it.

    We now remove this sub-optimization. This change also makes it
    possible to clean up the function tipc_build_proto_msg() and remove
    a couple of small functions that were accessing the mentioned header.
    In particular, we can replace all occurrences of the local function
    call link_own_addr(link) with the generic tipc_own_addr(net).

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

21 Nov, 2015

1 commit

  • We move the definition of struct tipc_link from link.h to link.c in
    order to minimize its exposure to the rest of the code.

    When needed, we define new functions to make it possible for external
    entities to access and set data in the link.

    Apart from the above, there are no functional changes.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

24 Oct, 2015

11 commits

  • After the previous changes in this series, we can now remove some
    unused code and structures, both in the broadcast, link aggregation
    and link code.

    There are no functional changes in this commit.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • With the recent commit series, we have established a one-way dependency
    between the link aggregation (struct tipc_node) instances and their
    pertaining tipc_link instances. This has enabled quite significant code
    and structure simplifications.

    In this commit, we eliminate the field 'owner', which points to an
    instance of struct tipc_node, from struct tipc_link, and replace it with
    a pointer to struct net, which is the only external reference now needed
    by a link instance.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have only been supporting a fix MTU size of 1500 bytes
    for all broadcast media, irrespective of their actual capability.

    We now make the broadcast MTU adaptable to the carrying media, i.e.,
    we use the smallest MTU supported by any of the interfaces attached
    to TIPC.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have been keeping track of the exact set of broadcast
    destinations though the help structure tipc_node_map. This leads us to
    have to maintain a whole infrastructure for supporting this, including
    a pseudo-bearer and a number of functions to manipulate both the bearers
    and the node map correctly. Apart from the complexity, this approach is
    also limiting, as struct tipc_node_map only can support cluster local
    broadcast if we want to avoid it becoming excessively large. We want to
    eliminate this limitation, in order to enable introduction of scoped
    multicast in the future.

    A closer analysis reveals that it is unnecessary maintaining this "full
    set" overview; it is sufficient to keep a counter per bearer, indicating
    how many nodes can be reached via this bearer at the moment. The protocol
    is now robust enough to handle transitional discrepancies between the
    nominal number of reachable destinations, as expected by the broadcast
    protocol itself, and the number which is actually reachable at the
    moment. The initial broadcast synchronization, in conjunction with the
    retransmission mechanism, ensures that all packets will eventually be
    acknowledged by the correct set of destinations.

    This commit introduces these changes.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The code path for receiving broadcast packets is currently distinct
    from the unicast path. This leads to unnecessary code and data
    duplication, something that can be avoided with some effort.

    We now introduce separate per-peer tipc_link instances for handling
    broadcast packet reception. Each receive link keeps a pointer to the
    common, single, broadcast link instance, and can hence handle release
    and retransmission of send buffers as if they belonged to the own
    instance.

    Furthermore, we let each unicast link instance keep a reference to both
    the pertaining broadcast receive link, and to the common send link.
    This makes it possible for the unicast links to easily access data for
    broadcast link synchronization, as well as for carrying acknowledges for
    received broadcast packets.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have tried to support both the newer, dedicated broadcast
    synchronization mechanism along with the older, less safe, RESET_MSG/
    ACTIVATE_MSG based one. The latter method has turned out to be a hazard
    in a highly dynamic cluster, so we find it safer to disable it completely
    when we find that the former mechanism is supported by the peer node.

    For this purpose, we now introduce a new capabability bit,
    TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
    syncronization is supported by the present node. The new bit is conveyed
    between peers in the 'capabilities' field of neighbor discovery messages.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • This commit simplifies the broadcast link transmission function, by
    leveraging previous changes to the link transmission function and the
    broadcast transmission link life cycle.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The broadcast link instance (struct tipc_link) used for sending is
    currently aggregated into struct tipc_bclink. This means that we cannot
    use the regular tipc_link_create() function for initiating the link, but
    do instead have to initiate numerous fields directly from the
    bcast_init() function.

    We want to reduce dependencies between the broadcast functionality
    and the inner workings of tipc_link. In this commit, we introduce
    a new function tipc_bclink_create() to link.c, and allocate the
    instance of the link separately using this function.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The broadcast transmission link is currently instantiated when the
    network subsystem is started, i.e., on order from user space via netlink.

    This forces the broadcast transmission code to do unnecessary tests for
    the existence of the transmission link, as well in single mode node as
    in network mode.

    In this commit, we do instead create the link during initialization of
    the name space, and remove it when it is stopped. The fact that the
    transmission link now has a guaranteed longer life cycle than any of its
    potential clients paves the way for further code simplifcations
    and optimizations.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The broadcast lock will need to be acquired outside bcast.c in a later
    commit. For this reason, we move the lock to struct tipc_net. Consistent
    with the changes in the previous commit, we also introducee two new
    functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
    currently using tipc_bclink_lock()/unlock() will be phased out during
    the coming commits in this series.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Currently, a number of structure and function definitions related
    to the broadcast functionality are unnecessarily exposed in the file
    bcast.h. This obscures the fact that the external interface towards
    the broadcast link in fact is very narrow, and causes unnecessary
    recompilations of other files when anything changes in those
    definitions.

    In this commit, we move as many of those definitions as is currently
    possible to the file bcast.c.

    We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
    since the name does not correctly describe the contents of this
    struct, and will do so even less in the future, and because we want
    to use the term 'link' more appropriately in the functionality
    introduced later in this series.

    Finally, we rename a couple of functions, such as tipc_bclink_xmit()
    and others that will be kept in the future, to include the term 'bcast'
    instead.

    There are no functional changes in this commit.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

22 Oct, 2015

1 commit

  • The default fix broadcast window size is currently set to 20 packets.
    This is a very low value, set at a time when we were still testing on
    10 Mb/s hubs, and a change to it is long overdue.

    Commit 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
    revealed a problem with this low value. For messages of importance LOW,
    the backlog queue limit will be calculated to 30 packets, while a
    single, maximum sized message of 66000 bytes, carried across a 1500 MTU
    network consists of 46 packets.

    This leads to the following scenario (among others leading to the same
    situation):

    1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
    packets to the backlog queue.
    2: Msg 2 of 46 packets is attempted sent, but rejected because there is
    no more space in the backlog queue at this level. The sender is added
    to the wakeup queue with a "pending packets chain size" number of 46.
    3: Some packets in the transmit queue are acked and released. We try to
    wake up the sender, but the pending size of 46 is bigger than the LOW
    wakeup limit of 30, so this doesn't happen.
    5: Subsequent acks releases all the remaining buffers. Each time we test
    for the wakeup criteria and find that 46 still is larger than 30,
    even after both the transmit and the backlog queues are empty.
    6: The sender is never woken up and given a chance to send its message.
    He is stuck.

    We could now loosen the wakeup criteria (used by link_prepare_wakeup())
    to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
    by ignoring the "pending packets chain size" value altogether, or we can
    just increase the queue limits so that the criteria can be satisfied
    anyway. There are good reasons (potentially multiple waiting senders) to
    not opt for the former solution, so we choose the latter one.

    This commit fixes the problem by giving the broadcast link window a
    default value of 50 packets. We also introduce a new minimum link
    window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
    described situation. Finally, in order to not break any existing users
    which may set the window explicitly, we enforce that the window is set
    to the new minimum value in case the user is trying to set it to
    anything lower.

    Fixes: 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

09 Sep, 2015

1 commit

  • If an attempt to wake up users of broadcast link is made when there is
    no enough place in send queue than it may hang up inside the
    tipc_sk_rcv() function since the loop breaks only after the wake up
    queue becomes empty. This can lead to complete CPU stall with the
    following message generated by RCU:

    INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies
    g=54225 c=54224 q=11465)
    Task dump for CPU 0:
    tpch R running task 0 39949 39948 0x0000000a
    ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
    ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
    0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
    Call Trace:
    [] sched_show_task+0xae/0x120
    [] dump_cpu_task+0x38/0x40
    [] rcu_dump_cpu_stacks+0x90/0xd0
    [] rcu_check_callbacks+0x3eb/0x6e0
    [] ? account_system_time+0x7f/0x170
    [] update_process_times+0x34/0x60
    [] tick_sched_handle.isra.18+0x31/0x40
    [] tick_sched_timer+0x3c/0x70
    [] __run_hrtimer.isra.34+0x3d/0xc0
    [] hrtimer_interrupt+0xc5/0x1e0
    [] ? native_smp_send_reschedule+0x42/0x60
    [] local_apic_timer_interrupt+0x34/0x60
    [] smp_apic_timer_interrupt+0x3c/0x60
    [] apic_timer_interrupt+0x6b/0x70
    [] ? _raw_spin_unlock_irqrestore+0x9/0x10
    [] __wake_up_sync_key+0x4f/0x60
    [] tipc_write_space+0x31/0x40 [tipc]
    [] filter_rcv+0x31f/0x520 [tipc]
    [] ? tipc_sk_lookup+0xc9/0x110 [tipc]
    [] ? _raw_spin_lock_bh+0x19/0x30
    [] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
    [] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
    [] tipc_node_unlock+0x186/0x190 [tipc]
    [] ? kfree_skb+0x2c/0x40
    [] tipc_rcv+0x2ac/0x8c0 [tipc]
    [] tipc_l2_rcv_msg+0x38/0x50 [tipc]
    [] __netif_receive_skb_core+0x5a3/0x950
    [] __netif_receive_skb+0x13/0x60
    [] netif_receive_skb_internal+0x1e/0x90
    [] napi_gro_receive+0x78/0xa0
    [] tg3_poll_work+0xc54/0xf40 [tg3]
    [] ? consume_skb+0x2c/0x40
    [] tg3_poll_msix+0x41/0x160 [tg3]
    [] net_rx_action+0xe2/0x290
    [] __do_softirq+0xda/0x1f0
    [] irq_exit+0x76/0xa0
    [] do_IRQ+0x55/0xf0
    [] common_interrupt+0x6b/0x6b

    The issue occurs only when tipc_sk_rcv() is used to wake up postponed
    senders:

    tipc_bclink_wakeup_users()
    // wakeupq - is a queue which consists of special
    // messages with SOCK_WAKEUP type.
    tipc_sk_rcv(wakeupq)
    ...
    while (skb_queue_len(inputq)) {
    filter_rcv(skb)
    // Here the type of message is checked
    // and if it is SOCK_WAKEUP then
    // it tries to wake up a sender.
    tipc_write_space(sk)
    wake_up_interruptible_sync_poll()
    }

    After the sender thread is woke up it can gather control and perform
    an attempt to send a message. But if there is no enough place in send
    queue it will call link_schedule_user() function which puts a message
    of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
    Thus the size of the queue actually is not changed and the while()
    loop never exits.

    The approach I proposed is to wake up only senders for which there is
    enough place in send queue so the described issue can't occur.
    Moreover the same approach is already used to wake up senders on
    unicast links.

    I have got into the issue on our product code but to reproduce the
    issue I changed a benchmark test application (from
    tipcutils/demos/benchmark) to perform the following scenario:
    1. Run 64 instances of test application (nodes). It can be done
    on the one physical machine.
    2. Each application connects to all other using TIPC sockets in
    RDM mode.
    3. When setup is done all nodes start simultaneously send
    broadcast messages.
    4. Everything hangs up.

    The issue is reproducible only when a congestion on broadcast link
    occurs. For example, when there are only 8 nodes it works fine since
    congestion doesn't occur. Send queue limit is 40 in my case (I use a
    critical importance level) and when 64 nodes send a message at the
    same moment a congestion occurs every time.

    Signed-off-by: Dmitry S Kolmakov
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Kolmakov Dmitriy
     

21 Jul, 2015

3 commits

  • We convert packet/message reception according to the same principle
    we have been using for message sending and timeout handling:

    We move the function tipc_rcv() to node.c, hence handling the initial
    packet reception at the link aggregation level. The function grabs
    the node lock, selects the receiving link, and accesses it via a new
    call tipc_link_rcv(). This function appends buffers to the input
    queue for delivery upwards, but it may also append outgoing packets
    to the xmit queue, just as we do during regular message sending. The
    latter will happen when buffers are forwarded from the link backlog,
    or when retransmission is requested.

    Upon return of this function, and after having released the node lock,
    tipc_rcv() delivers/tranmsits the contents of those queues, but it may
    also perform actions such as link activation or reset, as indicated by
    the return flags from the link.

    This reduces the number of cpu cycles spent inside the node spinlock,
    and reduces contention on that lock.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The status flag LINK_STOPPED is not needed any more, since the
    mechanism for delayed deletion of links has been removed.
    Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
    because we can just as well start the link timer directly from
    inside tipc_link_create().

    We eliminate these flags in this commit.

    Instead of the above flags, we now introduce three new link modes,
    TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
    indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
    the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
    blocks timer-driven protocol messages to be sent out, and any change
    to the link FSM. Since the modes are mutually exclusive, we convert
    them to state values, and rename the 'flags' field in struct tipc_link
    to 'exec_mode'.

    Finally, we move the #defines for link FSM states and events from link.h
    into enums inside the file link.c, which is the real usage scope of
    these definitions.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When the function tipc_link_xmit() is given a buffer list for
    transmission, it currently consumes the list both when transmission
    is successful and when it fails, except for the special case when
    it encounters link congestion.

    This behavior is inconsistent, and needs to be corrected if we want
    to avoid problems in later commits in this series.

    In this commit, we change this to let the function consume the list
    only when transmission is successful, and leave the list with the
    sender in all other cases. We also modifiy the socket code so that
    it adapts to this change, i.e., purges the list when a non-congestion
    error code is returned.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy