17 Oct, 2020

2 commits

  • In commit 16ad3f4022bb
    ("tipc: introduce variable window congestion control"), we applied
    the algorithm to select the window size between the minimum window and
    the configured maximum window for unicast links. We also chose to keep
    the window size for the broadcast link unchanged and equal (i.e. a
    fixed window of 50).

    However, when the maximum window was set via command, the window
    variable was re-initialized to an unexpected value (i.e. 32).

    We fix this by updating the fixed window for broadcast as stated above.

    Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Huu Le
    Signed-off-by: Jakub Kicinski

    Hoang Huu Le
     
  • The queue limit of the broadcast link is calculated based on the initial
    MTU. However, when the MTU value changes (e.g. manually changing the MTU
    on the NIC device, MTU negotiation, etc.), we do not re-calculate the
    queue limit. As a result, throughput does not reflect the change.

    So fix it by calling the function to re-calculate the queue limit of the
    broadcast link.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Huu Le
    Signed-off-by: Jakub Kicinski

    Hoang Huu Le
     

17 Jun, 2020

1 commit

  • Currently, binding table updates (adding a service binding to the
    name table or withdrawing a service binding) are sent over replicast.
    However, if we scale up clusters to > 100 nodes/containers, this
    method is less efficient because it loops through the nodes in a
    cluster one by one.

    It is worthwhile to use broadcast to update a service binding. This way,
    the binding table can be updated on all peer nodes in one shot.

    Broadcast is used when all peer nodes, as indicated by a new capability
    flag TIPC_NAMED_BCAST, support reception of this message type.

    Four problems need to be considered when introducing this feature.
    1) When establishing a link to a new peer node we still update this by a
    unicast 'bulk' update. This may lead to race conditions, where a later
    broadcast publication/withdrawal bypasses the 'bulk', resulting in
    disordered publications, or even that a withdrawal may arrive before the
    corresponding publication. We solve this by adding an 'is_last_bulk' bit
    in the last bulk message so that it can be distinguished from all other
    messages. Only when this message has arrived do we open up for reception
    of broadcast publications/withdrawals.

    2) When a first legacy node is added to the cluster all distribution
    will switch over to use the legacy 'replicast' method, while the
    opposite happens when the last legacy node leaves the cluster. This
    entails another risk of message disordering that has to be handled. We
    solve this by adding a sequence number to the broadcast/replicast
    messages, so that disordering can be discovered and corrected. Note
    however that we don't need to consider potential message loss or
    duplication at this protocol level.

    3) Bulk messages don't contain any sequence numbers, and will always
    arrive in order. Hence we must exempt those from the sequence number
    control and deliver them unconditionally. We solve this by adding a new
    'is_bulk' bit in those messages so that they can be recognized.

    4) Legacy messages, which don't contain any new bits or sequence
    numbers, but neither can arrive out of order, also need to be exempt
    from the initial synchronization and sequence number check, and
    delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
    to all new messages so that those can be distinguished from legacy
    messages and the latter delivered directly.
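
    The four rules above can be condensed into one receive-side decision.
    The following is a minimal Python model, not kernel code. The message
    fields mirror the bits named above; 'PeerState', 'accept()', the
    'hold'/'drop' outcomes and the use of 'next_seqno' as the synch point
    are assumptions made for illustration, and sequence number wrap-around
    is ignored.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    seqno: int = 0
    is_not_legacy: bool = False
    is_bulk: bool = False
    is_last_bulk: bool = False


@dataclass
class PeerState:
    bulk_done: bool = False  # set once the 'is_last_bulk' message has arrived
    next_seqno: int = 0      # next expected broadcast/replicast sequence number


def accept(state, msg):
    """Return 'deliver', 'hold' or 'drop' for one incoming name table update."""
    if not msg.is_not_legacy:
        return "deliver"            # rule 4: legacy messages bypass all checks
    if msg.is_bulk:
        if msg.is_last_bulk:
            state.bulk_done = True  # rule 1: open up for broadcast updates
        return "deliver"            # rule 3: bulk is exempt from seqno control
    if not state.bulk_done:
        return "hold"               # rule 1: wait until the bulk is complete
    if msg.seqno < state.next_seqno:
        return "drop"               # below the synch point (v2 sanity check)
    if msg.seqno > state.next_seqno:
        return "hold"               # rule 2: out of order, wait for the gap
    state.next_seqno += 1
    return "deliver"
```

    In this model a held message would be retried after each delivery; the
    queue that stores held messages is left out to keep the sketch short.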

    v1->v2:
    - fix warning issue reported by kbuild test robot
    - add sanity check to drop the publication message with a sequence
    number that is lower than the agreed synch point

    Signed-off-by: kernel test robot
    Signed-off-by: Hoang Huu Le
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Huu Le
     

27 May, 2020

3 commits

  • This commit enables dumping the statistics of a broadcast-receiver link
    like the traditional 'broadcast-link' one (which is for broadcast-
    sender). The link dumping can be triggered via netlink (e.g. the
    iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
    indicator.

    The name of a broadcast-receiver link of a specific peer will be in the
    format: 'broadcast-link:'.

    For example:

    Link
    Window:50 packets
    RX packets:7841 fragments:2408/440 bundles:0/0
    TX packets:0 fragments:0/0 bundles:0/0
    RX naks:0 defs:124 dups:0
    TX naks:21 acks:0 retrans:0
    Congestion link:0 Send queue max:0 avg:0

    In addition, the broadcast-receiver link statistics can be reset in the
    usual way via netlink by specifying that link name in command.

    Note: the 'tipc_link_name_ext()' is removed because the link name can
    now be retrieved simply via the 'l->name'.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • In some environments, broadcast traffic is suppressed at high rates (i.e.
    by a kind of bandwidth limit setting). When this is applied, TIPC
    broadcast can still run successfully. However, under high load, some
    packets will be dropped first and TIPC tries to retransmit them, but the
    retransmissions are intentionally broadcast too, which makes things
    worse and is not helpful at all.

    This commit enables broadcast retransmission via unicast, which only
    retransmits packets to the specific peer that has actually reported a
    gap, i.e. without broadcasting to all nodes in the cluster. This prevents
    the retransmissions from being suppressed, reduces the overhead on the
    other peers due to duplicates, and improves the overall TIPC broadcast
    performance.

    Note: the functionality can be turned on/off via the sysctl file:

    echo 1 > /proc/sys/net/tipc/bc_retruni
    echo 0 > /proc/sys/net/tipc/bc_retruni

    Default is '0', i.e. the broadcast retransmission still works as usual.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • As achieved through commit 9195948fbf34 ("tipc: improve TIPC throughput
    by Gap ACK blocks"), we apply the same mechanism for the broadcast link
    as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
    consist of two parts built for both the broadcast and unicast types:

     31                       16 15                        0
    +-------------+-------------+-------------+-------------+
    |  bgack_cnt  |  ugack_cnt  |            len            |
    +-------------+-------------+-------------+-------------+  -
    |            gap            |            ack            |   |
    +-------------+-------------+-------------+-------------+    > bc gacks
    :                           :                           :   |
    +-------------+-------------+-------------+-------------+  -
    |            gap            |            ack            |   |
    +-------------+-------------+-------------+-------------+    > uc gacks
    :                           :                           :   |
    +-------------+-------------+-------------+-------------+  -

    which is "automatically" backward-compatible.

    We also increase the max number of Gap ACK blocks to 128, allowing up to
    64 blocks per type (total buffer size = 516 bytes).
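
    The 516-byte figure follows from a 4-byte header plus 128 blocks of
    4 bytes each. The Python sketch below packs the field in the order the
    diagram is drawn (bits 31..0, left to right); that byte order, the
    meaning of 'len' as the total size in bytes, and all names are
    assumptions for illustration, not the kernel's definitions.

```python
import struct

MAX_BLOCKS = 128    # total Gap ACK blocks, per the commit message
MAX_PER_TYPE = 64   # broadcast or unicast blocks, per the commit message
HEADER = struct.Struct("!BBH")  # bgack_cnt, ugack_cnt, len (as drawn)
BLOCK = struct.Struct("!HH")    # gap, ack (as drawn)

# 4-byte header + 128 blocks of 4 bytes each
MAX_SIZE = HEADER.size + MAX_BLOCKS * BLOCK.size


def build_gap_ack_field(bc_gacks, uc_gacks):
    """Pack the header, then the broadcast blocks, then the unicast blocks."""
    if len(bc_gacks) > MAX_PER_TYPE or len(uc_gacks) > MAX_PER_TYPE:
        raise ValueError("too many Gap ACK blocks for one type")
    count = len(bc_gacks) + len(uc_gacks)
    total = HEADER.size + BLOCK.size * count
    out = HEADER.pack(len(bc_gacks), len(uc_gacks), total)
    for gap, ack in list(bc_gacks) + list(uc_gacks):
        out += BLOCK.pack(gap, ack)
    return out
```

    A receiver that only knows the unicast layout can still skip the field
    using 'len', which is one way to read the backward-compatibility remark.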

    Besides, the 'tipc_link_advance_transmq()' function is refactored so that
    it is now applicable to both the unicast and broadcast cases; some old
    functions can therefore be removed and the code is optimized.

    With the patch, TIPC broadcast is more robust regardless of packet loss,
    disorder, latency, etc. in the underlying network. Its performance is
    boosted significantly.
    For example, an experiment with a 5% packet loss rate gives:

    $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
    real 0m 42.46s
    user 0m 1.16s
    sys 0m 17.67s

    Without the patch:

    $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
    real 8m 27.94s
    user 0m 0.55s
    sys 0m 2.38s

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

23 Dec, 2019

1 commit


11 Dec, 2019

2 commits

  • In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast"), we allow instant switching between replicast and broadcast
    by sending a dummy 'SYN' packet on the last used link to synchronize
    packets on the links. The 'SYN' message is also subject to link
    congestion, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be
    sent back to the socket.
    However, in that commit, we simply use the same socket 'cong_link_cnt'
    counter for both the 'SYN' and normal payload message sending. Therefore,
    if both the replicast and broadcast links are congested, the counter is
    not updated correctly but overwritten by the latter congestion.
    Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is
    reduced one by one and eventually overflows. Consequently, further
    activities on the socket will only wait for the false congestion signal
    to disappear, which never happens.

    Because sending the 'SYN' message is vital for the mechanism, it should
    be done anyway. This commit fixes the issue by marking the message with
    an error code, e.g. 'TIPC_ERR_NO_PORT', so that its sending does not face
    link congestion and there is no need to touch the socket 'cong_link_cnt'
    either. In addition, in the event of any error (e.g. -ENOBUFS), we purge
    the entire payload message queue and return immediately.

    Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • We introduce a simple variable window congestion control for links.
    The algorithm is inspired by the Reno algorithm, covering the 'slow
    start', 'congestion avoidance', and 'fast recovery' modes.

    - We introduce hard lower and upper window limits per link, still
    different and configurable per bearer type.

    - We introduce a 'slow start threshold' variable, initially set to
    the maximum window size.

    - We let a link start at the minimum congestion window, i.e. in slow
    start mode, and then let it grow rapidly (+1 per received ACK) until
    it reaches the slow start threshold and enters congestion avoidance
    mode.

    - In congestion avoidance mode we increment the congestion window for
    each window-size number of acked packets, up to a possible maximum
    equal to the configured maximum window.

    - For each non-duplicate NACK received, we drop back to fast recovery
    mode, by setting both the slow start threshold and the congestion
    window to (current_congestion_window / 2).

    - If the timeout handler finds that the transmit queue has not moved
    since the previous timeout, it drops the link back to slow start
    and forces a probe containing the last sent sequence number to be
    sent to the peer, so that this can discover the stale situation.

    This change does in reality have effect only on unicast ethernet
    transport, as we have seen that there is no room whatsoever for
    increasing the window max size for the UDP bearer.
    For now, we also choose to keep the limits for the broadcast link
    unchanged and equal.

    This algorithm seems to give a 50-100% throughput improvement for
    messages larger than MTU.
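
    The window rules listed above can be modelled in a few lines. This is a
    Python sketch of the description only, not the kernel implementation:
    the class and method names are hypothetical, clamping the halved window
    at the minimum window is an assumption, and the slow start threshold is
    left untouched on a stale timeout because the text does not say
    otherwise.

```python
class LinkWindow:
    def __init__(self, min_win, max_win):
        self.min_win = min_win
        self.max_win = max_win
        self.window = min_win    # a link starts in slow start
        self.ssthresh = max_win  # initially the maximum window size
        self.acked = 0           # acked packets seen in congestion avoidance

    def on_ack(self, packets=1):
        for _ in range(packets):
            if self.window >= self.max_win:
                return                         # never exceed the configured max
            if self.window < self.ssthresh:
                self.window += 1               # slow start: +1 per received ACK
            else:
                self.acked += 1
                if self.acked >= self.window:  # one full window has been acked
                    self.acked = 0
                    self.window += 1           # congestion avoidance: +1

    def on_nack(self):
        """Non-duplicate NACK: halve both threshold and window."""
        half = max(self.window // 2, self.min_win)  # clamp is an assumption
        self.ssthresh = half
        self.window = half
        self.acked = 0

    def on_stale_timeout(self):
        """Transmit queue did not move since the last timeout: slow start."""
        self.window = self.min_win
        self.acked = 0
```

    With min_win=4 and max_win=8, four ACKs take the window from 4 to 8, a
    NACK drops it back to 4, and from there it takes four more ACKs to
    reach 5.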

    Suggested-by: Xin Long
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

23 Nov, 2019

1 commit

  • When setting up a cluster that mixes nodes with and without replicast
    capability support, this capability is disabled for the broadcast send
    link in order to be backwards compatible.

    However, when the non-supporting nodes leave and are removed from the
    cluster, we don't update this capability on the broadcast send link.
    Some features that are based on this capability then remain
    unexpectedly disabled.

    In this commit, we make sure the broadcast send link capabilities are
    re-calculated as soon as a node is removed from or rejoins a cluster.
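
    One way to picture the re-calculation is as the intersection of the
    capability flags of all current nodes, recomputed whenever membership
    changes. The Python sketch below assumes exactly that; the flag values
    and function name are hypothetical and do not come from the kernel
    source.

```python
from functools import reduce

# Hypothetical capability bits, for illustration only.
CAP_BCAST_RCAST = 0x1
CAP_OTHER = 0x2


def common_capabilities(own_caps, peer_caps):
    """Bitwise AND of our own flags with the flags of every current peer."""
    return reduce(lambda acc, caps: acc & caps, peer_caps, own_caps)
```

    Calling this again after a legacy peer has left yields the full flag
    set once more, which is the behaviour the fix restores.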

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

09 Nov, 2019

1 commit

  • This commit offers an option to encrypt and authenticate all messaging,
    including the neighbor discovery messages. The currently most advanced
    algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
    encryption/decryption is done at the bearer layer, just before leaving
    or after entering TIPC.

    Supported features:
    - Encryption & authentication of all TIPC messages (header + data);
    - Two symmetric-key modes: Cluster and Per-node;
    - Automatic key switching;
    - Key-expired revoking (sequence number wrapped);
    - Lock-free encryption/decryption (RCU);
    - Asynchronous crypto, Intel AES-NI supported;
    - Multiple cipher transforms;
    - Logs & statistics;

    Two key modes:
    - Cluster key mode: One single key is used for both TX & RX in all
    nodes in the cluster.
    - Per-node key mode: Each node in the cluster has one specific TX key.
    For RX, a node requires its peers' TX key to be able to decrypt the
    messages from those peers.

    Key setting from user-space is performed via netlink by a user program
    (e.g. the iproute2 'tipc' tool).

    Internal key state machine:

                                 Attach    Align(RX)
                                     +-+   +-+
                                     | V   | V
        +---------+      Attach     +---------+
        |  IDLE   |---------------->| PENDING |(user = 0)
        +---------+                 +---------+
           A   A                   Switch|  A
           |   |                         |  |
           |   | Free(switch/revoked)    |  |
     (Free)|   +----------------------+  |  |Timeout
           |              (TX)        |  |  |(RX)
           |                          |  |  |
           |                          |  v  |
        +---------+      Switch     +---------+
        | PASSIVE |<----------------| ACTIVE  |
        +---------+       (RX)      +---------+
        (user = 1)                  (user >= 1)

    The number of TFMs is 10 by default and can be changed via the procfs
    'net/tipc/max_tfms'. At this moment, as for simplicity, this file is
    also used to print the crypto statistics at runtime:

    echo 0xfff1 > /proc/sys/net/tipc/max_tfms

    The patch defines a new TIPC version (v7) for the encryption message
    (with backward compatibility as well). The message is basically
    encapsulated as follows:

    +----------------------------------------------------------+
    | TIPCv7 encryption  | Original TIPCv2    | Authentication |
    | header             | packet (encrypted) | Tag            |
    +----------------------------------------------------------+

    The throughput is about 40% for small messages (compared with
    non-encryption) and about 9% for large messages. With support from
    hardware crypto, i.e. the Intel AES-NI CPU instructions, the throughput
    increases up to ~85% for small messages and ~55% for large messages.

    By default, the new feature is inactive (i.e. no encryption) until user
    sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
    in the kernel configuration to enable/disable the new code when needed.

    MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

19 Aug, 2019

1 commit

  • The policy for handling the skb list locks on the send and receive paths
    is simple.

    - On the send path we never need to grab the lock on the 'xmitq' list
    when the destination is an external node.

    - On the receive path we always need to grab the lock on the 'inputq'
    list, irrespective of source node.

    However, when transmitting node local messages those will eventually
    end up on the receive path of a local socket, meaning that the argument
    'xmitq' in tipc_node_xmit() will become the 'inputq' argument in the
    function tipc_sk_rcv(). This has been handled by always initializing
    the spinlock of the 'xmitq' list at message creation, just in case it
    may end up on the receive path later, and despite knowing that the lock
    in most cases never will be used.

    This approach is inaccurate and confusing, and has also concealed the
    fact that the stated 'no lock grabbing' policy for the send path is
    violated in some cases.

    We now clean up this by never initializing the lock at message creation,
    instead doing this at the moment we find that the message actually will
    enter the receive path. At the same time we fix the four locations
    where we incorrectly access the spinlock on the send/error path.

    This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
    is initialised") which has now become redundant.

    CC: Eric Dumazet
    Reported-by: Chris Packham
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Reviewed-by: Xin Long
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Aug, 2019

1 commit

  • Since node internal messages are passed directly to the socket, it is not
    possible to observe those messages via tcpdump or wireshark.

    We now remedy this by making it possible to clone such messages and send
    the clones to the loopback interface. The clones are dropped at reception
    and have no functional role except making the traffic visible.

    The feature is enabled if network taps are active for the loopback device.
    pcap filtering restrictions require the messages to be presented to the
    receiving side of the loopback device.

    v3 - Function dev_nit_active used to check for network taps.
    - Procedure netif_rx_ni used to send cloned messages to loopback device.

    Signed-off-by: John Rutherford
    Acked-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    John Rutherford
     

26 Jun, 2019

1 commit


05 Apr, 2019

1 commit

  • If the skb has somehow been dequeued from inputq before processing, a
    NULL pointer dereference occurs and the kernel crashes.

    Add a check that the skb is valid before using it.

    Fixes: c55c8edafa9 ("tipc: smooth change between replicast and broadcast")
    Reported-by: Tuong Lien Tong
    Acked-by: Ying Xue
    Signed-off-by: Hoang Le
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Le
     

27 Mar, 2019

1 commit


22 Mar, 2019

1 commit

  • In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast") we introduced a new method to eliminate the risk of message
    reordering that can happen between different nodes.
    Unfortunately, we forgot to add a check on the receiving side to ignore
    intra-node messages.

    We fix this by checking for, and returning early on, messages that
    arrive from within the same node.

    syzbot report:

    ==================================================================
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
    Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
    24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
    3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
    RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
    RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
    RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
    R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
    R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
    FS: 000000000106a880(0000) GS:ffff8880ae800000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
    tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
    tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
    tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
    tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
    tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
    __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
    tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
    sock_sendmsg_nosec net/socket.c:651 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:661
    ___sys_sendmsg+0x806/0x930 net/socket.c:2260
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
    __do_sys_sendmsg net/socket.c:2307 [inline]
    __se_sys_sendmsg net/socket.c:2305 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4401c9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
    48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
    3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
    RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
    R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace ba79875754e1708f ]---

    Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
    Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

20 Mar, 2019

2 commits

  • Currently, a multicast stream may start out using replicast, because
    there are few destinations, and then it should ideally switch to
    L2/broadcast IGMP/multicast when the number of destinations grows beyond
    a certain limit. The opposite should happen when the number decreases
    below the limit.

    To eliminate the risk of message reordering caused by a method change,
    a sending socket must stick to a previously selected method until it
    enters an idle period of 5 seconds. This means there must be a 5 second
    pause in the traffic from the sender socket.

    If the sender never makes such a pause, the method will never change,
    and transmission may become very inefficient as the cluster grows.

    With this commit, we allow such a switch between replicast and
    broadcast without any need for a traffic pause.

    The solution is to send a dummy message containing only the header, with
    the SYN bit set, via broadcast or replicast. The data message also has
    the SYN bit set and is sent via replicast or broadcast (the inverse of
    the method used for the dummy).

    Then, on the receiving side, any messages that follow the first message
    with the SYN bit set (data or dummy) are held in the deferred queue
    until the other message of the pair (dummy or data) has arrived on the
    other link.

    v2: reverse christmas tree declaration

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     
  • Currently, a multicast stream uses either broadcast or replicast as
    transmission method, based on the ratio between the number of actual
    destination nodes and the cluster size.

    However, when an L2 interface (e.g., VXLAN) provides pseudo
    broadcast support, this becomes very inefficient, as it blindly
    replicates multicast packets to all cluster/subnet nodes,
    irrespective of whether they host actual target sockets or not.

    The TIPC multicast algorithm is able to distinguish real destination
    nodes from other nodes, and hence provides a smarter and more
    efficient method for transferring multicast messages than
    pseudo broadcast can do.

    Because of this, we now make it possible for users to force
    the broadcast link to permanently switch to using replicast,
    irrespective of which capabilities the bearer provides,
    or pretend to provide.
    Conversely, we also make it possible to force the broadcast link
    to always use true broadcast. While maybe less useful in
    deployed systems, this may at least be useful for testing the
    broadcast algorithm in small clusters.

    We retain the current AUTOSELECT ability, i.e., to let the broadcast link
    automatically select which algorithm to use, and to switch back and forth
    between broadcast and replicast as the ratio between destination
    node number and cluster size changes. This remains the default method.

    Furthermore, we make it possible to configure the threshold ratio for
    such switches. The default ratio is now set to 10%, down from 25% in the
    earlier implementation.
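
    The three modes and the configurable ratio can be summarised as a small
    decision function. This Python sketch is an illustration only: the mode
    names, the function name and the inclusive comparison at the threshold
    are assumptions, and only the 10% default comes from the text above.

```python
def select_method(dests, cluster_size, mode="autoselect", ratio_percent=10):
    """Pick 'broadcast' or 'replicast' for one multicast transmission."""
    if mode in ("broadcast", "replicast"):
        return mode  # forced by the user, regardless of bearer capabilities
    if mode != "autoselect":
        raise ValueError("unknown mode: %r" % (mode,))
    # Integer form of: dests / cluster_size <= ratio_percent / 100
    if dests * 100 <= cluster_size * ratio_percent:
        return "replicast"
    return "broadcast"
```

    In a 100-node cluster with the default ratio, a message for 5
    destination nodes would go out as replicast, while one for 11 nodes
    would use broadcast.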

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

04 Sep, 2018

1 commit


28 Jul, 2018

1 commit


08 Mar, 2018

1 commit


02 Dec, 2017

1 commit

  • When sending node local messages the code is using an 'mtu' of 66060
    bytes to avoid unnecessary fragmentation. During situations of low
    memory tipc_msg_build() may sometimes fail to allocate such large
    buffers, resulting in unnecessary send failures. This can easily be
    remedied by falling back to a smaller MTU, and then reassembling the
    buffer chain as if the message were arriving from a remote node.
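
    The fallback idea can be sketched as "try one large buffer, and on
    allocation failure build a chain of smaller ones". The Python model
    below is an illustration under stated assumptions: 'alloc' stands in for
    a buffer allocator that may raise MemoryError, message headers are
    ignored, and the function names and the fallback size passed by the
    caller are hypothetical.

```python
def build_chain(data, mtu, alloc):
    """Split data into buffers of at most mtu bytes using alloc()."""
    return [alloc(data[i:i + mtu]) for i in range(0, len(data), mtu)]


def build_local_msg(data, alloc, fallback_mtu, big_mtu=66060):
    """Prefer a single large buffer; fall back to smaller fragments."""
    try:
        return build_chain(data, big_mtu, alloc)
    except MemoryError:
        # Low memory: fragment as if the message came from a remote node.
        return build_chain(data, fallback_mtu, alloc)
```

    The receiver then reassembles the chain exactly as it would for
    fragments arriving over a real link.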

    At the same time, we change the initial MTU setting of the broadcast
    link to a lower value, so that large messages always are fragmented
    into smaller buffers even when we run in single node mode. Apart from
    obtaining the same advantage as for the 'fallback' solution above, this
    turns out to give a significant performance improvement. This can
    probably be explained with the __pskb_copy() operation performed on the
    buffer for each recipient during reception. We found the optimal value
    for this, considering the most relevant skb pool, to be 3744 bytes.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

13 Oct, 2017

1 commit

  • We often see a need for a linked list of destination identities,
    sometimes containing a port number, sometimes a node identity, and
    sometimes both. The currently defined struct u32_list is not generic
    enough to cover all cases, so we extend it to contain two u32 integers
    and rename it to struct tipc_dest_list.

    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Oct, 2017

1 commit

  • We change the initialization of the skb transmit buffer queues
    in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
    initialize their spinlocks. This is needed because we may, during
    error conditions, need to call skb_queue_purge() on those queues
    further down the stack.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

21 Jan, 2017

4 commits

  • If the bearer carrying multicast messages supports broadcast, those
    messages will be sent to all cluster nodes, irrespective of whether
    these nodes host any actual destination sockets or not. This is clearly
    wasteful if the cluster is large and there are only a few real
    destinations for the message being sent.

    In this commit we extend the eligibility of the newly introduced
    "replicast" transmit option. We now make it possible for a user to
    select which method he wants to be used, either as a mandatory setting
    via setsockopt(), or as a relative setting where we let the broadcast
    layer decide which method to use based on the ratio between cluster
    size and the message's actual number of destination nodes.

    In the latter case, a sending socket must stick to a previously
    selected method until it enters an idle period of at least 5 seconds.
    This eliminates the risk of message reordering caused by method change,
    i.e., when changes to cluster size or number of destinations would
    otherwise mandate a new method to be used.
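
    The 'stick to the method until idle' rule can be modelled as follows.
    This is a minimal Python sketch, not the kernel code: the class, its
    fields and the explicit 'now' argument (in seconds) are assumptions
    chosen to keep the example deterministic.

```python
METHOD_EXPIRE = 5  # idle seconds required before the method may change


class McastMethod:
    def __init__(self, rcast=False, mandatory=False):
        self.rcast = rcast          # True: replicast, False: broadcast
        self.mandatory = mandatory  # set via setsockopt(): never auto-change
        self.last_send = None       # time of the previous transmission

    def update(self, want_rcast, now):
        """Return the method to use for a send at time 'now'."""
        idle = self.last_send is None or now - self.last_send >= METHOD_EXPIRE
        if idle and not self.mandatory:
            self.rcast = want_rcast  # safe: no earlier traffic to overtake
        self.last_send = now
        return self.rcast
```

    A socket that keeps sending at least once every 5 seconds therefore
    never changes method in this model, even if the preferred method flips.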

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC multicast messages are currently carried over a reliable
    'broadcast link', making use of the underlying media's ability to
    transport packets as L2 broadcast or IP multicast to all nodes in
    the cluster.

    When the used bearer is lacking that ability, we can instead emulate
    the broadcast service by replicating and sending the packets over as
    many unicast links as needed to reach all identified destinations.
    We now introduce a new TIPC link-level 'replicast' service that does
    this.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a further preparation for the upcoming 'replicast' functionality,
    we add some necessary structs and functions for looking up and returning
    a list of all nodes that host destinations for a given multicast message.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a preparation for the 'replicast' functionality we are going to
    introduce in the next commits, we need the broadcast base structure to
    store whether bearer broadcast is available at all from the currently
    used bearer or bearers.

    We do this by adding a new function tipc_bearer_bcast_support() to
    the bearer layer, and letting the bearer selection function in
    bcast.c use this to give a new boolean field, 'bcast_support' the
    appropriate value.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

04 Jan, 2017

1 commit

  • The socket code currently handles link congestion by either blocking
    and trying to send again when the congestion has abated, or just
    returning to the user with -EAGAIN and letting him retry later.

    This mechanism is prone to starvation, because the wakeup algorithm is
    non-atomic. During the time the link issues a wakeup signal, until the
    socket wakes up and re-attempts sending, other senders may have come
    in between and occupied the free buffer space in the link. This in turn
    may lead to a socket having to make many send attempts before it is
    successful. In extremely loaded systems we have observed latency times
    of several seconds before a low-priority socket is able to send out a
    message.

    In this commit, we simplify this mechanism and reduce the risk of the
    described scenario happening. When an attempt is made to send a message
    via a congested link, we now let it be added to the link's backlog queue
    anyway, thus permitting an oversubscription of one message per source
    socket. We still create a wakeup item and return an error code, hence
    instructing the sender to block or stop sending. Only when enough space
    has been freed up in the link's backlog queue do we issue a wakeup event
    that allows the sender to continue with the next message, if any.
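
    The new behaviour can be modelled as a queue that accepts the message
    even when congested, but tells the sender to pause. The Python sketch
    below is an illustration only; the class, its fields and the boolean
    return value (standing in for the congestion error code) are
    assumptions, and it relies on a congested sender actually pausing.

```python
class Link:
    def __init__(self, backlog_limit):
        self.backlog_limit = backlog_limit
        self.backlog = []
        self.wakeups = []  # sockets waiting for a wakeup event

    def xmit(self, sock, msg):
        """Queue msg; return False (congested) if the sender must pause."""
        congested = len(self.backlog) >= self.backlog_limit
        self.backlog.append(msg)       # accepted even when congested
        if congested:
            self.wakeups.append(sock)  # sender blocks until it is woken up
        return not congested

    def release(self, count):
        """Drop 'count' sent messages; wake senders once space is free."""
        del self.backlog[:count]
        woken = []
        if len(self.backlog) < self.backlog_limit:
            woken, self.wakeups = self.wakeups, []
        return woken
```

    Because the message that triggered the congestion is already queued,
    the woken socket continues with its next message instead of competing
    again to resend the same one.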

    The fact that a socket now can consider a message sent even when the
    link returns a congestion code means that the sending socket code can
    be simplified. Also, since this is a good opportunity to get rid of the
    obsolete 'mtu change' condition in the three socket send functions, we
    now choose to refactor those functions completely.

    Signed-off-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

30 Oct, 2016

1 commit

  • In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
    criteria") we tried to fix a problem with the initial synchronization
    of broadcast link acknowledge values. Unfortunately that solution is
    not sufficient to solve the issue.

    We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
    non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
    initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
    broadcast acknowledge numbers, leading to premature opening of the
    broadcast link. When the bypassed packets finally arrive, they are
    inadvertently accepted, and the already correctly initialized
    acknowledge number in the broadcast receive link is overwritten by
    the invalid (zero) value of the said packets. After this the broadcast
    link goes stale.

    We now fix this by marking the packets where we know the acknowledge
    value is or may be invalid, and then ignoring the acks from those.

    For this purpose, we claim an unused bit in the header to indicate that
    the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
    synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
    packets, plus those LINK_PROTOCOL packets sent out before the broadcast
    links are fully synchronized.

    This minor protocol update is fully backwards compatible.
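
A minimal sketch of the receive-side check, assuming a simplified header layout. The struct names, the flag name TOY_BC_ACK_INVALID and its bit position are hypothetical and chosen only for illustration; the commit message does not say which header bit is claimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BC_ACK_INVALID (1u << 0) /* hypothetical bit position */

/* Simplified stand-ins for a message header and a broadcast receive link. */
struct toy_hdr {
    uint32_t flags;
    uint16_t bc_ack;
};

struct toy_bc_rcvlink {
    uint16_t acked; /* last accepted broadcast acknowledge value */
};

/* Returns true if the acknowledge value was accepted, false if ignored. */
static bool toy_bc_ack_rcv(struct toy_bc_rcvlink *l, const struct toy_hdr *h)
{
    assert(l && h);
    if (h->flags & TOY_BC_ACK_INVALID)
        return false;   /* sender marked the ack as invalid: skip it */
    l->acked = h->bc_ack;
    return true;
}
```

A delayed packet carrying the invalid marker and a zero acknowledge value then leaves an already initialized receive link untouched, which is the failure mode the commit sets out to prevent.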

    Reported-by: John Thompson
    Tested-by: John Thompson
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

03 Sep, 2016

1 commit

  • When we send broadcasts in clusters of more than 70-80 nodes, we sometimes
    see the broadcast link resetting because of an excessive number of
    retransmissions. This is caused by a combination of two factors:

    1) A "NACK crunch", where loss of broadcast packets is discovered
    and NACK'ed by several nodes simultaneously, leading to multiple
    redundant broadcast retransmissions.

    2) The fact that the NACKs themselves are also sent as broadcast, leading
    to excessive load and packet loss on the transmitting switch/bridge.

    This commit deals with the latter problem, by moving sending of
    broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
    to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
    bits in word 8 of the said message for this purpose, and introduce a
    new capability bit, TIPC_BCAST_STATE_NACK, in order to keep the change
    backwards compatible.
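
Carrying a value in 10 bits of a 32-bit header word amounts to ordinary mask-and-shift arithmetic, sketched below. The field width of 10 bits comes from the commit message; the position of the field within the word (TOY_NACK_SHIFT) and the helper names are assumptions made for the example. The sketch also works on a host-order word, whereas a real header would need byte-order conversion.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NACK_SHIFT 0u     /* hypothetical position within the word */
#define TOY_NACK_MASK  0x3ffu /* 10 bits, as stated in the commit message */

/* Store val in the 10-bit field, leaving all other bits of word intact. */
static uint32_t toy_set_nack(uint32_t word, uint32_t val)
{
    assert(val <= TOY_NACK_MASK);
    word &= ~(TOY_NACK_MASK << TOY_NACK_SHIFT);
    return word | (val << TOY_NACK_SHIFT);
}

/* Extract the 10-bit field from word. */
static uint32_t toy_get_nack(uint32_t word)
{
    return (word >> TOY_NACK_SHIFT) & TOY_NACK_MASK;
}
```

Because only the masked bits are touched, fields that already live elsewhere in the same word are preserved.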

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

07 Mar, 2016

1 commit

  • Until now, we have kept a pre-allocated protocol message header
    aggregated into struct tipc_link. Apart from adding unnecessary
    footprint to the link instances, this requires extra code both to
    initialize and re-initialize it.

    We now remove this sub-optimization. This change also makes it
    possible to clean up the function tipc_build_proto_msg() and remove
    a couple of small functions that were accessing the mentioned header.
    In particular, we can replace all occurrences of the local function
    call link_own_addr(link) with the generic tipc_own_addr(net).

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

21 Nov, 2015

1 commit

  • We move the definition of struct tipc_link from link.h to link.c in
    order to minimize its exposure to the rest of the code.

    When needed, we define new functions to make it possible for external
    entities to access and set data in the link.

    Apart from the above, there are no functional changes.
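
The general pattern is an opaque type with accessor functions. The sketch below shows it in a single file for brevity; all names (toy_link, toy_link_mtu() and so on) are hypothetical and are not the functions added by this commit. In a real split, the first part would live in the header and the second part in the .c file.

```c
#include <assert.h>
#include <stdlib.h>

/* Part 1: what the header would expose. The struct is only declared,
 * so other files cannot touch its fields directly. */
struct toy_link;
struct toy_link *toy_link_create(unsigned int mtu);
unsigned int toy_link_mtu(const struct toy_link *l);
void toy_link_set_mtu(struct toy_link *l, unsigned int mtu);
void toy_link_destroy(struct toy_link *l);

/* Part 2: what the .c file would keep private. */
struct toy_link {
    unsigned int mtu;
};

struct toy_link *toy_link_create(unsigned int mtu)
{
    struct toy_link *l = malloc(sizeof(*l));

    if (l)
        l->mtu = mtu;
    return l; /* NULL on allocation failure */
}

unsigned int toy_link_mtu(const struct toy_link *l)
{
    assert(l);
    return l->mtu;
}

void toy_link_set_mtu(struct toy_link *l, unsigned int mtu)
{
    assert(l);
    l->mtu = mtu;
}

void toy_link_destroy(struct toy_link *l)
{
    free(l);
}
```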

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

24 Oct, 2015

6 commits

  • After the previous changes in this series, we can now remove some
    unused code and structures in the broadcast, link aggregation
    and link code.

    There are no functional changes in this commit.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • With the recent commit series, we have established a one-way dependency
    between the link aggregation (struct tipc_node) instances and their
    pertaining tipc_link instances. This has enabled quite significant code
    and structure simplifications.

    In this commit, we eliminate the field 'owner', which points to an
    instance of struct tipc_node, from struct tipc_link, and replace it with
    a pointer to struct net, which is the only external reference now needed
    by a link instance.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have only been supporting a fixed MTU size of 1500 bytes
    for all broadcast media, irrespective of their actual capability.

    We now make the broadcast MTU adaptable to the carrying media, i.e.,
    we use the smallest MTU supported by any of the interfaces attached
    to TIPC.
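
The selection rule is simply a minimum over the attached interfaces. A minimal sketch, assuming the per-interface MTUs are available as an array (the function name and the array representation are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Return the smallest MTU among n attached interfaces (n must be > 0). */
static unsigned int toy_bcast_mtu(const unsigned int *mtus, size_t n)
{
    unsigned int min;

    assert(mtus && n > 0);
    min = mtus[0];
    for (size_t i = 1; i < n; i++) {
        if (mtus[i] < min)
            min = mtus[i];
    }
    return min;
}
```

Using the minimum guarantees that a broadcast packet fits on every interface it may be sent over, at the cost of not exploiting larger MTUs where they exist.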

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have been keeping track of the exact set of broadcast
    destinations through the helper structure tipc_node_map. This forces us
    to maintain a whole infrastructure for supporting this, including
    a pseudo-bearer and a number of functions to manipulate both the bearers
    and the node map correctly. Apart from the complexity, this approach is
    also limiting, as struct tipc_node_map can only support cluster local
    broadcast if we want to avoid it becoming excessively large. We want to
    eliminate this limitation, in order to enable introduction of scoped
    multicast in the future.

    A closer analysis reveals that it is unnecessary to maintain this "full
    set" overview; it is sufficient to keep a counter per bearer, indicating
    how many nodes can be reached via this bearer at the moment. The protocol
    is now robust enough to handle transitional discrepancies between the
    nominal number of reachable destinations, as expected by the broadcast
    protocol itself, and the number which is actually reachable at the
    moment. The initial broadcast synchronization, in conjunction with the
    retransmission mechanism, ensures that all packets will eventually be
    acknowledged by the correct set of destinations.

    This commit introduces these changes.
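
The per-bearer counter idea can be sketched as follows. This is an illustration only: the struct, the function names and the bearer limit TOY_MAX_BEARERS are hypothetical, and the real code has to deal with locking and link state changes that are left out here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BEARERS 3 /* hypothetical limit for the example */

/* One counter per bearer replaces a full map of reachable nodes. */
struct toy_bcbase {
    unsigned int dests[TOY_MAX_BEARERS];
};

/* A node became reachable via this bearer. */
static void toy_bc_add_dest(struct toy_bcbase *bb, size_t bearer)
{
    assert(bb && bearer < TOY_MAX_BEARERS);
    bb->dests[bearer]++;
}

/* A node is no longer reachable via this bearer. */
static void toy_bc_remove_dest(struct toy_bcbase *bb, size_t bearer)
{
    assert(bb && bearer < TOY_MAX_BEARERS);
    assert(bb->dests[bearer] > 0);
    bb->dests[bearer]--;
}

/* Write the ids of bearers with at least one destination into out
 * (room for TOY_MAX_BEARERS entries); returns how many were written. */
static size_t toy_bc_select(const struct toy_bcbase *bb, size_t *out)
{
    size_t n = 0;

    assert(bb && out);
    for (size_t i = 0; i < TOY_MAX_BEARERS; i++) {
        if (bb->dests[i] > 0)
            out[n++] = i;
    }
    return n;
}
```

The counters say only how many nodes a bearer reaches, not which ones; as the commit message notes, the synchronization and retransmission mechanisms are what make that loss of detail acceptable.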

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The code path for receiving broadcast packets is currently distinct
    from the unicast path. This leads to unnecessary code and data
    duplication, something that can be avoided with some effort.

    We now introduce separate per-peer tipc_link instances for handling
    broadcast packet reception. Each receive link keeps a pointer to the
    common, single, broadcast link instance, and can hence handle release
    and retransmission of send buffers as if they belonged to its own
    instance.

    Furthermore, we let each unicast link instance keep a reference to both
    the pertaining broadcast receive link, and to the common send link.
    This makes it possible for the unicast links to easily access data for
    broadcast link synchronization, as well as for carrying acknowledges for
    received broadcast packets.

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Until now, we have tried to support both the newer, dedicated broadcast
    synchronization mechanism and the older, less safe, RESET_MSG/
    ACTIVATE_MSG based one. The latter method has turned out to be a hazard
    in a highly dynamic cluster, so we find it safer to disable it completely
    when we find that the former mechanism is supported by the peer node.

    For this purpose, we now introduce a new capability bit,
    TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
    synchronization is supported by the present node. The new bit is conveyed
    between peers in the 'capabilities' field of neighbor discovery messages.
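
A capability check of this kind could look roughly like the sketch below. The bit value assigned to TOY_CAP_BCAST_SYNCH, the simplified discovery message struct and the function name are assumptions for illustration; only the idea of reading a 'capabilities' field from the peer's discovery message comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CAP_BCAST_SYNCH (1u << 1) /* hypothetical bit value */

/* Simplified stand-in for a neighbor discovery message. */
struct toy_dsc_msg {
    uint16_t capabilities;
};

enum toy_sync_mode {
    TOY_SYNC_LEGACY,    /* RESET_MSG/ACTIVATE_MSG based synchronization */
    TOY_SYNC_DEDICATED  /* dedicated broadcast synchronization */
};

/* Choose the synchronization method from the peer's advertised bits. */
static enum toy_sync_mode toy_select_sync(const struct toy_dsc_msg *peer)
{
    assert(peer);
    if (peer->capabilities & TOY_CAP_BCAST_SYNCH)
        return TOY_SYNC_DEDICATED;
    return TOY_SYNC_LEGACY;
}
```

A peer that predates the new bit would be expected to advertise it as zero and is therefore handled with the older method, which is how a capability flag keeps mixed-version clusters working.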

    Signed-off-by: Jon Maloy
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Jon Paul Maloy