23 Jan, 2020
1 commit
-
commit dca4a17d24ee9d878836ce5eb8dc25be1ffa5729 upstream.
In commit c55c8edafa91 ("tipc: smooth change between replicast and
broadcast"), we allow instant switching between replicast and broadcast
by sending a dummy 'SYN' packet on the last used link to synchronize
packets on the links. The 'SYN' message is an object of link congestion
also, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be sent
back to the socket...
However, in that commit, we simply use the same socket 'cong_link_cnt'
counter for both the 'SYN' & normal payload message sending. Therefore,
if both the replicast & broadcast links are congested, the counter will
be not updated correctly but overwritten by the latter congestion.
Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is
reduced one by one and eventually overflowed. Consequently, further
activities on the socket will only wait for the false congestion signal
to disappear but never been met.Because sending the 'SYN' message is vital for the mechanism, it should
be done anyway. This commit fixes the issue by marking the message with
an error code e.g. 'TIPC_ERR_NO_PORT', so its sending should not face a
link congestion, there is no need to touch the socket 'cong_link_cnt'
either. In addition, in the event of any error (e.g. -ENOBUFS), we will
purge the entire payload message queue and make a return immediately.Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
Acked-by: Jon Maloy
Signed-off-by: Tuong Lien
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
19 Aug, 2019
1 commit
-
The policy for handling the skb list locks on the send and receive paths
is simple.- On the send path we never need to grab the lock on the 'xmitq' list
when the destination is an exernal node.- On the receive path we always need to grab the lock on the 'inputq'
list, irrespective of source node.However, when transmitting node local messages those will eventually
end up on the receive path of a local socket, meaning that the argument
'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the
function tipc_sk_rcv(). This has been handled by always initializing
the spinlock of the 'xmitq' list at message creation, just in case it
may end up on the receive path later, and despite knowing that the lock
in most cases never will be used.This approach is inaccurate and confusing, and has also concealed the
fact that the stated 'no lock grabbing' policy for the send path is
violated in some cases.We now clean up this by never initializing the lock at message creation,
instead doing this at the moment we find that the message actually will
enter the receive path. At the same time we fix the four locations
where we incorrectly access the spinlock on the send/error path.This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
is initialised") which has now become redundant.CC: Eric Dumazet
Reported-by: Chris Packham
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Reviewed-by: Xin Long
Signed-off-by: David S. Miller
09 Aug, 2019
1 commit
-
Since node internal messages are passed directly to the socket, it is not
possible to observe those messages via tcpdump or wireshark.We now remedy this by making it possible to clone such messages and send
the clones to the loopback interface. The clones are dropped at reception
and have no functional role except making the traffic visible.The feature is enabled if network taps are active for the loopback device.
pcap filtering restrictions require the messages to be presented to the
receiving side of the loopback device.v3 - Function dev_nit_active used to check for network taps.
- Procedure netif_rx_ni used to send cloned messages to loopback device.Signed-off-by: John Rutherford
Acked-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller
26 Jun, 2019
1 commit
-
We rename the inline function msg_get_wrapped() to the more
comprehensible msg_inner_hdr().Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
05 Apr, 2019
1 commit
-
skb somehow dequeued out of inputq before processing, it causes to
NULL pointer and kernel crashed.Add checking skb valid before using.
Fixes: c55c8edafa9 ("tipc: smooth change between replicast and broadcast")
Reported-by: Tuong Lien Tong
Acked-by: Ying Xue
Signed-off-by: Hoang Le
Acked-by: Jon Maloy
Signed-off-by: David S. Miller
27 Mar, 2019
1 commit
-
Fix the return value check which testing the wrong variable
in tipc_mcast_send_sync().Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
Reported-by: Hulk Robot
Signed-off-by: Wei Yongjun
Acked-by: Jon Maloy
Signed-off-by: David S. Miller
22 Mar, 2019
1 commit
-
In commit c55c8edafa91 ("tipc: smooth change between replicast and
broadcast") we introduced new method to eliminate the risk of message
reordering that happen in between different nodes.
Unfortunately, we forgot checking at receiving side to ignore intra node.We fix this by checking and returning if arrived message from intra node.
syzbot report:
==================================================================
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
FS: 000000000106a880(0000) GS:ffff8880ae800000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
__tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg+0xdd/0x130 net/socket.c:661
___sys_sendmsg+0x806/0x930 net/socket.c:2260
__sys_sendmsg+0x105/0x1d0 net/socket.c:2298
__do_sys_sendmsg net/socket.c:2307 [inline]
__se_sys_sendmsg net/socket.c:2305 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4401c9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace ba79875754e1708f ]---Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
Acked-by: Jon Maloy
Signed-off-by: Hoang Le
Signed-off-by: David S. Miller
20 Mar, 2019
2 commits
-
Currently, a multicast stream may start out using replicast, because
there are few destinations, and then it should ideally switch to
L2/broadcast IGMP/multicast when the number of destinations grows beyond
a certain limit. The opposite should happen when the number decreases
below the limit.To eliminate the risk of message reordering caused by method change,
a sending socket must stick to a previously selected method until it
enters an idle period of 5 seconds. Means there is a 5 seconds pause
in the traffic from the sender socket.If the sender never makes such a pause, the method will never change,
and transmission may become very inefficient as the cluster grows.With this commit, we allow such a switch between replicast and
broadcast without any need for a traffic pause.Solution is to send a dummy message with only the header, also with
the SYN bit set, via broadcast or replicast. For the data message,
the SYN bit is set and sending via replicast or broadcast (inverse
method with dummy).Then, at receiving side any messages follow first SYN bit message
(data or dummy message), they will be held in deferred queue until
another pair (dummy or data message) arrived in other link.v2: reverse christmas tree declaration
Acked-by: Jon Maloy
Signed-off-by: Hoang Le
Signed-off-by: David S. Miller -
Currently, a multicast stream uses either broadcast or replicast as
transmission method, based on the ratio between number of actual
destinations nodes and cluster size.However, when an L2 interface (e.g., VXLAN) provides pseudo
broadcast support, this becomes very inefficient, as it blindly
replicates multicast packets to all cluster/subnet nodes,
irrespective of whether they host actual target sockets or not.The TIPC multicast algorithm is able to distinguish real destination
nodes from other nodes, and hence provides a smarter and more
efficient method for transferring multicast messages than
pseudo broadcast can do.Because of this, we now make it possible for users to force
the broadcast link to permanently switch to using replicast,
irrespective of which capabilities the bearer provides,
or pretend to provide.
Conversely, we also make it possible to force the broadcast link
to always use true broadcast. While maybe less useful in
deployed systems, this may at least be useful for testing the
broadcast algorithm in small clusters.We retain the current AUTOSELECT ability, i.e., to let the broadcast link
automatically select which algorithm to use, and to switch back and forth
between broadcast and replicast as the ratio between destination
node number and cluster size changes. This remains the default method.Furthermore, we make it possible to configure the threshold ratio for
such switches. The default ratio is now set to 10%, down from 25% in the
earlier implementation.Acked-by: Jon Maloy
Signed-off-by: Hoang Le
Signed-off-by: David S. Miller
04 Sep, 2018
1 commit
-
Trivial fix for two spelling mistakes.
Signed-off-by: Zhenbo Gao
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller
28 Jul, 2018
1 commit
-
tipc_bcast_init() is never called in atomic context.
It calls kzalloc() with GFP_ATOMIC, which is not necessary.
GFP_ATOMIC can be replaced with GFP_KERNEL.This is found by a static analysis tool named DCNS written by myself.
Signed-off-by: Jia-Ju Bai
Signed-off-by: David S. Miller
08 Mar, 2018
1 commit
-
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva
Acked-by: Ying Xue
Signed-off-by: David S. Miller
02 Dec, 2017
1 commit
-
When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassemble the
buffer chain as if the message were arriving from a remote node.At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
13 Oct, 2017
1 commit
-
We often see a need for a linked list of destination identities,
sometimes containing a port number, sometimes a node identity, and
sometimes both. The currently defined struct u32_list is not generic
enough to cover all cases, so we extend it to contain two u32 integers
and rename it to struct tipc_dest_list.Signed-off-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller
09 Oct, 2017
1 commit
-
We change the initialization of the skb transmit buffer queues
in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
initialize their spinlocks. This is needed because we may, during
error conditions, need to call skb_queue_purge() on those queues
further down the stack.Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
21 Jan, 2017
4 commits
-
If the bearer carrying multicast messages supports broadcast, those
messages will be sent to all cluster nodes, irrespective of whether
these nodes host any actual destinations socket or not. This is clearly
wasteful if the cluster is large and there are only a few real
destinations for the message being sent.In this commit we extend the eligibility of the newly introduced
"replicast" transmit option. We now make it possible for a user to
select which method he wants to be used, either as a mandatory setting
via setsockopt(), or as a relative setting where we let the broadcast
layer decide which method to use based on the ratio between cluster
size and the message's actual number of destination nodes.In the latter case, a sending socket must stick to a previously
selected method until it enters an idle period of at least 5 seconds.
This eliminates the risk of message reordering caused by method change,
i.e., when changes to cluster size or number of destinations would
otherwise mandate a new method to be used.Reviewed-by: Parthasarathy Bhuvaragan
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller -
TIPC multicast messages are currently carried over a reliable
'broadcast link', making use of the underlying media's ability to
transport packets as L2 broadcast or IP multicast to all nodes in
the cluster.When the used bearer is lacking that ability, we can instead emulate
the broadcast service by replicating and sending the packets over as
many unicast links as needed to reach all identified destinations.
We now introduce a new TIPC link-level 'replicast' service that does
this.Reviewed-by: Parthasarathy Bhuvaragan
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller -
As a further preparation for the upcoming 'replicast' functionality,
we add some necessary structs and functions for looking up and returning
a list of all nodes that host destinations for a given multicast message.Reviewed-by: Parthasarathy Bhuvaragan
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller -
As a preparation for the 'replicast' functionality we are going to
introduce in the next commits, we need the broadcast base structure to
store whether bearer broadcast is available at all from the currently
used bearer or bearers.We do this by adding a new function tipc_bearer_bcast_support() to
the bearer layer, and letting the bearer selection function in
bcast.c use this to give a new boolean field, 'bcast_support' the
appropriate value.Reviewed-by: Parthasarathy Bhuvaragan
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
04 Jan, 2017
1 commit
-
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.Signed-off-by: Parthasarathy Bhuvaragan
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
30 Oct, 2016
1 commit
-
In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.This minor protocol update is fully backwards compatible.
Reported-by: John Thompson
Tested-by: John Thompson
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
03 Sep, 2016
1 commit
-
When we send broadcasts in clusters of more 70-80 nodes, we sometimes
see the broadcast link resetting because of an excessive number of
retransmissions. This is caused by a combination of two factors:1) A 'NACK crunch", where loss of broadcast packets is discovered
and NACK'ed by several nodes simultaneously, leading to multiple
redundant broadcast retransmissions.2) The fact that the NACKS as such also are sent as broadcast, leading
to excessive load and packet loss on the transmitting switch/bridge.This commit deals with the latter problem, by moving sending of
broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
bits in word 8 of the said message for this purpose, and introduce a
new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
backwards compatible.Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
07 Mar, 2016
1 commit
-
Until now, we have kept a pre-allocated protocol message header
aggregated into struct tipc_link. Apart from adding unnecessary
footprint to the link instances, this requires extra code both to
initialize and re-initialize it.We now remove this sub-optimization. This change also makes it
possible to clean up the function tipc_build_proto_msg() and remove
a couple of small functions that were accessing the mentioned header.
In particular, we can replace all occurrences of the local function
call link_own_addr(link) with the generic tipc_own_addr(net).Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
21 Nov, 2015
1 commit
-
We move the definition of struct tipc_link from link.h to link.c in
order to minimize its exposure to the rest of the code.When needed, we define new functions to make it possible for external
entities to access and set data in the link.Apart from the above, there are no functional changes.
Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
24 Oct, 2015
11 commits
-
After the previous changes in this series, we can now remove some
unused code and structures, both in the broadcast, link aggregation
and link code.There are no functional changes in this commit.
Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
With the recent commit series, we have established a one-way dependency
between the link aggregation (struct tipc_node) instances and their
pertaining tipc_link instances. This has enabled quite significant code
and structure simplifications.In this commit, we eliminate the field 'owner', which points to an
instance of struct tipc_node, from struct tipc_link, and replace it with
a pointer to struct net, which is the only external reference now needed
by a link instance.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
Until now, we have only been supporting a fix MTU size of 1500 bytes
for all broadcast media, irrespective of their actual capability.We now make the broadcast MTU adaptable to the carrying media, i.e.,
we use the smallest MTU supported by any of the interfaces attached
to TIPC.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
Until now, we have been keeping track of the exact set of broadcast
destinations though the help structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.A closer analysis reveals that it is unnecessary maintaining this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.This commit introduces these changes.
Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
The code path for receiving broadcast packets is currently distinct
from the unicast path. This leads to unnecessary code and data
duplication, something that can be avoided with some effort.We now introduce separate per-peer tipc_link instances for handling
broadcast packet reception. Each receive link keeps a pointer to the
common, single, broadcast link instance, and can hence handle release
and retransmission of send buffers as if they belonged to the own
instance.Furthermore, we let each unicast link instance keep a reference to both
the pertaining broadcast receive link, and to the common send link.
This makes it possible for the unicast links to easily access data for
broadcast link synchronization, as well as for carrying acknowledges for
received broadcast packets.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
Until now, we have tried to support both the newer, dedicated broadcast
synchronization mechanism along with the older, less safe, RESET_MSG/
ACTIVATE_MSG based one. The latter method has turned out to be a hazard
in a highly dynamic cluster, so we find it safer to disable it completely
when we find that the former mechanism is supported by the peer node.For this purpose, we now introduce a new capabability bit,
TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
syncronization is supported by the present node. The new bit is conveyed
between peers in the 'capabilities' field of neighbor discovery messages.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
This commit simplifies the broadcast link transmission function, by
leveraging previous changes to the link transmission function and the
broadcast transmission link life cycle.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
The broadcast link instance (struct tipc_link) used for sending is
currently aggregated into struct tipc_bclink. This means that we cannot
use the regular tipc_link_create() function for initiating the link, but
do instead have to initiate numerous fields directly from the
bcast_init() function.We want to reduce dependencies between the broadcast functionality
and the inner workings of tipc_link. In this commit, we introduce
a new function tipc_bclink_create() to link.c, and allocate the
instance of the link separately using this function.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
The broadcast transmission link is currently instantiated when the
network subsystem is started, i.e., on order from user space via netlink.This forces the broadcast transmission code to do unnecessary tests for
the existence of the transmission link, as well in single mode node as
in network mode.In this commit, we do instead create the link during initialization of
the name space, and remove it when it is stopped. The fact that the
transmission link now has a guaranteed longer life cycle than any of its
potential clients paves the way for further code simplifcations
and optimizations.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
The broadcast lock will need to be acquired outside bcast.c in a later
commit. For this reason, we move the lock to struct tipc_net. Consistent
with the changes in the previous commit, we also introducee two new
functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
currently using tipc_bclink_lock()/unlock() will be phased out during
the coming commits in this series.Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller -
Currently, a number of structure and function definitions related
to the broadcast functionality are unnecessarily exposed in the file
bcast.h. This obscures the fact that the external interface towards
the broadcast link in fact is very narrow, and causes unnecessary
recompilations of other files when anything changes in those
definitions.In this commit, we move as many of those definitions as is currently
possible to the file bcast.c.We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
since the name does not correctly describe the contents of this
struct, and will do so even less in the future, and because we want
to use the term 'link' more appropriately in the functionality
introduced later in this series.Finally, we rename a couple of functions, such as tipc_bclink_xmit()
and others that will be kept in the future, to include the term 'bcast'
instead.There are no functional changes in this commit.
Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller
22 Oct, 2015
1 commit
-
The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.Commit 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.This leads to the following scenario (among others leading to the same
situation):1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
no more space in the backlog queue at this level. The sender is added
to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
wake up the sender, but the pending size of 46 is bigger than the LOW
wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
for the wakeup criteria and find that 46 still is larger than 30,
even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
He is stuck.We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.Fixes: 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy
Reviewed-by: Ying Xue
Signed-off-by: David S. Miller
09 Sep, 2015
1 commit
-
If an attempt to wake up users of broadcast link is made when there is
no enough place in send queue than it may hang up inside the
tipc_sk_rcv() function since the loop breaks only after the wake up
queue becomes empty. This can lead to complete CPU stall with the
following message generated by RCU:INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies
g=54225 c=54224 q=11465)
Task dump for CPU 0:
tpch R running task 0 39949 39948 0x0000000a
ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
Call Trace:
[] sched_show_task+0xae/0x120
[] dump_cpu_task+0x38/0x40
[] rcu_dump_cpu_stacks+0x90/0xd0
[] rcu_check_callbacks+0x3eb/0x6e0
[] ? account_system_time+0x7f/0x170
[] update_process_times+0x34/0x60
[] tick_sched_handle.isra.18+0x31/0x40
[] tick_sched_timer+0x3c/0x70
[] __run_hrtimer.isra.34+0x3d/0xc0
[] hrtimer_interrupt+0xc5/0x1e0
[] ? native_smp_send_reschedule+0x42/0x60
[] local_apic_timer_interrupt+0x34/0x60
[] smp_apic_timer_interrupt+0x3c/0x60
[] apic_timer_interrupt+0x6b/0x70
[] ? _raw_spin_unlock_irqrestore+0x9/0x10
[] __wake_up_sync_key+0x4f/0x60
[] tipc_write_space+0x31/0x40 [tipc]
[] filter_rcv+0x31f/0x520 [tipc]
[] ? tipc_sk_lookup+0xc9/0x110 [tipc]
[] ? _raw_spin_lock_bh+0x19/0x30
[] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
[] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
[] tipc_node_unlock+0x186/0x190 [tipc]
[] ? kfree_skb+0x2c/0x40
[] tipc_rcv+0x2ac/0x8c0 [tipc]
[] tipc_l2_rcv_msg+0x38/0x50 [tipc]
[] __netif_receive_skb_core+0x5a3/0x950
[] __netif_receive_skb+0x13/0x60
[] netif_receive_skb_internal+0x1e/0x90
[] napi_gro_receive+0x78/0xa0
[] tg3_poll_work+0xc54/0xf40 [tg3]
[] ? consume_skb+0x2c/0x40
[] tg3_poll_msix+0x41/0x160 [tg3]
[] net_rx_action+0xe2/0x290
[] __do_softirq+0xda/0x1f0
[] irq_exit+0x76/0xa0
[] do_IRQ+0x55/0xf0
[] common_interrupt+0x6b/0x6b
The issue occurs only when tipc_sk_rcv() is used to wake up postponed
senders:tipc_bclink_wakeup_users()
// wakeupq - is a queue which consists of special
// messages with SOCK_WAKEUP type.
tipc_sk_rcv(wakeupq)
...
while (skb_queue_len(inputq)) {
filter_rcv(skb)
// Here the type of message is checked
// and if it is SOCK_WAKEUP then
// it tries to wake up a sender.
tipc_write_space(sk)
wake_up_interruptible_sync_poll()
}After the sender thread is woke up it can gather control and perform
an attempt to send a message. But if there is no enough place in send
queue it will call link_schedule_user() function which puts a message
of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
Thus the size of the queue actually is not changed and the while()
loop never exits.The approach I proposed is to wake up only senders for which there is
enough place in send queue so the described issue can't occur.
Moreover the same approach is already used to wake up senders on
unicast links.I have got into the issue on our product code but to reproduce the
issue I changed a benchmark test application (from
tipcutils/demos/benchmark) to perform the following scenario:
1. Run 64 instances of test application (nodes). It can be done
on the one physical machine.
2. Each application connects to all other using TIPC sockets in
RDM mode.
3. When setup is done all nodes start simultaneously send
broadcast messages.
4. Everything hangs up.The issue is reproducible only when a congestion on broadcast link
occurs. For example, when there are only 8 nodes it works fine since
congestion doesn't occur. Send queue limit is 40 in my case (I use a
critical importance level) and when 64 nodes send a message at the
same moment a congestion occurs every time.Signed-off-by: Dmitry S Kolmakov
Reviewed-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller
21 Jul, 2015
3 commits
-
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller -
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().We eliminate these flags in this commit.
Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller -
When the function tipc_link_xmit() is given a buffer list for
transmission, it currently consumes the list both when transmission
is successful and when it fails, except for the special case when
it encounters link congestion.This behavior is inconsistent, and needs to be corrected if we want
to avoid problems in later commits in this series.In this commit, we change this to let the function consume the list
only when transmission is successful, and leave the list with the
sender in all other cases. We also modifiy the socket code so that
it adapts to this change, i.e., purges the list when a non-congestion
error code is returned.Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller