Eric Lee / smarc-fsl-linux-kernel

10 Dec, 2015

2 commits

ef3ab7c8e tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers ... Browse Code »

[ Upstream commit 5cbb28a4bf65c7e4daa6c25b651fed8eb888c620 ]

Testing of the new UDP bearer has revealed that reception of
NAME_DISTRIBUTOR, LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE
message buffers is not prepared for the case that those may be
non-linear.

We now linearize all such buffers before they are delivered up to the
generic reception layer.

In order for the commit to apply cleanly to 'net' and 'stable', we do
the change in the function tipc_udp_recv() for now. Later, we will post
a commit to 'net-next' moving the linearization to generic code, in
tipc_named_rcv() and tipc_link_proto_rcv().

Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jon Paul Maloy
2015-12-10 03:03:11 +0800
d45ed6c1f tipc: allow non-linear first fragment buffer ... Browse Code »

[ Upstream commit 45c8b7b175ceb2d542e0fe15247377bf3bce29ec ]

The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938bede ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jon Paul Maloy
2015-12-10 03:03:10 +0800

27 Oct, 2015

1 commit

7b61554c2 tipc: move fragment importance field to new header position ... Browse Code »

[ Upstream commit dde4b5ae65de659b9ec64bafdde0430459fcb495 ]

In commit e3eea1eb47a ("tipc: clean up handling of message priorities")
we introduced a field in the packet header for keeping track of the
priority of fragments, since this value is not present in the specified
protocol header. Since the value so far only is used at the transmitting
end of the link, we have not yet officially defined it as part of the
protocol.

Unfortunately, the field we use for keeping this value, bits 13-15 in
in word 5, has turned out to be a poor choice; it is already used by the
broadcast protocol for carrying the 'network id' field of the sending
node. Since packet fragments also need to be transported across the
broadcast protocol, the risk of conflict is obvious, and we see this
happen when we use network identities larger than 2^13-1. This has
escaped our testing because we have so far only been using small network
id values.

We now move this field to bits 0-2 in word 9, a field that is guaranteed
to be unused by all involved protocols.

Fixes: e3eea1eb47a ("tipc: clean up handling of message priorities")
Signed-off-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jon Paul Maloy
2015-10-27 08:51:51 +0800

30 Sep, 2015

1 commit

bb18cdc8b net/tipc: initialize security state for new connection socket ... Browse Code »

[ Upstream commit fdd75ea8df370f206a8163786e7470c1277a5064 ]

Calling connect() with an AF_TIPC socket would trigger a series
of error messages from SELinux along the lines of:
SELinux: Invalid class 0
type=AVC msg=audit(1434126658.487:34500): avc: denied { }
for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
tcontext=system_u:object_r:unlabeled_t:s0 tclass=
permissive=0

This was due to a failure to initialize the security state of the new
connection sock by the tipc code, leaving it with junk in the security
class field and an unlabeled secid. Add a call to security_sk_clone()
to inherit the security state from the parent socket.

Reported-by: Tim Shearer
Signed-off-by: Stephen Smalley
Acked-by: Paul Moore
Acked-by: Ying Xue
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Stephen Smalley
2015-09-30 01:26:21 +0800

11 Jun, 2015

1 commit

b3be5e3e7 tipc: disconnect socket directly after probe failure ... Browse Code »

If the TIPC connection timer expires in a probing state, a
self abort message is supposed to be generated and delivered
to the local socket. This is currently broken, and the abort
message is actually sent out to the peer node with invalid
addressing information. This will cause the link to enter
a constant retransmission state and eventually reset.
We fix this by removing the self-abort message creation and
tear down connection immediately instead.

Signed-off-by: Erik Hugne
Reviewed-by: Ying Xue
Reviewed-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-06-11 13:05:20 +0800

30 Apr, 2015

2 commits

0d699f28e tipc: fix problem with parallel link synchronization mechanism ... Browse Code »

Currently, we try to accumulate arrived packets in the links's
'deferred' queue during the parallel link syncronization phase.

This entails two problems:

- With an unlucky combination of arriving packets the algorithm
may go into a lockstep with the out-of-sequence handling function,
where the synch mechanism is adding a packet to the deferred queue,
while the out-of-sequence handling is retrieving it again, thus
ending up in a loop inside the node_lock scope.

- Even if this is avoided, the link will very often send out
unnecessary protocol messages, in the worst case leading to
redundant retransmissions.

We fix this by just dropping arriving packets on the upcoming link
during the synchronization phase, thus relying on the retransmission
protocol to resolve the situation once the two links have arrived to
a synchronized state.

Reviewed-by: Erik Hugne
Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-04-30 03:08:59 +0800
f2f67390a tipc: remove wrong use of NLM_F_MULTI ... Browse Code »

NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: 35b9dd7607f0 ("tipc: add bearer get/dump to new netlink api")
Fixes: 7be57fc69184 ("tipc: add link get/dump to new netlink api")
Fixes: 46f15c6794fb ("tipc: add media get/dump to new netlink api")
CC: Richard Alpe
CC: Jon Maloy
CC: Ying Xue
CC: tipc-discussion@lists.sourceforge.net
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-04-30 02:59:17 +0800

23 Apr, 2015

3 commits

73a317377 tipc: fix node refcount issue ... Browse Code »

When link statistics is dumped over netlink, we iterate over
the list of peer nodes and append each links statistics to
the netlink msg. In the case where the dump is resumed after
filling up a nlmsg, the node refcnt is decremented without
having been incremented previously which may cause the node
reference to be freed. When this happens, the following
info/stacktrace will be generated, followed by a crash or
undefined behavior.
We fix this by removing the erroneous call to tipc_node_put
inside the loop that iterates over nodes.

[ 384.312303] INFO: trying to register non-static key.
[ 384.313110] the code is fine but needs lockdep annotation.
[ 384.313290] turning off the locking correctness validator.
[ 384.313290] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0+ #13
[ 384.313290] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 384.313290] ffff88003c6d0290 ffff88003cc03ca8 ffffffff8170adf1 0000000000000007
[ 384.313290] ffffffff82728730 ffff88003cc03d38 ffffffff810a6a6d 00000000001d7200
[ 384.313290] ffff88003c6d0ab0 ffff88003cc03ce8 0000000000000285 0000000000000001
[ 384.313290] Call Trace:
[ 384.313290] [] dump_stack+0x4c/0x65
[ 384.313290] [] __lock_acquire+0xf3d/0xf50
[ 384.313290] [] lock_acquire+0xd5/0x290
[ 384.313290] [] ? link_timeout+0x1c/0x170 [tipc]
[ 384.313290] [] ? link_state_event+0x4e0/0x4e0 [tipc]
[ 384.313290] [] _raw_spin_lock_bh+0x40/0x80
[ 384.313290] [] ? link_timeout+0x1c/0x170 [tipc]
[ 384.313290] [] link_timeout+0x1c/0x170 [tipc]
[ 384.313290] [] call_timer_fn+0xb8/0x490
[ 384.313290] [] ? process_timeout+0x10/0x10
[ 384.313290] [] run_timer_softirq+0x21c/0x420
[ 384.313290] [] ? link_state_event+0x4e0/0x4e0 [tipc]
[ 384.313290] [] __do_softirq+0xf4/0x630
[ 384.313290] [] irq_exit+0x5d/0x60
[ 384.313290] [] smp_apic_timer_interrupt+0x41/0x50
[ 384.313290] [] apic_timer_interrupt+0x70/0x80
[ 384.313290] [] ? default_idle+0x20/0x210
[ 384.313290] [] ? default_idle+0x1e/0x210
[ 384.313290] [] arch_cpu_idle+0xa/0x10
[ 384.313290] [] cpu_startup_entry+0x2c3/0x530
[ 384.313290] [] ? clockevents_register_device+0x113/0x200
[ 384.313290] [] start_secondary+0x13f/0x170

Fixes: 8a0f6ebe8494 ("tipc: involve reference counter for node structure")
Signed-off-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-04-23 23:50:34 +0800
9871b27f6 tipc: fix random link reset problem ... Browse Code »

In the function tipc_sk_rcv(), the stack variable 'err'
is only initialized to TIPC_ERR_NO_PORT for the first
iteration over the link input queue. If a chain of messages
are received from a link, failure to lookup the socket for
any but the first message will cause the message to bounce back
out on a random link.
We fix this by properly initializing err.

Signed-off-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-04-23 23:50:34 +0800
def81f69b tipc: fix topology server broken issue ... Browse Code »

When a new topology server is launched in a new namespace, its
listening socket is inserted into the "init ns" namespace's socket
hash table rather than the one owned by the new namespace. Although
the socket's namespace is forcedly changed to the new namespace later,
the socket is still stored in the socket hash table of "init ns"
namespace. When a client created in the new namespace connects
its own topology server, the connection is failed as its server's
socket could not be found from its own namespace's socket table.

If __sock_create() instead of original sock_create_kern() is used
to create the server's socket through specifying an expected namesapce,
the socket will be inserted into the specified namespace's socket
table, thereby avoiding to the topology server broken issue.

Fixes: 76100a8a64bc ("tipc: fix netns refcnt leak")

Reported-by: Erik Hugne
Signed-off-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Ying Xue
2015-04-23 23:50:34 +0800

08 Apr, 2015

1 commit

79b16aade udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb(). ... Browse Code »

That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller
Acked-by: Hannes Frederic Sowa

David Miller
2015-04-08 03:29:08 +0800

03 Apr, 2015

4 commits

ed193ece2 tipc: simplify link mtu negotiation ... Browse Code »

When a link is being established, the two endpoints advertise their
respective interface MTU in the transmitted RESET and ACTIVATE messages.
If there is any difference, the lower of the two MTUs will be selected
for use by both endpoints.

However, as a remnant of earlier attempts to introduce TIPC level
routing. there also exists an MTU discovery mechanism. If an intermediate
node has a lower MTU than the two endpoints, they will discover this
through a bisectional approach, and finally adopt this MTU for common use.

Since there is no TIPC level routing, and probably never will be,
this mechanism doesn't make any sense, and only serves to make the
link level protocol unecessarily complex.

In this commit, we eliminate the MTU discovery algorithm,and fall back
to the simple MTU advertising approach. This change is fully backwards
compatible.

Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-04-03 04:27:12 +0800
dff29b1a8 tipc: eliminate delayed link deletion at link failover ... Browse Code »

When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.

In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.

We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.

Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-04-03 04:27:12 +0800
2da714251 tipc: drop tunneled packet duplicates at reception ... Browse Code »

In commit 8b4ed8634f8b3f9aacfc42b4a872d30c36b9e255
("tipc: eliminate race condition at dual link establishment")
we introduced a parallel link synchronization mechanism that
guarentees sequential delivery even for users switching from
an old to a newly established link. The new mechanism makes it
unnecessary to deliver the tunneled duplicate packets back to
the old link, as we are currently doing. It is now sufficient
to use the last tunneled packet's inner sequence number as
synchronization point between the two parallel links, whereafter
it can be dropped.

In this commit, we drop the duplicate packets arriving on the new
link, after updating the synchronization point at each new arrival.

Although it would now have been sufficient for the other endpoint
to only tunnel the last packet in its send queue, and not the
entire queue, we must still do this to maintain compatibility
with older nodes.

This commit makes it possible to get rid if some complex
interaction between the two parallel links.

Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-04-03 04:27:12 +0800
9f0d34bc3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/usb/asix_common.c
drivers/net/usb/sr9800.c
drivers/net/usb/usbnet.c
include/linux/usb/usbnet.h
net/ipv4/tcp_ipv4.c
net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes. In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller

David S. Miller
2015-04-03 04:16:53 +0800

01 Apr, 2015

1 commit

7e4369057 tipc: fix a slab object leak ... Browse Code »

When remove TIPC module, there is a warning to remind us that a slab
object is leaked like:

root@localhost:~# rmmod tipc
[ 19.056226] =============================================================================
[ 19.057549] BUG TIPC (Not tainted): Objects remaining in TIPC on kmem_cache_close()
[ 19.058736] -----------------------------------------------------------------------------
[ 19.058736]
[ 19.060287] INFO: Slab 0xffffea0000519a00 objects=23 used=1 fp=0xffff880014668b00 flags=0x100000000004080
[ 19.061915] INFO: Object 0xffff880014668000 @offset=0
[ 19.062717] kmem_cache_destroy TIPC: Slab cache still has objects

This is because the listening socket of TIPC topology server is not
closed before TIPC proto handler is unregistered with proto_unregister().
However, as the socket is closed in tipc_exit_net() which is called by
unregister_pernet_subsys() during unregistering TIPC namespace operation,
the warning can be eliminated if calling unregister_pernet_subsys() is
moved before calling proto_unregister().

Fixes: e05b31f4bf89 ("tipc: make tipc socket support net namespace")
Reviewed-by: Erik Hugne
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-04-01 11:10:08 +0800

30 Mar, 2015

3 commits

d482994fc tipc: fix two bugs in secondary destination lookup ... Browse Code »

A message sent to a node after a successful name table lookup may still
find that the destination socket has disappeared, because distribution
of name table updates is non-atomic. If so, the message will be rejected
back to the sender with error code TIPC_ERR_NO_PORT. If the source
socket of the message has disappeared in the meantime, the message
should be dropped.

However, in the currrent code, the message will instead be subject to an
unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
doesn't check if there is an error code present in the message before
performing the lookup. In the worst case, the message may now find the
old destination again, and be redirected once more, instead of being
dropped directly as it should be.

A second bug in this function is that the "prev_node" field in the message
is not updated after successful lookup, something that may have
unpredictable consequences.

The problems arising from those bugs occur very infrequently.

The third change in this function; the test on msg_reroute_msg_cnt() is
purely cosmetic, reflecting that the returned value never can be negative.

This commit corrects the two bugs described above.

Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-30 04:47:36 +0800
8a0f6ebe8 tipc: involve reference counter for node structure ... Browse Code »

TIPC node hash node table is protected with rcu lock on read side.
tipc_node_find() is used to look for a node object with node address
through iterating the hash node table. As the entire process of what
tipc_node_find() traverses the table is guarded with rcu read lock,
it's safe for us. However, when callers use the node object returned
by tipc_node_find(), there is no rcu read lock applied. Therefore,
this is absolutely unsafe for callers of tipc_node_find().

Now we introduce a reference counter for node structure. Before
tipc_node_find() returns node object to its caller, it first increases
the reference counter. Accordingly, after its caller used it up,
it decreases the counter again. This can prevent a node being used by
one thread from being freed by another thread.

Reviewed-by: Erik Hugne
Reviewed-by: Jon Maloy
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-03-30 03:40:28 +0800
b952b2bef tipc: fix potential deadlock when all links are reset ... Browse Code »

[ 60.988363] ======================================================
[ 60.988754] [ INFO: possible circular locking dependency detected ]
[ 60.989152] 3.19.0+ #194 Not tainted
[ 60.989377] -------------------------------------------------------
[ 60.989781] swapper/3/0 is trying to acquire lock:
[ 60.990079] (&(&n_ptr->lock)->rlock){+.-...}, at: [] tipc_link_retransmit+0x1aa/0x240 [tipc]
[ 60.990743]
[ 60.990743] but task is already holding lock:
[ 60.991106] (&(&bclink->lock)->rlock){+.-...}, at: [] tipc_bclink_lock+0x8e/0xa0 [tipc]
[ 60.991738]
[ 60.991738] which lock already depends on the new lock.
[ 60.991738]
[ 60.992174]
[ 60.992174] the existing dependency chain (in reverse order) is:
[ 60.992174]
-> #1 (&(&bclink->lock)->rlock){+.-...}:
[ 60.992174] [] lock_acquire+0x9c/0x140
[ 60.992174] [] _raw_spin_lock_bh+0x3f/0x50
[ 60.992174] [] tipc_bclink_lock+0x8e/0xa0 [tipc]
[ 60.992174] [] tipc_bclink_add_node+0x97/0xf0 [tipc]
[ 60.992174] [] tipc_node_link_up+0xf5/0x110 [tipc]
[ 60.992174] [] link_state_event+0x2b3/0x4f0 [tipc]
[ 60.992174] [] tipc_link_proto_rcv+0x24c/0x418 [tipc]
[ 60.992174] [] tipc_rcv+0x827/0xac0 [tipc]
[ 60.992174] [] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[ 60.992174] [] __netif_receive_skb_core+0x746/0x980
[ 60.992174] [] __netif_receive_skb+0x21/0x70
[ 60.992174] [] netif_receive_skb_internal+0x35/0x130
[ 60.992174] [] napi_gro_receive+0x158/0x1d0
[ 60.992174] [] e1000_clean_rx_irq+0x155/0x490
[ 60.992174] [] e1000_clean+0x267/0x990
[ 60.992174] [] net_rx_action+0x150/0x360
[ 60.992174] [] __do_softirq+0x123/0x360
[ 60.992174] [] irq_exit+0x8e/0xb0
[ 60.992174] [] do_IRQ+0x65/0x110
[ 60.992174] [] ret_from_intr+0x0/0x13
[ 60.992174] [] arch_cpu_idle+0xf/0x20
[ 60.992174] [] cpu_startup_entry+0x2f6/0x3f0
[ 60.992174] [] start_secondary+0x13a/0x150
[ 60.992174]
-> #0 (&(&n_ptr->lock)->rlock){+.-...}:
[ 60.992174] [] __lock_acquire+0x163d/0x1ca0
[ 60.992174] [] lock_acquire+0x9c/0x140
[ 60.992174] [] _raw_spin_lock_bh+0x3f/0x50
[ 60.992174] [] tipc_link_retransmit+0x1aa/0x240 [tipc]
[ 60.992174] [] tipc_bclink_rcv+0x611/0x640 [tipc]
[ 60.992174] [] tipc_rcv+0x616/0xac0 [tipc]
[ 60.992174] [] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[ 60.992174] [] __netif_receive_skb_core+0x746/0x980
[ 60.992174] [] __netif_receive_skb+0x21/0x70
[ 60.992174] [] netif_receive_skb_internal+0x35/0x130
[ 60.992174] [] napi_gro_receive+0x158/0x1d0
[ 60.992174] [] e1000_clean_rx_irq+0x155/0x490
[ 60.992174] [] e1000_clean+0x267/0x990
[ 60.992174] [] net_rx_action+0x150/0x360
[ 60.992174] [] __do_softirq+0x123/0x360
[ 60.992174] [] irq_exit+0x8e/0xb0
[ 60.992174] [] do_IRQ+0x65/0x110
[ 60.992174] [] ret_from_intr+0x0/0x13
[ 60.992174] [] arch_cpu_idle+0xf/0x20
[ 60.992174] [] cpu_startup_entry+0x2f6/0x3f0
[ 60.992174] [] start_secondary+0x13a/0x150
[ 60.992174]
[ 60.992174] other info that might help us debug this:
[ 60.992174]
[ 60.992174] Possible unsafe locking scenario:
[ 60.992174]
[ 60.992174] CPU0 CPU1
[ 60.992174] ---- ----
[ 60.992174] lock(&(&bclink->lock)->rlock);
[ 60.992174] lock(&(&n_ptr->lock)->rlock);
[ 60.992174] lock(&(&bclink->lock)->rlock);
[ 60.992174] lock(&(&n_ptr->lock)->rlock);
[ 60.992174]
[ 60.992174] *** DEADLOCK ***
[ 60.992174]
[ 60.992174] 3 locks held by swapper/3/0:
[ 60.992174] #0: (rcu_read_lock){......}, at: [] __netif_receive_skb_core+0x71/0x980
[ 60.992174] #1: (rcu_read_lock){......}, at: [] tipc_l2_rcv_msg+0x5/0xd0 [tipc]
[ 60.992174] #2: (&(&bclink->lock)->rlock){+.-...}, at: [] tipc_bclink_lock+0x8e/0xa0 [tipc]
[ 60.992174]

The correct the sequence of grabbing n_ptr->lock and bclink->lock
should be that the former is first held and the latter is then taken,
which exactly happened on CPU1. But especially when the retransmission
of broadcast link is failed, bclink->lock is first held in
tipc_bclink_rcv(), and n_ptr->lock is taken in link_retransmit_failure()
called by tipc_link_retransmit() subsequently, which is demonstrated on
CPU0. As a result, deadlock occurs.

If the order of holding the two locks happening on CPU0 is reversed, the
deadlock risk will be relieved. Therefore, the node lock taken in
link_retransmit_failure() originally is moved to tipc_bclink_rcv()
so that it's obtained before bclink lock. But the precondition of
the adjustment of node lock is that responding to bclink reset event
must be moved from tipc_bclink_unlock() to tipc_node_unlock().

Reviewed-by: Erik Hugne
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-03-30 03:40:27 +0800

26 Mar, 2015

3 commits

8b4ed8634 tipc: eliminate race condition at dual link establishment ... Browse Code »

Despite recent improvements, the establishment of dual parallel
links still has a small glitch where messages can bypass each
other. When the second link in a dual-link configuration is
established, part of the first link's traffic will be steered over
to the new link. Although we do have a mechanism to ensure that
packets sent before and after the establishment of the new link
arrive in sequence to the destination node, this is not enough.
The arriving messages will still be delivered upwards in different
threads, something entailing a risk of message disordering during
the transition phase.

To fix this, we introduce a synchronization mechanism between the
two parallel links, so that traffic arriving on the new link cannot
be added to its input queue until we are guaranteed that all
pre-establishment messages have been delivered on the old, parallel
link.

This problem seems to always have been around, but its occurrence is
so rare that it has not been noticed until recent intensive testing.

Reviewed-by: Ying Xue
Reviewed-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-26 02:05:56 +0800
3127a0200 tipc: clean up handling of link congestion ... Browse Code »

After the recent changes in message importance handling it becomes
possible to simplify handling of messages and sockets when we
encounter link congestion.

We merge the function tipc_link_cong() into link_schedule_user(),
and simplify the code of the latter. The code should now be
easier to follow, especially regarding return codes and handling
of the message that caused the situation.

In case the scheduling function is unable to pre-allocate a wakeup
message buffer, it now returns -ENOBUFS, which is a more correct
code than the previously used -EHOSTUNREACH.

Reviewed-by: Ying Xue
Reviewed-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-26 02:05:56 +0800
1f66d161a tipc: introduce starvation free send algorithm ... Browse Code »

Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.

We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.

Reviewed-by: Ying Xue
Reviewed-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-26 02:05:56 +0800

25 Mar, 2015

4 commits

bc14b8d6a tipc: fix a link reset issue due to retransmission failures ... Browse Code »

When a node joins a cluster while we are transmitting a fragment
stream over the broadcast link, it's missing the preceding fragments
needed to build a meaningful message. As a result, the node has to
drop it. However, as the fragment message is not acknowledged to
its sender before it's dropped, it accidentally causes link reset
of retransmission failure on the node.

Reported-by: Erik Hugne
Signed-off-by: Ying Xue
Reviewed-by: Erik Hugne
Tested-by: Erik Hugne
Signed-off-by: David S. Miller

Ying Xue
2015-03-25 23:43:32 +0800
b5e2c150a rhashtable: Disable automatic shrinking by default ... Browse Code »

Introduce a new bool automatic_shrinking to require the
user to explicitly opt-in to automatic shrinking of tables.

Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller

Thomas Graf
2015-03-25 05:48:40 +0800
ed3e852aa tipc: fix compile error when IPV6=m and TIPC=y ... Browse Code »

When IPV6=m and TIPC=y, below error will appear during building kernel
image:

net/tipc/udp_media.c:196:
undefined reference to `ip6_dst_lookup'
make: *** [vmlinux] Error 1

As ip6_dst_lookup() is implemented in IPV6 and IPV6 is compiled as
module, ip6_dst_lookup() is not built-in core kernel image. As a
result, compiler cannot find 'ip6_dst_lookup' reference while
compiling TIPC code into core kernel image.

But with the method introduced by commit 5f81bd2e5d80 ("ipv6: export a
stub for IPv6 symbols used by vxlan"), we can avoid the compile error
through "ipv6_stub" pointer to access ip6_dst_lookup().

Fixes: d0f91938bede ("tipc: add ip/udp media type")
Suggested-by: Marcelo Ricardo Leitner
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-03-25 03:20:29 +0800
610600c8c tipc: validate length of sockaddr in connect() for dgram/rdm ... Browse Code »

Commit f2f8036 ("tipc: add support for connect() on dgram/rdm sockets")
hasn't validated user input length for the sockaddr structure which allows
a user to overwrite kernel memory with arbitrary input.

Fixes: f2f8036 ("tipc: add support for connect() on dgram/rdm sockets")
Signed-off-by: Sasha Levin
Acked-by: Ying Xue
Signed-off-by: David S. Miller

Sasha Levin
2015-03-25 00:50:39 +0800

24 Mar, 2015

1 commit

6d0229498 tipc: Use default rhashtable hashfn ... Browse Code »

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable. The default is now jhash so removing the setting
makes no difference apart from making one less copy of jhash in
the kernel.

Signed-off-by: Herbert Xu
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Herbert Xu
2015-03-24 10:07:51 +0800

21 Mar, 2015

1 commit

6cca7289d tipc: Use inlined rhashtable interface ... Browse Code »

This patch converts tipc to the inlined rhashtable interface.

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2015-03-21 04:16:24 +0800

20 Mar, 2015

4 commits

446981e5f tipc: fix build issue when building without IPv6 ... Browse Code »

We can't directly call ipv6_sock_mc_join() but should use the stub
instead and protect it around IS_ENABLED.

Fixes: d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2015-03-20 04:06:27 +0800
f2f8036e3 tipc: add support for connect() on dgram/rdm sockets ... Browse Code »

Following the example of ip4_datagram_connect, we store the
address in the socket structure for dgram/rdm sockets and use
that as the default destination for subsequent send() calls.
It is allowed to connect to any address types, and the behaviour
of send() will be the same as a normal sendto() with this address
provided. Binding to an AF_UNSPEC address clears the association.

Signed-off-by: Erik Hugne
Reviewed-by: Ying Xue
Reviewed-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-03-20 00:25:54 +0800
3bd88ee7a tipc: do not report -EHOSTUNREACH for failed local delivery ... Browse Code »

Since commit 1186adf7df04 ("tipc: simplify message forwarding and
rejection in socket layer") -EHOSTUNREACH is propagated back to
the sending process if we fail to deliver the message to another
socket local to the node.
This is wrong, host unreachable should only be reported when the
destination port/name does not exist in the cluster, and that
check is always done before sending the message. Also, this
introduces inconsistent sendmsg() behavior for local/remote
destinations. Errors occurring on the receiving side should not
trickle up to the sender. If message delivery fails TIPC should
either discard the packet or reject it back to the sender based
on the destination droppable option.

Signed-off-by: Erik Hugne
Acked-by: Ying Xue
Acked-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-03-20 00:25:54 +0800
18d6c5841 tipc: remove redundant call to tipc_node_remove_conn ... Browse Code »

tipc_node_remove_conn may be called twice if shutdown() is
called on a socket that have messages in the receive queue.
Calling this function twice does no harm, but is unnecessary
and we remove the redundant call.

Signed-off-by: Erik Hugne
Acked-by: Ying Xue
Acked-by: Jon Maloy
Signed-off-by: David S. Miller

Erik Hugne
2015-03-20 00:25:54 +0800

19 Mar, 2015

2 commits

54ff9ef36 ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop} ... Browse Code »

in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2015-03-19 10:05:09 +0800
446c89ac1 tipc: Use rhashtable max/min_size instead of max/min_shift ... Browse Code »

This patch converts tipc to use rhashtable max/min_size instead of
the obsolete max/min_shift.

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2015-03-19 00:46:40 +0800

18 Mar, 2015

3 commits

2b9bb7f33 tipc: withdraw tipc topology server name when namespace is deleted ... Browse Code »

The TIPC topology server is a per namespace service associated with the
tipc name {1, 1}. When a namespace is deleted, that name must be withdrawn
before we call sk_release_kernel because the kernel socket release is
done in init_net and trying to withdraw a TIPC name published in another
namespace will fail with an error as:

[ 170.093264] Unable to remove local publication
[ 170.093264] (type=1, lower=1, ref=2184244004, key=2184244005)

We fix this by breaking the association between the topology server name
and socket before calling sk_release_kernel.

Signed-off-by: Ying Xue
Reviewed-by: Erik Hugne
Signed-off-by: David S. Miller

Ying Xue
2015-03-18 10:11:26 +0800
8460504bd tipc: fix a potential deadlock when nametable is purged ... Browse Code »

[ 28.531768] =============================================
[ 28.532322] [ INFO: possible recursive locking detected ]
[ 28.532322] 3.19.0+ #194 Not tainted
[ 28.532322] ---------------------------------------------
[ 28.532322] insmod/583 is trying to acquire lock:
[ 28.532322] (&(&nseq->lock)->rlock){+.....}, at: [] tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[ 28.532322]
[ 28.532322] but task is already holding lock:
[ 28.532322] (&(&nseq->lock)->rlock){+.....}, at: [] tipc_nametbl_stop+0xfc/0x1f0 [tipc]
[ 28.532322]
[ 28.532322] other info that might help us debug this:
[ 28.532322] Possible unsafe locking scenario:
[ 28.532322]
[ 28.532322] CPU0
[ 28.532322] ----
[ 28.532322] lock(&(&nseq->lock)->rlock);
[ 28.532322] lock(&(&nseq->lock)->rlock);
[ 28.532322]
[ 28.532322] *** DEADLOCK ***
[ 28.532322]
[ 28.532322] May be due to missing lock nesting notation
[ 28.532322]
[ 28.532322] 3 locks held by insmod/583:
[ 28.532322] #0: (net_mutex){+.+.+.}, at: [] register_pernet_subsys+0x1f/0x50
[ 28.532322] #1: (&(&tn->nametbl_lock)->rlock){+.....}, at: [] tipc_nametbl_stop+0xb1/0x1f0 [tipc]
[ 28.532322] #2: (&(&nseq->lock)->rlock){+.....}, at: [] tipc_nametbl_stop+0xfc/0x1f0 [tipc]
[ 28.532322]
[ 28.532322] stack backtrace:
[ 28.532322] CPU: 1 PID: 583 Comm: insmod Not tainted 3.19.0+ #194
[ 28.532322] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 28.532322] ffffffff82394460 ffff8800144cb928 ffffffff81792f3e 0000000000000007
[ 28.532322] ffffffff82394460 ffff8800144cba28 ffffffff810a8080 ffff8800144cb998
[ 28.532322] ffffffff810a4df3 ffff880013e9cb10 ffffffff82b0d330 ffff880013e9cb38
[ 28.532322] Call Trace:
[ 28.532322] [] dump_stack+0x4c/0x65
[ 28.532322] [] __lock_acquire+0x740/0x1ca0
[ 28.532322] [] ? __bfs+0x23/0x270
[ 28.532322] [] ? check_irq_usage+0x96/0xe0
[ 28.532322] [] ? __lock_acquire+0x1133/0x1ca0
[ 28.532322] [] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[ 28.532322] [] lock_acquire+0x9c/0x140
[ 28.532322] [] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[ 28.532322] [] _raw_spin_lock_bh+0x3f/0x50
[ 28.532322] [] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[ 28.532322] [] tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[ 28.532322] [] tipc_nametbl_stop+0x13e/0x1f0 [tipc]
[ 28.532322] [] ? tipc_nametbl_stop+0x5/0x1f0 [tipc]
[ 28.532322] [] tipc_init_net+0x13b/0x150 [tipc]
[ 28.532322] [] ? tipc_init_net+0x5/0x150 [tipc]
[ 28.532322] [] ops_init+0x4e/0x150
[ 28.532322] [] ? trace_hardirqs_on+0xd/0x10
[ 28.532322] [] register_pernet_operations+0xf3/0x190
[ 28.532322] [] register_pernet_subsys+0x2e/0x50
[ 28.532322] [] tipc_init+0x6a/0x1000 [tipc]
[ 28.532322] [] ? 0xffffffffa0024000
[ 28.532322] [] do_one_initcall+0x89/0x1c0
[ 28.532322] [] ? kmem_cache_alloc_trace+0x50/0x1b0
[ 28.532322] [] ? do_init_module+0x2b/0x200
[ 28.532322] [] do_init_module+0x64/0x200
[ 28.532322] [] load_module+0x12f3/0x18e0
[ 28.532322] [] ? show_initstate+0x50/0x50
[ 28.532322] [] SyS_init_module+0xd9/0x110
[ 28.532322] [] sysenter_dispatch+0x7/0x1f

Before tipc_purge_publications() calls tipc_nametbl_remove_publ() to
remove a publication with a name sequence, the name sequence's lock
is held. However, when tipc_nametbl_remove_publ() calling
tipc_nameseq_remove_publ() to remove the publication, it first tries
to query name sequence instance with the publication, and then holds
the lock of the found name sequence. But as the lock may be already
taken in tipc_purge_publications(), deadlock happens like above
scenario demonstrated. As tipc_nameseq_remove_publ() doesn't grab name
sequence's lock, the deadlock can be avoided if it's directly invoked
by tipc_purge_publications().

Fixes: 97ede29e80ee ("tipc: convert name table read-write lock to RCU")
Signed-off-by: Ying Xue
Reviewed-by: Erik Hugne
Signed-off-by: David S. Miller

Ying Xue
2015-03-18 10:11:26 +0800
76100a8a6 tipc: fix netns refcnt leak ... Browse Code »

When the TIPC module is loaded, we launch a topology server in kernel
space, which in its turn is creating TIPC sockets for communication
with topology server users. Because both the socket's creator and
provider reside in the same module, it is necessary that the TIPC
module's reference count remains zero after the server is started and
the socket created; otherwise it becomes impossible to perform "rmmod"
even on an idle module.

Currently, we achieve this by defining a separate "tipc_proto_kern"
protocol struct, that is used only for kernel space socket allocations.
This structure has the "owner" field set to NULL, which restricts the
module reference count from being be bumped when sk_alloc() for local
sockets is called. Furthermore, we have defined three kernel-specific
functions, tipc_sock_create_local(), tipc_sock_release_local() and
tipc_sock_accept_local(), to avoid the module counter being modified
when module local sockets are created or deleted. This has worked well
until we introduced name space support.

However, after name space support was introduced, we have observed that
a reference count leak occurs, because the netns counter is not
decremented in tipc_sock_delete_local().

This commit remedies this problem. But instead of just modifying
tipc_sock_delete_local(), we eliminate the whole parallel socket
handling infrastructure, and start using the regular sk_create_kern(),
kernel_accept() and sk_release_kernel() calls. Since those functions
manipulate the module counter, we must now compensate for that by
explicitly decrementing the counter after module local sockets are
created, and increment it just before calling sk_release_kernel().

Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace")
Signed-off-by: Ying Xue
Reviewed-by: Jon Maloy
Reviewed-by: Erik Hugne
Reported-by: Cong Wang
Tested-by: Erik Hugne
Signed-off-by: David S. Miller

Ying Xue
2015-03-18 10:11:26 +0800

15 Mar, 2015

3 commits

e3eea1eb4 tipc: clean up handling of message priorities ... Browse Code »

Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.

There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.

This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.

In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.

This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.

Reviewed-by: Erik Hugne
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-15 02:38:32 +0800
05dcc5aa4 tipc: split link outqueue ... Browse Code »

struct tipc_link contains one single queue for outgoing packets,
where both transmitted and waiting packets are queued.

This infrastructure is hard to maintain, because we need
to keep a number of fields to keep track of which packets are
sent or unsent, and the number of packets in each category.

A lot of code becomes simpler if we split this queue into a transmission
queue, where sent/unacknowledged packets are kept, and a backlog queue,
where we keep the not yet sent packets.

In this commit we do this separation.

Reviewed-by: Erik Hugne
Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-15 02:38:32 +0800
2cdf3918e tipc: eliminate unnecessary call to broadcast ack function ... Browse Code »

The unicast packet header contains a broadcast acknowledge sequence
number, that may need to be conveyed to the broadcast link for proper
treatment. Currently, the function tipc_rcv(), which is on the most
critical data path, calls the function tipc_bclink_acknowledge() to
have this done. This call is made for each received packet, and results
in the unconditional grabbing of the broadcast link spinlock.

This is unnecessary, since we can see directly from tipc_rcv() if
the acknowledged number differs from what has been previously acked
from the node in question. In the vast majority of cases the numbers
won't differ, and there is nothing to update.

We now make the call to tipc_bclink_acknowledge() conditional
to that the two ack values differ.

Reviewed-by: Erik Hugne
Reviewed-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-03-15 02:38:32 +0800