09 Jul, 2015

1 commit

  • Calling connect() with an AF_TIPC socket would trigger a series
    of error messages from SELinux along the lines of:
    SELinux: Invalid class 0
    type=AVC msg=audit(1434126658.487:34500): avc: denied { }
    for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
    tcontext=system_u:object_r:unlabeled_t:s0 tclass=
    permissive=0

    This was due to a failure to initialize the security state of the new
    connection sock by the tipc code, leaving it with junk in the security
    class field and an unlabeled secid. Add a call to security_sk_clone()
    to inherit the security state from the parent socket.

    Reported-by: Tim Shearer
    Signed-off-by: Stephen Smalley
    Acked-by: Paul Moore
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Stephen Smalley
     

29 Jun, 2015

1 commit

  • In commit 1f66d161ab3d8b518903fa6c3f9c1f48d6919e74
    ("tipc: introduce starvation free send algorithm")
    we introduced a counter per priority level for buffers
    in the link backlog queue. We also introduced a new
    function tipc_link_purge_backlog(), to reset these
    counters to zero when the link is reset.

    Unfortunately, we missed to call this function when
    the broadcast link is reset, with the result that the
    values of these counters might be permanently skewed
    when new nodes are attached. This may in the worst case
    lead to permananent, but spurious, broadcast link
    congestion, where no broadcast packets can be sent at
    all.

    We fix this bug with this commit.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

14 Jun, 2015

1 commit


11 Jun, 2015

1 commit

  • If the TIPC connection timer expires in a probing state, a
    self abort message is supposed to be generated and delivered
    to the local socket. This is currently broken, and the abort
    message is actually sent out to the peer node with invalid
    addressing information. This will cause the link to enter
    a constant retransmission state and eventually reset.
    We fix this by removing the self-abort message creation and
    tear down connection immediately instead.

    Signed-off-by: Erik Hugne
    Reviewed-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     

31 May, 2015

1 commit


27 May, 2015

1 commit

  • In commit dd3f9e70f59f43a5712eba9cf3ee4f1e6999540c
    ("tipc: add packet sequence number at instant of transmission") we
    made a change with the consequence that packets in the link backlog
    queue don't contain valid sequence numbers.

    However, when we create a link protocol message, we still use the
    sequence number of the first packet in the backlog, if there is any,
    as "next_sent" indicator in the message. This may entail unnecessary
    retransissions or stale packet transmission when there is very low
    traffic on the link.

    This commit fixes this issue by only using the current value of
    tipc_link::snd_nxt as indicator.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

15 May, 2015

9 commits

  • After commit eeb1bd5c40ed ("net: Add a struct net parameter to
    sock_create_kern"), we should use sock_create_kern() to create kernel
    socket as the interface doesn't reference count struct net any more.

    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Currently, the packet sequence number is updated and added to each
    packet at the moment a packet is added to the link backlog queue.
    This is wasteful, since it forces the code to traverse the send
    packet list packet by packet when adding them to the backlog queue.
    It would be better to just splice the whole packet list into the
    backlog queue when that is the right action to do.

    In this commit, we do this change. Also, since the sequence numbers
    cannot now be assigned to the packets at the moment they are added
    the backlog queue, we do instead calculate and add them at the moment
    of transmission, when the backlog queue has to be traversed anyway.
    We do this in the function tipc_link_push_packet().

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The link congestion algorithm used until now implies two problems.

    - It is too generous towards lower-level messages in situations of high
    load by giving "absolute" bandwidth guarantees to the different
    priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
    20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
    available bandwidth. But, in the absence of higher level traffic, the
    ratio between two distinct levels becomes unreasonable. E.g. if there
    is only LOW and MEDIUM traffic on a system, the former is guaranteed
    1/3 of the bandwidth, and the latter 2/3. This again means that if
    there is e.g. one LOW user and 10 MEDIUM users, the former will have
    33.3% of the bandwidth, and the others will have to compete for the
    remainder, i.e. each will end up with 6.7% of the capacity.

    - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
    but only after the packets bundled into it have passed the congestion
    test for their own respective levels. Since bundled packets don't
    result in incrementing the level counter for their own importance,
    only occasionally for the SYSTEM level counter, they do in practice
    obtain SYSTEM level importance. Hence, the current implementation
    provides a gap in the congestion algorithm that in the worst case
    may lead to a link reset.

    We now refine the congestion algorithm as follows:

    - A message is accepted to the link backlog only if its own level
    counter, and all superior level counters, permit it.

    - The importance of a created bundle packet is set according to its
    contents. A bundle packet created from messges at levels LOW to
    CRITICAL is given importance level CRITICAL, while a bundle created
    from a SYSTEM level message is given importance SYSTEM. In the latter
    case only subsequent SYSTEM level messages are allowed to be bundled
    into it.

    This solves the first problem described above, by making the bandwidth
    guarantee relative to the total number of users at all levels; only
    the upper limit for each level remains absolute. In the example
    described above, the single LOW user would use 1/11th of the bandwidth,
    the same as each of the ten MEDIUM users, but he still has the same
    guarantee against starvation as the latter ones.

    The fix also solves the second problem. If the CRITICAL level is filled
    up by bundle packets of that level, no lower level packets will be
    accepted any more.

    Suggested-by: Gergely Kiss
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We change the sequence number checkpointing that is performed
    by the timer in order to discover if the peer is active. Currently,
    we store a checkpoint of the next expected sequence number "rcv_nxt"
    at each timer expiration, and compare it to the current expected
    number at next timeout expiration. Instead, we now use the already
    existing field "silent_intv_cnt" for this task. We step the counter
    at each timeout expiration, and zero it at each valid received packet.
    If no valid packet has been received from the peer after "abort_limit"
    number of silent timer intervals, the link is declared faulty and reset.

    We also remove the multiple instances of timer activation from inside
    the FSM function "link_state_event()", and now do it at only one place;
    at the end of the timer function itself.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We rename some fields in struct tipc_link, in order to give them more
    descriptive names:

    next_in_no -> rcv_nxt
    next_out_no-> snd_nxt
    fsm_msg_cnt-> silent_intv_cnt
    cont_intv -> keepalive_intv
    last_retransmitted -> last_retransm

    There are no functional changes in this commit.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Although the sequence number in the TIPC protocol is 16 bits, we have
    until now stored it internally as an unsigned 32 bits integer.
    We got around this by always doing explicit modulo-65535 operations
    whenever we need to access a sequence number.

    We now make the incoming and outgoing sequence numbers to unsigned
    16-bit integers, and remove the modulo operations where applicable.

    We also move the arithmetic inline functions for 16 bit integers
    to core.h, and the function buf_seqno() to msg.h, so they can easily
    be accessed from anywhere in the code.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When we try to add new inline functions in the code, we sometimes
    run into circular include dependencies.

    The main problem is that the file core.h, which really should be at
    the root of the dependency chain, instead is a leaf. I.e., core.h
    includes a number of header files that themselves should be allowed
    to include core.h. In reality this is unnecessary, because core.h does
    not need to know the full signature of any of the structs it refers to,
    only their type declaration.

    In this commit, we remove all dependencies from core.h towards any
    other tipc header file.

    As a consequence of this change, we can now move the function
    tipc_own_addr(net) from addr.c to addr.h, and make it inline.

    There are no functional changes in this commit.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Prior to this commit, the link timer has been running at a "continuity
    interval" of configured link tolerance/4. When a timer wakes up and
    discovers that there has been no sign of life from the peer during the
    previous interval, it divides its own timer interval by another factor
    four, and starts sending one probe per new interval. When the configured
    link tolerance time has passed without answer, i.e. after 16 unacked
    probes, the link is declared faulty and reset.

    This is unnecessary complex. It is sufficient to continue with the
    original continuity interval, and instead reset the link after four
    missed probe responses. This makes the timer handling in the link
    simpler, and opens up for some planned later changes in this area.
    This commit implements this change.

    Reviewed-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494
    ("tipc: eliminate delayed link deletion at link failover") the extra
    boolean parameter "shutting_down" is not any longer needed for the
    functions bearer_disable() and tipc_link_delete_list().

    Furhermore, the function tipc_link_reset_links(), called from
    bearer_reset() is now unnecessary. We can just as well delete
    all the links, as we do in bearer_disable(), and start over with
    creating new links.

    This commit introduces those changes.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

11 May, 2015

1 commit


10 May, 2015

3 commits

  • The legacy netlink API treated EPERM (permission denied) as
    "operation not supported".

    Reported-by: Tomi Ollila
    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Add the ability to get or set the broadcast link window through the
    new netlink API. The functionality was unintentionally missing from
    the new netlink API. Adding this means that we also fix the breakage
    in the old API when coming through the compat layer.

    Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat)
    Reported-by: Tomi Ollila
    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Default link properties can be set for media or bearer. This
    functionality was missed when introducing the NL compatibility layer.

    This patch implements this functionality in the compat netlink
    layer. It works the same way as it did in the old API. We search for
    media and bearers matching the "link name". If we find a matching
    media or bearer the link tolerance, priority or window is used as
    default for new links on that media or bearer.

    Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat)
    Reported-by: Tomi Ollila
    Signed-off-by: Richard Alpe
    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     

05 May, 2015

5 commits

  • Once tipc_conn_new() returns NULL, the connection should be shut
    down immediately, otherwise, oops may happen due to the NULL pointer.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Currently subscriber's lock protects not only subscriber's subscription
    list but also all subscriptions linked into the list. However, as all
    members of subscription are never changed after they are initialized,
    it's unnecessary for subscription to be protected under subscriber's
    lock. If the lock is used to only protect subscriber's subscription
    list, the adjustment not only makes the locking policy simpler, but
    also helps to avoid a deadlock which may happen once creating a
    subscription is failed.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • At present subscriber's lock is used to protect the subscription list
    of subscriber as well as subscriptions linked into the list. While one
    or all subscriptions are deleted through iterating the list, the
    subscriber's lock must be held. Meanwhile, as deletion of subscription
    may happen in subscription timer's handler, the lock must be grabbed
    in the function as well. When subscription's timer is terminated with
    del_timer_sync() during above iteration, subscriber's lock has to be
    temporarily released, otherwise, deadlock may occur. However, the
    temporary release may cause the double free of a subscription as the
    subscription is not disconnected from the subscription list.

    Now if a reference counter is introduced to subscriber, subscription's
    timer can be asynchronously stopped with del_timer(). As a result, the
    issue is not only able to be fixed, but also relevant code is pretty
    readable and understandable.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Introducing a new function makes the purpose of tipc_subscrb_connect_cb
    callback routine more clear.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • When a topology server accepts a connection request from its client,
    it allocates a connection instance and a tipc_subscriber structure
    object. The former is used to communicate with client, and the latter
    is often treated as a subscriber which manages all subscription events
    requested from a same client. When a topology server receives a request
    of subscribing name services from a client through the connection, it
    creates a tipc_subscription structure instance which is seen as a
    subscription recording what name services are subscribed. In order to
    manage all subscriptions from a same client, topology server links
    them into the subscrp_list of the subscriber. So subscriber and
    subscription completely represents different meanings respectively,
    but function names associated with them make us so confused that we
    are unable to easily tell which function is against subscriber and
    which is to subscription. So we want to eliminate the confusion by
    renaming them.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     

30 Apr, 2015

2 commits

  • Currently, we try to accumulate arrived packets in the links's
    'deferred' queue during the parallel link syncronization phase.

    This entails two problems:

    - With an unlucky combination of arriving packets the algorithm
    may go into a lockstep with the out-of-sequence handling function,
    where the synch mechanism is adding a packet to the deferred queue,
    while the out-of-sequence handling is retrieving it again, thus
    ending up in a loop inside the node_lock scope.

    - Even if this is avoided, the link will very often send out
    unnecessary protocol messages, in the worst case leading to
    redundant retransmissions.

    We fix this by just dropping arriving packets on the upcoming link
    during the synchronization phase, thus relying on the retransmission
    protocol to resolve the situation once the two links have arrived to
    a synchronized state.

    Reviewed-by: Erik Hugne
    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
    it is sent only at the end of a dump.

    Libraries like libnl will wait forever for NLMSG_DONE.

    Fixes: 35b9dd7607f0 ("tipc: add bearer get/dump to new netlink api")
    Fixes: 7be57fc69184 ("tipc: add link get/dump to new netlink api")
    Fixes: 46f15c6794fb ("tipc: add media get/dump to new netlink api")
    CC: Richard Alpe
    CC: Jon Maloy
    CC: Ying Xue
    CC: tipc-discussion@lists.sourceforge.net
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

23 Apr, 2015

3 commits

  • When link statistics is dumped over netlink, we iterate over
    the list of peer nodes and append each links statistics to
    the netlink msg. In the case where the dump is resumed after
    filling up a nlmsg, the node refcnt is decremented without
    having been incremented previously which may cause the node
    reference to be freed. When this happens, the following
    info/stacktrace will be generated, followed by a crash or
    undefined behavior.
    We fix this by removing the erroneous call to tipc_node_put
    inside the loop that iterates over nodes.

    [ 384.312303] INFO: trying to register non-static key.
    [ 384.313110] the code is fine but needs lockdep annotation.
    [ 384.313290] turning off the locking correctness validator.
    [ 384.313290] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0+ #13
    [ 384.313290] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 384.313290] ffff88003c6d0290 ffff88003cc03ca8 ffffffff8170adf1 0000000000000007
    [ 384.313290] ffffffff82728730 ffff88003cc03d38 ffffffff810a6a6d 00000000001d7200
    [ 384.313290] ffff88003c6d0ab0 ffff88003cc03ce8 0000000000000285 0000000000000001
    [ 384.313290] Call Trace:
    [ 384.313290] [] dump_stack+0x4c/0x65
    [ 384.313290] [] __lock_acquire+0xf3d/0xf50
    [ 384.313290] [] lock_acquire+0xd5/0x290
    [ 384.313290] [] ? link_timeout+0x1c/0x170 [tipc]
    [ 384.313290] [] ? link_state_event+0x4e0/0x4e0 [tipc]
    [ 384.313290] [] _raw_spin_lock_bh+0x40/0x80
    [ 384.313290] [] ? link_timeout+0x1c/0x170 [tipc]
    [ 384.313290] [] link_timeout+0x1c/0x170 [tipc]
    [ 384.313290] [] call_timer_fn+0xb8/0x490
    [ 384.313290] [] ? process_timeout+0x10/0x10
    [ 384.313290] [] run_timer_softirq+0x21c/0x420
    [ 384.313290] [] ? link_state_event+0x4e0/0x4e0 [tipc]
    [ 384.313290] [] __do_softirq+0xf4/0x630
    [ 384.313290] [] irq_exit+0x5d/0x60
    [ 384.313290] [] smp_apic_timer_interrupt+0x41/0x50
    [ 384.313290] [] apic_timer_interrupt+0x70/0x80
    [ 384.313290] [] ? default_idle+0x20/0x210
    [ 384.313290] [] ? default_idle+0x1e/0x210
    [ 384.313290] [] arch_cpu_idle+0xa/0x10
    [ 384.313290] [] cpu_startup_entry+0x2c3/0x530
    [ 384.313290] [] ? clockevents_register_device+0x113/0x200
    [ 384.313290] [] start_secondary+0x13f/0x170

    Fixes: 8a0f6ebe8494 ("tipc: involve reference counter for node structure")
    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • In the function tipc_sk_rcv(), the stack variable 'err'
    is only initialized to TIPC_ERR_NO_PORT for the first
    iteration over the link input queue. If a chain of messages
    are received from a link, failure to lookup the socket for
    any but the first message will cause the message to bounce back
    out on a random link.
    We fix this by properly initializing err.

    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • When a new topology server is launched in a new namespace, its
    listening socket is inserted into the "init ns" namespace's socket
    hash table rather than the one owned by the new namespace. Although
    the socket's namespace is forcedly changed to the new namespace later,
    the socket is still stored in the socket hash table of "init ns"
    namespace. When a client created in the new namespace connects
    its own topology server, the connection is failed as its server's
    socket could not be found from its own namespace's socket table.

    If __sock_create() instead of original sock_create_kern() is used
    to create the server's socket through specifying an expected namesapce,
    the socket will be inserted into the specified namespace's socket
    table, thereby avoiding to the topology server broken issue.

    Fixes: 76100a8a64bc ("tipc: fix netns refcnt leak")

    Reported-by: Erik Hugne
    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     

08 Apr, 2015

1 commit


03 Apr, 2015

4 commits

  • When a link is being established, the two endpoints advertise their
    respective interface MTU in the transmitted RESET and ACTIVATE messages.
    If there is any difference, the lower of the two MTUs will be selected
    for use by both endpoints.

    However, as a remnant of earlier attempts to introduce TIPC level
    routing. there also exists an MTU discovery mechanism. If an intermediate
    node has a lower MTU than the two endpoints, they will discover this
    through a bisectional approach, and finally adopt this MTU for common use.

    Since there is no TIPC level routing, and probably never will be,
    this mechanism doesn't make any sense, and only serves to make the
    link level protocol unecessarily complex.

    In this commit, we eliminate the MTU discovery algorithm,and fall back
    to the simple MTU advertising approach. This change is fully backwards
    compatible.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When a bearer is disabled manually, all its links have to be reset
    and deleted. However, if there is a remaining, parallel link ready
    to take over a deleted link's traffic, we currently delay the delete
    of the removed link until the failover procedure is finished. This
    is because the remaining link needs to access state from the reset
    link, such as the last received packet number, and any partially
    reassembled buffer, in order to perform a successful failover.

    In this commit, we do instead move the state data over to the new
    link, so that it can fulfill the procedure autonomously, without
    accessing any data on the old link. This means that we can now
    proceed and delete all pertaining links immediately when a bearer
    is disabled. This saves us from some unnecessary complexity in such
    situations.

    We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
    ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
    FAILOVER_MSG and SYNCH_MSG respectively.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • In commit 8b4ed8634f8b3f9aacfc42b4a872d30c36b9e255
    ("tipc: eliminate race condition at dual link establishment")
    we introduced a parallel link synchronization mechanism that
    guarentees sequential delivery even for users switching from
    an old to a newly established link. The new mechanism makes it
    unnecessary to deliver the tunneled duplicate packets back to
    the old link, as we are currently doing. It is now sufficient
    to use the last tunneled packet's inner sequence number as
    synchronization point between the two parallel links, whereafter
    it can be dropped.

    In this commit, we drop the duplicate packets arriving on the new
    link, after updating the synchronization point at each new arrival.

    Although it would now have been sufficient for the other endpoint
    to only tunnel the last packet in its send queue, and not the
    entire queue, we must still do this to maintain compatibility
    with older nodes.

    This commit makes it possible to get rid if some complex
    interaction between the two parallel links.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Conflicts:
    drivers/net/usb/asix_common.c
    drivers/net/usb/sr9800.c
    drivers/net/usb/usbnet.c
    include/linux/usb/usbnet.h
    net/ipv4/tcp_ipv4.c
    net/ipv6/tcp_ipv6.c

    The TCP conflicts were overlapping changes. In 'net' we added a
    READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
    Eric Dumazet touched the surrounding code dealing with how mini
    sockets are handled.

    With USB, it's a case of the same bug fix first going into net-next
    and then I cherry picked it back into net.

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Apr, 2015

1 commit

  • When remove TIPC module, there is a warning to remind us that a slab
    object is leaked like:

    root@localhost:~# rmmod tipc
    [ 19.056226] =============================================================================
    [ 19.057549] BUG TIPC (Not tainted): Objects remaining in TIPC on kmem_cache_close()
    [ 19.058736] -----------------------------------------------------------------------------
    [ 19.058736]
    [ 19.060287] INFO: Slab 0xffffea0000519a00 objects=23 used=1 fp=0xffff880014668b00 flags=0x100000000004080
    [ 19.061915] INFO: Object 0xffff880014668000 @offset=0
    [ 19.062717] kmem_cache_destroy TIPC: Slab cache still has objects

    This is because the listening socket of TIPC topology server is not
    closed before TIPC proto handler is unregistered with proto_unregister().
    However, as the socket is closed in tipc_exit_net() which is called by
    unregister_pernet_subsys() during unregistering TIPC namespace operation,
    the warning can be eliminated if calling unregister_pernet_subsys() is
    moved before calling proto_unregister().

    Fixes: e05b31f4bf89 ("tipc: make tipc socket support net namespace")
    Reviewed-by: Erik Hugne
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

30 Mar, 2015

3 commits

  • A message sent to a node after a successful name table lookup may still
    find that the destination socket has disappeared, because distribution
    of name table updates is non-atomic. If so, the message will be rejected
    back to the sender with error code TIPC_ERR_NO_PORT. If the source
    socket of the message has disappeared in the meantime, the message
    should be dropped.

    However, in the currrent code, the message will instead be subject to an
    unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
    doesn't check if there is an error code present in the message before
    performing the lookup. In the worst case, the message may now find the
    old destination again, and be redirected once more, instead of being
    dropped directly as it should be.

    A second bug in this function is that the "prev_node" field in the message
    is not updated after successful lookup, something that may have
    unpredictable consequences.

    The problems arising from those bugs occur very infrequently.

    The third change in this function; the test on msg_reroute_msg_cnt() is
    purely cosmetic, reflecting that the returned value never can be negative.

    This commit corrects the two bugs described above.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC node hash node table is protected with rcu lock on read side.
    tipc_node_find() is used to look for a node object with node address
    through iterating the hash node table. As the entire process of what
    tipc_node_find() traverses the table is guarded with rcu read lock,
    it's safe for us. However, when callers use the node object returned
    by tipc_node_find(), there is no rcu read lock applied. Therefore,
    this is absolutely unsafe for callers of tipc_node_find().

    Now we introduce a reference counter for node structure. Before
    tipc_node_find() returns node object to its caller, it first increases
    the reference counter. Accordingly, after its caller used it up,
    it decreases the counter again. This can prevent a node being used by
    one thread from being freed by another thread.

    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     
  • [ 60.988363] ======================================================
    [ 60.988754] [ INFO: possible circular locking dependency detected ]
    [ 60.989152] 3.19.0+ #194 Not tainted
    [ 60.989377] -------------------------------------------------------
    [ 60.989781] swapper/3/0 is trying to acquire lock:
    [ 60.990079] (&(&n_ptr->lock)->rlock){+.-...}, at: [] tipc_link_retransmit+0x1aa/0x240 [tipc]
    [ 60.990743]
    [ 60.990743] but task is already holding lock:
    [ 60.991106] (&(&bclink->lock)->rlock){+.-...}, at: [] tipc_bclink_lock+0x8e/0xa0 [tipc]
    [ 60.991738]
    [ 60.991738] which lock already depends on the new lock.
    [ 60.991738]
    [ 60.992174]
    [ 60.992174] the existing dependency chain (in reverse order) is:
    [ 60.992174]
    -> #1 (&(&bclink->lock)->rlock){+.-...}:
    [ 60.992174] [] lock_acquire+0x9c/0x140
    [ 60.992174] [] _raw_spin_lock_bh+0x3f/0x50
    [ 60.992174] [] tipc_bclink_lock+0x8e/0xa0 [tipc]
    [ 60.992174] [] tipc_bclink_add_node+0x97/0xf0 [tipc]
    [ 60.992174] [] tipc_node_link_up+0xf5/0x110 [tipc]
    [ 60.992174] [] link_state_event+0x2b3/0x4f0 [tipc]
    [ 60.992174] [] tipc_link_proto_rcv+0x24c/0x418 [tipc]
    [ 60.992174] [] tipc_rcv+0x827/0xac0 [tipc]
    [ 60.992174] [] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
    [ 60.992174] [] __netif_receive_skb_core+0x746/0x980
    [ 60.992174] [] __netif_receive_skb+0x21/0x70
    [ 60.992174] [] netif_receive_skb_internal+0x35/0x130
    [ 60.992174] [] napi_gro_receive+0x158/0x1d0
    [ 60.992174] [] e1000_clean_rx_irq+0x155/0x490
    [ 60.992174] [] e1000_clean+0x267/0x990
    [ 60.992174] [] net_rx_action+0x150/0x360
    [ 60.992174] [] __do_softirq+0x123/0x360
    [ 60.992174] [] irq_exit+0x8e/0xb0
    [ 60.992174] [] do_IRQ+0x65/0x110
    [ 60.992174] [] ret_from_intr+0x0/0x13
    [ 60.992174] [] arch_cpu_idle+0xf/0x20
    [ 60.992174] [] cpu_startup_entry+0x2f6/0x3f0
    [ 60.992174] [] start_secondary+0x13a/0x150
    [ 60.992174]
    -> #0 (&(&n_ptr->lock)->rlock){+.-...}:
    [ 60.992174] [] __lock_acquire+0x163d/0x1ca0
    [ 60.992174] [] lock_acquire+0x9c/0x140
    [ 60.992174] [] _raw_spin_lock_bh+0x3f/0x50
    [ 60.992174] [] tipc_link_retransmit+0x1aa/0x240 [tipc]
    [ 60.992174] [] tipc_bclink_rcv+0x611/0x640 [tipc]
    [ 60.992174] [] tipc_rcv+0x616/0xac0 [tipc]
    [ 60.992174] [] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
    [ 60.992174] [] __netif_receive_skb_core+0x746/0x980
    [ 60.992174] [] __netif_receive_skb+0x21/0x70
    [ 60.992174] [] netif_receive_skb_internal+0x35/0x130
    [ 60.992174] [] napi_gro_receive+0x158/0x1d0
    [ 60.992174] [] e1000_clean_rx_irq+0x155/0x490
    [ 60.992174] [] e1000_clean+0x267/0x990
    [ 60.992174] [] net_rx_action+0x150/0x360
    [ 60.992174] [] __do_softirq+0x123/0x360
    [ 60.992174] [] irq_exit+0x8e/0xb0
    [ 60.992174] [] do_IRQ+0x65/0x110
    [ 60.992174] [] ret_from_intr+0x0/0x13
    [ 60.992174] [] arch_cpu_idle+0xf/0x20
    [ 60.992174] [] cpu_startup_entry+0x2f6/0x3f0
    [ 60.992174] [] start_secondary+0x13a/0x150
    [ 60.992174]
    [ 60.992174] other info that might help us debug this:
    [ 60.992174]
    [ 60.992174] Possible unsafe locking scenario:
    [ 60.992174]
    [ 60.992174] CPU0 CPU1
    [ 60.992174] ---- ----
    [ 60.992174] lock(&(&bclink->lock)->rlock);
    [ 60.992174] lock(&(&n_ptr->lock)->rlock);
    [ 60.992174] lock(&(&bclink->lock)->rlock);
    [ 60.992174] lock(&(&n_ptr->lock)->rlock);
    [ 60.992174]
    [ 60.992174] *** DEADLOCK ***
    [ 60.992174]
    [ 60.992174] 3 locks held by swapper/3/0:
    [ 60.992174] #0: (rcu_read_lock){......}, at: [] __netif_receive_skb_core+0x71/0x980
    [ 60.992174] #1: (rcu_read_lock){......}, at: [] tipc_l2_rcv_msg+0x5/0xd0 [tipc]
    [ 60.992174] #2: (&(&bclink->lock)->rlock){+.-...}, at: [] tipc_bclink_lock+0x8e/0xa0 [tipc]
    [ 60.992174]

    The correct the sequence of grabbing n_ptr->lock and bclink->lock
    should be that the former is first held and the latter is then taken,
    which exactly happened on CPU1. But especially when the retransmission
    of broadcast link is failed, bclink->lock is first held in
    tipc_bclink_rcv(), and n_ptr->lock is taken in link_retransmit_failure()
    called by tipc_link_retransmit() subsequently, which is demonstrated on
    CPU0. As a result, deadlock occurs.

    If the order of holding the two locks happening on CPU0 is reversed, the
    deadlock risk will be relieved. Therefore, the node lock taken in
    link_retransmit_failure() originally is moved to tipc_bclink_rcv()
    so that it's obtained before bclink lock. But the precondition of
    the adjustment of node lock is that responding to bclink reset event
    must be moved from tipc_bclink_unlock() to tipc_node_unlock().

    Reviewed-by: Erik Hugne
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

26 Mar, 2015

2 commits

  • Despite recent improvements, the establishment of dual parallel
    links still has a small glitch where messages can bypass each
    other. When the second link in a dual-link configuration is
    established, part of the first link's traffic will be steered over
    to the new link. Although we do have a mechanism to ensure that
    packets sent before and after the establishment of the new link
    arrive in sequence to the destination node, this is not enough.
    The arriving messages will still be delivered upwards in different
    threads, something entailing a risk of message disordering during
    the transition phase.

    To fix this, we introduce a synchronization mechanism between the
    two parallel links, so that traffic arriving on the new link cannot
    be added to its input queue until we are guaranteed that all
    pre-establishment messages have been delivered on the old, parallel
    link.

    This problem seems to always have been around, but its occurrence is
    so rare that it has not been noticed until recent intensive testing.

    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • After the recent changes in message importance handling it becomes
    possible to simplify handling of messages and sockets when we
    encounter link congestion.

    We merge the function tipc_link_cong() into link_schedule_user(),
    and simplify the code of the latter. The code should now be
    easier to follow, especially regarding return codes and handling
    of the message that caused the situation.

    In case the scheduling function is unable to pre-allocate a wakeup
    message buffer, it now returns -ENOBUFS, which is a more correct
    code than the previously used -EHOSTUNREACH.

    Reviewed-by: Ying Xue
    Reviewed-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy