19 Aug, 2019

1 commit

  • The policy for handling the skb list locks on the send and receive paths
    is simple.

    - On the send path we never need to grab the lock on the 'xmitq' list
    when the destination is an exernal node.

    - On the receive path we always need to grab the lock on the 'inputq'
    list, irrespective of source node.

    However, when transmitting node local messages those will eventually
    end up on the receive path of a local socket, meaning that the argument
    'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the
    function tipc_sk_rcv(). This has been handled by always initializing
    the spinlock of the 'xmitq' list at message creation, just in case it
    may end up on the receive path later, and despite knowing that the lock
    in most cases never will be used.

    This approach is inaccurate and confusing, and has also concealed the
    fact that the stated 'no lock grabbing' policy for the send path is
    violated in some cases.

    We now clean up this by never initializing the lock at message creation,
    instead doing this at the moment we find that the message actually will
    enter the receive path. At the same time we fix the four locations
    where we incorrectly access the spinlock on the send/error path.

    This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
    is initialised") which has now become redundant.

    CC: Eric Dumazet
    Reported-by: Chris Packham
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Reviewed-by: Xin Long
    Signed-off-by: David S. Miller

    Jon Maloy
     

09 Aug, 2019

1 commit

  • Since node internal messages are passed directly to the socket, it is not
    possible to observe those messages via tcpdump or wireshark.

    We now remedy this by making it possible to clone such messages and send
    the clones to the loopback interface. The clones are dropped at reception
    and have no functional role except making the traffic visible.

    The feature is enabled if network taps are active for the loopback device.
    pcap filtering restrictions require the messages to be presented to the
    receiving side of the loopback device.

    v3 - Function dev_nit_active used to check for network taps.
    - Procedure netif_rx_ni used to send cloned messages to loopback device.

    Signed-off-by: John Rutherford
    Acked-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    John Rutherford
     

26 Jul, 2019

1 commit

  • This commit along with the next one are to resolve the issues with the
    link changeover mechanism. See that commit for details.

    Basically, for the link synching, from now on, we will send only one
    single ("dummy") SYNCH message to peer. The SYNCH message does not
    contain any data, just a header conveying the synch point to the peer.

    A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for
    backward compatible!

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Suggested-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

18 Jul, 2019

1 commit

  • The tipc_msg_validate() function leaves a boolean flag 'validated' in
    the validated buffer's control block, to avoid performing this action
    more than once. However, at reception of new packets, the position of
    this field may already have been set by lower layer protocols, so
    that the packet is erroneously perceived as already validated by TIPC.

    We fix this by initializing the said field to 'false' before performing
    the initial validation.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

26 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • It appears that a FAILOVER_MSG can come from peer even when the failure
    link is resetting (i.e. just after the 'node_write_unlock()'...). This
    means the failover procedure on the node has not been started yet.
    The situation is as follows:

    node1 node2
    linkb linka linka linkb
    | | | |
    | | x failure |
    | | RESETTING |
    | | | |
    | x failure RESET |
    | RESETTING FAILINGOVER |
    | | (FAILOVER_MSG) | |
    ||
    | RESET | | FAILOVER_END
    | FAILINGOVER RESET |
    . . . .
    . . . .
    . . . .

    Once this happens, the link failover procedure will be triggered
    wrongly on the receiving node since the node isn't in FAILINGOVER state
    but then another link failover will be carried out.
    The consequences are:

    1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
    was set, reset and set incorrectly, the criteria to end the failover
    would not be met, it could keep waiting for a message that has already
    received.

    2) The early FAILOVER_MSG(s) could be queued in the link failover
    deferdq but would be purged or not pulled out because the 'drop_point'
    was not set correctly.

    3) The early FAILOVER_MSG(s) could be dropped too.

    4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state
    shortly, but later on it would be restarted.

    The same situation can also happen when the link is in PEER_RESET state
    and a FAILOVER_MSG arrives.

    The commit resolves the issues by forcing the link down immediately, so
    the failover procedure will be started normally (which is the same as
    when receiving a FAILOVER_MSG and the link is in up state).

    Also, the function "tipc_node_link_failover()" is toughen to avoid such
    a situation from happening.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

04 May, 2019

1 commit

  • TIPC link can temporarily fall into "half-establish" that only one of
    the link endpoints is ESTABLISHED and starts to send traffic, PROTOCOL
    messages, whereas the other link endpoint is not up (e.g. immediately
    when the endpoint receives ACTIVATE_MSG, the network interface goes
    down...).

    This is a normal situation and will be settled because the link
    endpoint will be eventually brought down after the link tolerance time.

    However, the situation will become worse when the second link is
    established before the first link endpoint goes down,
    For example:

    1. Both links , down
    2. Link endpoint 2A up, but 1A still down (e.g. due to network
    disturbance, wrong session, etc.)
    3. Link up
    4. Link endpoint 2A down (e.g. due to link tolerance timeout)
    5. Node B starts failover onto link

    ==> Node A does never start link failover.

    When the "half-failover" situation happens, two consequences have been
    observed:

    a) Peer link/node gets stuck in FAILINGOVER state;
    b) Traffic or user messages that peer node is trying to failover onto
    the second link can be partially or completely dropped by this node.

    The consequence a) was actually solved by commit c140eb166d68 ("tipc:
    fix failover problem"), but that commit didn't cover the b). It's due
    to the fact that the tunnel link endpoint has never been prepared for a
    failover, so the 'l->drop_point' (and the other data...) is not set
    correctly. When a TUNNEL_MSG from peer node arrives on the link,
    depending on the inner message's seqno and the current 'l->drop_point'
    value, the message can be dropped (- treated as a duplicate message) or
    processed.
    At this early stage, the traffic messages from peer are likely to be
    NAME_DISTRIBUTORs, this means some name table entries will be missed on
    the node forever!

    The commit resolves the issue by starting the FAILOVER process on this
    node as well. Another benefit from this solution is that we ensure the
    link will not be re-established until the failover ends.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

12 Apr, 2019

1 commit

  • In the function tipc_node_create() we protect the peer capability field
    by using the node rw_lock. However, we access the lock directly instead
    of using the dedicated functions for this, as we do everywhere else in
    node.c. This cosmetic spot is fixed here.

    Fixes: 40999f11ce67 ("tipc: make link capability update thread safe")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

28 Mar, 2019

1 commit


24 Mar, 2019

1 commit

  • When checking the code with clang -Wsometimes-uninitialized we get the
    following warning:

    if (!tipc_link_is_establishing(l)) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    net/tipc/node.c:847:46: note: uninitialized use occurs here
    tipc_bearer_xmit(n->net, bearer_id, &xmitq, maddr);

    net/tipc/node.c:831:2: note: remove the 'if' if its condition is always
    true
    if (!tipc_link_is_establishing(l)) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    net/tipc/node.c:821:31: note: initialize the variable 'maddr' to silence
    this warning
    struct tipc_media_addr *maddr;

    We fix this by initializing 'maddr' to NULL. For the matter of clarity,
    we also test if 'xmitq' is non-empty before we use it and 'maddr'
    further down in the function. It will never happen that 'xmitq' is non-
    empty at the same time as 'maddr' is NULL, so this is a sufficient test.

    Fixes: 598411d70f85 ("tipc: make resetting of links non-atomic")
    Reported-by: Nathan Chancellor
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

20 Mar, 2019

1 commit

  • As a preparation for introducing a smooth switching between replicast
    and broadcast method for multicast message, We have to introduce a new
    capability flag TIPC_MCAST_RBCTL to handle this new feature.

    During a cluster upgrade a node can come back with this new capabilities
    which also must be reflected in the cluster capabilities field.
    The new feature is only applicable if all node in the cluster supports
    this new capability.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

12 Feb, 2019

1 commit

  • When a link endpoint is re-created (e.g. after a node reboot or
    interface reset), the link session number is varied by random, the peer
    endpoint will be synced with this new session number before the link is
    re-established.

    However, there is a shortcoming in this mechanism that can lead to the
    link never re-established or faced with a failure then. It happens when
    the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
    well as the 'in_session' flag have been set, but suddenly this link
    endpoint leaves. When it comes back with a random session number, there
    are two situations possible:

    1/ If the random session number is larger than (or equal to) the
    previous one, the peer endpoint will be updated with this new session
    upon receipt of a RESET_MSG from this endpoint, and the link can be re-
    established as normal. Otherwise, all the RESET_MSGs from this endpoint
    will be rejected by the peer. In turn, when this link endpoint receives
    one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
    to send STATE_MSGs, but again these messages will be dropped by the
    peer due to wrong session.
    The peer link endpoint can still become ESTABLISHED after receiving a
    traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
    NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
    will be forced down sooner or later!

    Even in case the random session number is larger than the previous one,
    it can be that the ACTIVATE_MSG from the peer arrives first, and this
    link endpoint moves quickly to ESTABLISHED without sending out any
    RESET_MSG yet. Consequently, the peer link will not be updated with the
    new session number, and the same link failure scenario as above will
    happen.

    2/ Another situation can be that, the peer link endpoint was reset due
    to any reasons in the meantime, its link state was set to RESET from
    ESTABLISHING but still in session, i.e. the 'in_session' flag is not
    reset...
    Now, if the random session number from this endpoint is less than the
    previous one, all the RESET_MSGs from this endpoint will be rejected by
    the peer. In the other direction, when this link endpoint receives a
    RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
    ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
    As a result, the link cannot be re-established but gets stuck with this
    link endpoint in state ESTABLISHING and the peer in RESET!

    Solution:

    ===========

    This link endpoint should not go directly to ESTABLISHED when getting
    ACTIVATE_MSG from the peer which may belong to the old session if the
    link was re-created. To ensure the session to be correct before the
    link is re-established, the peer endpoint in ESTABLISHING state will
    send back the last session number in ACTIVATE_MSG for a verification at
    this endpoint. Then, if needed, a new and more appropriate session
    number will be regenerated to force a re-synch first.

    In addition, when a link in ESTABLISHING state is reset, its state will
    move to RESET according to the link FSM, along with resetting the
    'in_session' flag (and the other data) as a normal link reset, it will
    also be deleted if requested.

    The solution is backward compatible.

    Acked-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

20 Dec, 2018

3 commits

  • The commit adds the new trace_events for TIPC node object:

    trace_tipc_node_create()
    trace_tipc_node_delete()
    trace_tipc_node_lost_contact()
    trace_tipc_node_timeout()
    trace_tipc_node_link_up()
    trace_tipc_node_link_down()
    trace_tipc_node_reset_links()
    trace_tipc_node_fsm_evt()
    trace_tipc_node_check_state()

    Also, enables the traces for the following cases:
    - When a node is created/deleted;
    - When a node contact is lost;
    - When a node timer is timed out;
    - When a node link is up/down;
    - When all node links are reset;
    - When node state is changed;
    - When a skb comes and node state needs to be checked/updated.

    Acked-by: Ying Xue
    Tested-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • The commit adds the new trace_events for TIPC link object:

    trace_tipc_link_timeout()
    trace_tipc_link_fsm()
    trace_tipc_link_reset()
    trace_tipc_link_too_silent()
    trace_tipc_link_retrans()
    trace_tipc_link_bc_ack()
    trace_tipc_link_conges()

    And the traces for PROTOCOL messages at building and receiving:

    trace_tipc_proto_build()
    trace_tipc_proto_rcv()

    Note:
    a) The 'tipc_link_too_silent' event will only happen when the
    'silent_intv_cnt' is about to reach the 'abort_limit' value (and the
    event is enabled). The benefit for this kind of event is that we can
    get an early indication about TIPC link loss issue due to timeout, then
    can do some necessary actions for troubleshooting.

    For example: To trigger the 'tipc_proto_rcv' when the 'too_silent'
    event occurs:

    echo 'enable_event:tipc:tipc_proto_rcv' > \
    events/tipc/tipc_link_too_silent/trigger

    And disable it when TIPC link is reset:

    echo 'disable_event:tipc:tipc_proto_rcv' > \
    events/tipc/tipc_link_reset/trigger

    b) The 'tipc_link_retrans' or 'tipc_link_bc_ack' event is useful to
    trace TIPC retransmission issues.

    In addition, the commit adds the 'trace_tipc_list/link_dump()' at the
    'retransmission failure' case. Then, if the issue occurs, the link
    'transmq' along with the link data can be dumped for post-analysis.
    These dump events should be enabled by default since it will only take
    effect when the failure happens.

    The same approach is also applied for the faulty case that the
    validation of protocol message is failed.

    Acked-by: Ying Xue
    Tested-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • As for the sake of debugging/tracing, the commit enables tracepoints in
    TIPC along with some general trace_events as shown below. It also
    defines some 'tipc_*_dump()' functions that allow to dump TIPC object
    data whenever needed, that is, for general debug purposes, ie. not just
    for the trace_events.

    The following trace_events are now available:

    - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
    e.g. message type, user, droppable, skb truesize, cloned skb, etc.

    - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
    queues, e.g. TIPC link transmq, socket receive queue, etc.

    - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
    sk state, sk type, connection type, rmem_alloc, socket queues, etc.

    - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
    link state, silent_intv_cnt, gap, bc_gap, link queues, etc.

    - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
    node state, active links, capabilities, link entries, etc.

    How to use:
    Put the trace functions at any places where we want to dump TIPC data
    or events.

    Note:
    a) The dump functions will generate raw data only, that is, to offload
    the trace event's processing, it can require a tool or script to parse
    the data but this should be simple.

    b) The trace_tipc_*_dump() should be reserved for a failure cases only
    (e.g. the retransmission failure case) or where we do not expect to
    happen too often, then we can consider enabling these events by default
    since they will almost not take any effects under normal conditions,
    but once the rare condition or failure occurs, we get the dumped data
    fully for post-analysis.

    For other trace purposes, we can reuse these trace classes as template
    but different events.

    c) A trace_event is only effective when we enable it. To enable the
    TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
    directory in the 'debugfs' file system. Normally, they are located at:

    /sys/kernel/debug/tracing/events/tipc/

    For example:

    To enable the tipc_link_dump event:

    echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable

    To enable all the TIPC trace_events:

    echo 1 > /sys/kernel/debug/tracing/events/tipc/enable

    To collect the trace data:

    cat trace

    or

    cat trace_pipe > /trace.out &

    To disable all the TIPC trace_events:

    echo 0 > /sys/kernel/debug/tracing/events/tipc/enable

    To clear the trace buffer:

    echo > trace

    d) Like the other trace_events, the feature like 'filter' or 'trigger'
    is also usable for the tipc trace_events.
    For more details, have a look at:

    Documentation/trace/ftrace.txt

    MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc

    Acked-by: Ying Xue
    Tested-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

19 Dec, 2018

1 commit

  • NAME_DISTRIBUTOR messages are transmitted through unicast link on TIPC
    2.0, by contrast, the messages are delivered through broadcast link on
    TIPC 1.7. But at present, NAME_DISTRIBUTOR messages received by
    broadcast link cannot be handled in tipc_rcv() until an unicast message
    arrives, which may lead to a significant delay to update name table.

    To avoid this delay, we will also deal with broadcast NAME_DISTRIBUTOR
    message on broadcast receive path.

    Signed-off-by: Zhenbo Gao
    Reviewed-by: Ying Xue
    Signed-off-by: David S. Miller

    Zhenbo Gao
     

06 Dec, 2018

1 commit

  • When setting LINK tolerance, node timer interval will be calculated
    base on the LINK with lowest tolerance.

    But when calculated, the old node timer interval only updated if current
    setting value (tolerance/4) less than old ones regardless of number of
    links as well as links' lowest tolerance value.

    This caused to two cases missing if tolerance changed as following:
    Case 1:
    1.1/ There is one link (L1) available in the system
    1.2/ Set L1's tolerance from 1500ms => lower (i.e 500ms)
    1.3/ Then, fallback to default (1500ms) or higher (i.e 2000ms)

    Expected:
    node timer interval is 1500/4=375ms after 1.3

    Result:
    node timer interval will not being updated after changing tolerance at 1.3
    since its value 1500/4=375ms is not less than 500/4=125ms at 1.2.

    Case 2:
    2.1/ There are two links (L1, L2) available in the system
    2.2/ L1 and L2 tolerance value are 2000ms as initial
    2.3/ Set L2's tolerance from 2000ms => lower 1500ms
    2.4/ Disable link L2 (bring down its bearer)

    Expected:
    node timer interval is 2000ms/4=500ms after 2.4

    Result:
    node timer interval will not being updated after disabling L2 since
    its value 2000ms/4=500ms is still not less than 1500/4=375ms at 2.3
    although L2 is already not available in the system.

    To fix this, we start the node interval calculation by initializing it to
    a value larger than any conceivable calculated value. This way, the link
    with the lowest tolerance will always determine the calculated value.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

28 Nov, 2018

1 commit

  • We see the following lockdep warning:

    [ 2284.078521] ======================================================
    [ 2284.078604] WARNING: possible circular locking dependency detected
    [ 2284.078604] 4.19.0+ #42 Tainted: G E
    [ 2284.078604] ------------------------------------------------------
    [ 2284.078604] rmmod/254 is trying to acquire lock:
    [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
    [ 2284.078604]
    [ 2284.078604] but task is already holding lock:
    [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
    [ 2284.078604]
    [ 2284.078604] which lock already depends on the new lock.
    [ 2284.078604]
    [ 2284.078604]
    [ 2284.078604] the existing dependency chain (in reverse order) is:
    [ 2284.078604]
    [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
    [ 2284.078604] tipc_node_timeout+0x20a/0x330 [tipc]
    [ 2284.078604] call_timer_fn+0xa1/0x280
    [ 2284.078604] run_timer_softirq+0x1f2/0x4d0
    [ 2284.078604] __do_softirq+0xfc/0x413
    [ 2284.078604] irq_exit+0xb5/0xc0
    [ 2284.078604] smp_apic_timer_interrupt+0xac/0x210
    [ 2284.078604] apic_timer_interrupt+0xf/0x20
    [ 2284.078604] default_idle+0x1c/0x140
    [ 2284.078604] do_idle+0x1bc/0x280
    [ 2284.078604] cpu_startup_entry+0x19/0x20
    [ 2284.078604] start_secondary+0x187/0x1c0
    [ 2284.078604] secondary_startup_64+0xa4/0xb0
    [ 2284.078604]
    [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
    [ 2284.078604] del_timer_sync+0x34/0xa0
    [ 2284.078604] tipc_node_delete+0x1a/0x40 [tipc]
    [ 2284.078604] tipc_node_stop+0xcb/0x190 [tipc]
    [ 2284.078604] tipc_net_stop+0x154/0x170 [tipc]
    [ 2284.078604] tipc_exit_net+0x16/0x30 [tipc]
    [ 2284.078604] ops_exit_list.isra.8+0x36/0x70
    [ 2284.078604] unregister_pernet_operations+0x87/0xd0
    [ 2284.078604] unregister_pernet_subsys+0x1d/0x30
    [ 2284.078604] tipc_exit+0x11/0x6f2 [tipc]
    [ 2284.078604] __x64_sys_delete_module+0x1df/0x240
    [ 2284.078604] do_syscall_64+0x66/0x460
    [ 2284.078604] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 2284.078604]
    [ 2284.078604] other info that might help us debug this:
    [ 2284.078604]
    [ 2284.078604] Possible unsafe locking scenario:
    [ 2284.078604]
    [ 2284.078604] CPU0 CPU1
    [ 2284.078604] ---- ----
    [ 2284.078604] lock(&(&tn->node_list_lock)->rlock);
    [ 2284.078604] lock((&n->timer)#2);
    [ 2284.078604] lock(&(&tn->node_list_lock)->rlock);
    [ 2284.078604] lock((&n->timer)#2);
    [ 2284.078604]
    [ 2284.078604] *** DEADLOCK ***
    [ 2284.078604]
    [ 2284.078604] 3 locks held by rmmod/254:
    [ 2284.078604] #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
    [ 2284.078604] #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
    [ 2284.078604] #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
    [...}

    The reason is that the node timer handler sometimes needs to delete a
    node which has been disconnected for too long. To do this, it grabs
    the lock 'node_list_lock', which may at the same time be held by the
    generic node cleanup function, tipc_node_stop(), during module removal.
    Since the latter is calling del_timer_sync() inside the same lock, we
    have a potential deadlock.

    We fix this letting the timer cleanup function use spin_trylock()
    instead of just spin_lock(), and when it fails to grab the lock it
    just returns so that the timer handler can terminate its execution.
    This is safe to do, since tipc_node_stop() anyway is about to
    delete both the timer and the node instance.

    Fixes: 6a939f365bdb ("tipc: Auto removal of peer down node instance")
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

02 Oct, 2018

1 commit

  • The initial session number when a link is created is based on a random
    value, taken from struct tipc_net->random. It is then incremented for
    each link reset to avoid mixing protocol messages from different link
    sessions.

    However, when a bearer is reset all its links are deleted, and will
    later be re-created using the same random value as the first time.
    This means that if the link never went down between creation and
    deletion we will still sometimes have two subsequent sessions with
    the same session number. In virtual environments with potentially
    long transmission times this has turned out to be a real problem.

    We now fix this by randomizing the session number each time a link
    is created.

    With a session number size of 16 bits this gives a risk of session
    collision of 1/64k. To reduce this further, we also introduce a sanity
    check on the very first STATE message arriving at a link. If this has
    an acknowledge value differing from 0, which is logically impossible,
    we ignore the message. The final risk for session collision is hence
    reduced to 1/4G, which should be sufficient.

    Signed-off-by: LUU Duc Canh
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    LUU Duc Canh
     

30 Sep, 2018

1 commit

  • We see the following scenario:
    1) Link endpoint B on node 1 discovers that its peer endpoint is gone.
    Since there is a second working link, failover procedure is started.
    2) Link endpoint A on node 1 sends a FAILOVER message to peer endpoint
    A on node 2. The node item 1->2 goes to state FAILINGOVER.
    3) Linke endpoint A/2 receives the failover, and is supposed to take
    down its parallell link endpoint B/2, while producing a FAILOVER
    message to send back to A/1.
    4) However, B/2 has already been deleted, so no FAILOVER message can
    created.
    5) Node 1->2 remains in state FAILINGOVER forever, refusing to receive
    any messages that can bring B/1 up again. We are left with a non-
    redundant link between node 1 and 2.

    We fix this with letting endpoint A/2 build a dummy FAILOVER message
    to send to back to A/1, so that the situation can be resolved.

    Signed-off-by: LUU Duc Canh
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    LUU Duc Canh
     

21 Jul, 2018

2 commits

  • All conflicts were trivial overlapping changes, so reasonably
    easy to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The commit referred to below introduced an update of the link
    capabilities field that is not safe. Given the recently added
    feature to remove idle node and link items after 5 minutes, there
    is a small risk that the update will happen at the very moment the
    targeted link is being removed. To avoid this we have to perform
    the update inside the node item's write lock protection.

    Fixes: 9012de508956 ("tipc: add sequence number check for link STATE messages")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

12 Jul, 2018

2 commits

  • In some virtual environments we observe a significant higher number of
    packet reordering and delays than we have been used to traditionally.

    This makes it necessary with stricter checks on incoming link protocol
    messages' session number, which until now only has been validated for
    RESET messages.

    Since the other two message types, ACTIVATE and STATE messages also
    carry this number, it is easy to extend the validation check to those
    messages.

    We also introduce a flag indicating if a link has a valid peer session
    number or not. This eliminates the mixing of 32- and 16-bit arithmethics
    we are currently using to achieve this.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     
  • Some switch infrastructures produce huge amounts of packet duplicates.
    This becomes a problem if those messages are STATE/NACK protocol
    messages, causing unnecessary retransmissions of already accepted
    packets.

    We now introduce a unique sequence number per STATE protocol message
    so that duplicates can be identified and ignored. This will also be
    useful when tracing such cases, and to avert replay attacks when TIPC
    is encrypted.

    For compatibility reasons we have to introduce a new capability flag
    TIPC_LINK_PROTO_SEQNO to handle this new feature.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

07 Jul, 2018

1 commit

  • The function for checking if there is an node address conflict is
    supposed to return a suggestion for a new address if it finds a
    conflict, and zero otherwise. But in case the peer being checked
    is previously unknown it does instead return a "suggestion" for
    the checked address itself. This results in a DSC_TRIAL_FAIL_MSG
    being sent unecessarily to the peer, and sometimes makes the trial
    period starting over again.

    Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

30 Jun, 2018

2 commits

  • A peer node is considered down if there are no
    active links (or) lost contact to the node. In current implementation,
    a peer node instance is deleted either if

    a) TIPC module is removed (or)
    b) Application can use a netlink/iproute2 interface to delete a
    specific down node.

    Thus, a down node instance lives in the system forever, unless the
    application explicitly removes it.

    We fix this by deleting the nodes which are down for
    a specified amount of time (5 minutes).
    Existing node supervision timer is used to achieve this.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: GhantaKrishnamurthy MohanKrishna
    Signed-off-by: David S. Miller

    GhantaKrishnamurthy MohanKrishna
     
  • In single-link usage, the function tipc_node_timeout() still iterates
    over the whole link array to handle each link. Given that the maximum
    number of bearers are 3, there are 2 redundant iterations with lock
    grab/release. Since this function is executing very frequently it makes
    sense to optimize it.

    This commit adds conditional checking to exit from the loop if the
    known number of configured links has already been accessed.

    Acked-by: Ying Xue
    Signed-off-by: Tung Nguyen
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     

12 May, 2018

1 commit

  • The bpf syscall and selftests conflicts were trivial
    overlapping changes.

    The r8169 change involved moving the added mdelay from 'net' into a
    different function.

    A TLS close bug fix overlapped with the splitting of the TLS state
    into separate TX and RX parts. I just expanded the tests in the bug
    fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
    == X".

    Signed-off-by: David S. Miller

    David S. Miller
     

10 May, 2018

1 commit

  • When we get link properties through netlink interface with
    tipc_nl_node_get_link(), we don't validate TIPC_NLA_LINK_NAME
    attribute at all, instead we directly use it. As a consequence,
    KMSAN detected the TIPC_NLA_LINK_NAME attribute was an uninitialized
    value, and then posted the following complaint:

    ==================================================================
    BUG: KMSAN: uninit-value in strcmp+0xf7/0x160 lib/string.c:329
    CPU: 1 PID: 4527 Comm: syz-executor655 Not tainted 4.16.0+ #87
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
    strcmp+0xf7/0x160 lib/string.c:329
    tipc_nl_node_get_link+0x220/0x6f0 net/tipc/node.c:1881
    genl_family_rcv_msg net/netlink/genetlink.c:599 [inline]
    genl_rcv_msg+0x1686/0x1810 net/netlink/genetlink.c:624
    netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2447
    genl_rcv+0x63/0x80 net/netlink/genetlink.c:635
    netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
    netlink_unicast+0x166b/0x1740 net/netlink/af_netlink.c:1337
    netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x445589
    RSP: 002b:00007fb7ee66cdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 0000000000445589
    RDX: 0000000000000000 RSI: 0000000020023000 RDI: 0000000000000003
    RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007fffa2bf3f3f R14: 00007fb7ee66d9c0 R15: 0000000000000001

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
    slab_post_alloc_hook mm/slab.h:445 [inline]
    slab_alloc_node mm/slub.c:2737 [inline]
    __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
    __kmalloc_reserve net/core/skbuff.c:138 [inline]
    __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
    alloc_skb include/linux/skbuff.h:984 [inline]
    netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
    netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    ==================================================================

    To quiet the complaint, TIPC_NLA_LINK_NAME attribute has been
    validated in tipc_nl_node_get_link() before it's used.

    Reported-by: syzbot+df0257c92ffd4fcc58cd@syzkaller.appspotmail.com
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

04 May, 2018

1 commit


27 Apr, 2018

2 commits

  • After the introduction of a 128-bit node identity it may be difficult
    for a user to correlate between this identity and the generated node
    hash address.

    We now try to make this easier by introducing a new ioctl() call for
    fetching a node identity by using the hash value as key. This will
    be particularly useful when we extend some of the commands in the
    'tipc' tool, but we also expect regular user applications to need
    this feature.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     
  • Commit 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor
    summary") intended to fix a problem with user tool looping when max
    number of bearers are enabled.

    Unfortunately, the wrong version of the commit was posted, so the
    problem was not solved at all.

    This commit adds the missing part.

    Fixes: 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

22 Apr, 2018

1 commit


20 Apr, 2018

1 commit

  • Currently, we have option to configure MTU of UDP media. The configured
    MTU takes effect on the links going up after that moment. I.e, a user
    has to reset bearer to have new value applied across its links. This is
    confusing and disturbing on a running cluster.

    We now introduce the functionality to change the default UDP bearer MTU
    in struct tipc_bearer. Additionally, the links are updated dynamically,
    without any need for a reset, when bearer value is changed. We leverage
    the existing per-link functionality and the design being symetrical to
    the confguration of link tolerance.

    Acked-by: Jon Maloy
    Signed-off-by: GhantaKrishnamurthy MohanKrishna
    Signed-off-by: David S. Miller

    GhantaKrishnamurthy MohanKrishna
     

19 Apr, 2018

1 commit

  • When configuring the number of used bearers to MAX_BEARER and issuing
    command "tipc link monitor summary", the command enters infinite loop
    in user space.

    This issue happens because function tipc_nl_node_dump_monitor() returns
    the wrong 'prev_bearer' value when all potential monitors have been
    scanned.

    The correct behavior is to always try to scan all monitors until either
    the netlink message is full, in which case we return the bearer identity
    of the affected monitor, or we continue through the whole bearer array
    until we can return MAX_BEARERS. This solution also caters for the case
    where there may be gaps in the bearer array.

    Signed-off-by: Tung Nguyen
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     

01 Apr, 2018

2 commits

  • With the new RB tree structure for service ranges it becomes possible to
    solve an old problem; - we can now allow overlapping service ranges in
    the table.

    When inserting a new service range to the tree, we use 'lower' as primary
    key, and when necessary 'upper' as secondary key.

    Since there may now be multiple service ranges matching an indicated
    'lower' value, we must also add the 'upper' value to the functions
    used for removing publications, so that the correct, corresponding
    range item can be found.

    These changes guarantee that a well-formed publication/withdrawal item
    from a peer node never will be rejected, and make it possible to
    eliminate the problematic backlog functionality we currently have for
    handling such cases.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     
  • The current design of the binding table has an unnecessary memory
    consuming and complex data structure. It aggregates the service range
    items into an array, which is expanded by a factor two every time it
    becomes too small to hold a new item. Furthermore, the arrays never
    shrink when the number of ranges diminishes.

    We now replace this array with an RB tree that is holding the range
    items as tree nodes, each range directly holding a list of bindings.

    This, along with a few name changes, improves both readability and
    volume of the code, as well as reducing memory consumption and hopefully
    improving cache hit rate.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

27 Mar, 2018

1 commit