03 Dec, 2016

1 commit

  • Qian Zhang (张谦) reported a potential socket buffer overflow in
    tipc_msg_build() which is also known as CVE-2016-8632: due to
    insufficient checks, a buffer overflow can occur if MTU is too short for
    even tipc headers. As anyone can set device MTU in a user/net namespace,
    this issue can be abused by a regular user.

    As agreed in the discussion on Ben Hutchings' original patch, we should
    check the MTU at the moment a bearer is attached rather than for each
    processed packet. We also need to repeat the check when bearer MTU is
    adjusted to new device MTU. UDP case also needs a check to avoid
    overflow when calculating bearer MTU.

    Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
    Signed-off-by: Michal Kubecek
    Reported-by: Qian Zhang (张谦)
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Michal Kubeček
     

28 Nov, 2016

1 commit

  • In commit e4bf4f76962b ("tipc: simplify packet sequence number
    handling") we changed the internal representation of the packet
    sequence number counters from u32 to u16, reflecting what is really
    sent over the wire.

    Since then some link statistics counters have been displaying incorrect
    values, partially because the counters meant to be used as sequence
    number snapshots are now used as direct counters, stored as u32, and
    partially because some counter updates are just missing in the code.

    In this commit we correct this in two ways. First, we base the
    displayed packet sent/received values on direct counters instead
    of as previously a calculated difference between current sequence
    number and a snapshot. Second, we add the missing updates of the
    counters.

    This change is compatible with the current netlink API, and requires
    no changes to the user space tools.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

26 Nov, 2016

3 commits

  • In commit 10724cc7bb78 ("tipc: redesign connection-level flow control")
    we replaced the previous message based flow control with one based on
    1k blocks. In order to ensure backwards compatibility the mechanism
    falls back to using message as base unit when it senses that the peer
    doesn't support the new algorithm. The default flow control window,
    i.e., how many units can be sent before the sender blocks and waits
    for an acknowledge (aka advertisement) is 512. This was tested against
    the previous version, which uses an acknowledge frequency of on ack per
    256 received message, and found to work fine.

    However, we missed the fact that versions older than Linux 3.15 use an
    acknowledge frequency of 512, which is exactly the limit where a 4.6+
    sender will stop and wait for acknowledge. This would also work fine if
    it weren't for the fact that if the first sent message on a 4.6+ server
    side is an empty SYNACK, this one is also is counted as a sent message,
    while it is not counted as a received message on a legacy 3.15-receiver.
    This leads to the sender always being one step ahead of the receiver, a
    scenario causing the sender to block after 512 sent messages, while the
    receiver only has registered 511 read messages. Hence, the legacy
    receiver is not trigged to send an acknowledge, with a permanently
    blocked sender as result.

    We solve this deadlock by simply allowing the sender to send one more
    message before it blocks, i.e., by a making minimal change to the
    condition used for determining connection congestion.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • In commit 35c55c9877f8 ("tipc: add neighbor monitoring framework") we
    added a data area to the link monitor STATE messages under the
    assumption that previous versions did not use any such data area.

    For versions older than Linux 4.3 this assumption is not correct. In
    those version, all STATE messages sent out from a node inadvertently
    contain a 16 byte data area containing a string; -a leftover from
    previous RESET messages which were using this during the setup phase.
    This string serves no purpose in STATE messages, and should no be there.

    Unfortunately, this data area is delivered to the link monitor
    framework, where a sanity check catches that it is not a correct domain
    record, and drops it. It also issues a rate limited warning about the
    event.

    Since such events occur much more frequently than anticipated, we now
    choose to remove the warning in order to not fill the kernel log with
    useless contents. We also make the sanity check stricter, to further
    reduce the risk that such data is inavertently admitted.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • commit 817298102b0b ("tipc: fix link priority propagation") introduced a
    compatibility problem between TIPC versions newer than Linux 4.6 and
    those older than Linux 4.4. In versions later than 4.4, link STATE
    messages only contain a non-zero link priority value when the sender
    wants the receiver to change its priority. This has the effect that the
    receiver resets itself in order to apply the new priority. This works
    well, and is consistent with the said commit.

    However, in versions older than 4.4 a valid link priority is present in
    all sent link STATE messages, leading to cyclic link establishment and
    reset on the 4.6+ node.

    We fix this by adding a test that the received value should not only
    be valid, but also differ from the current value in order to cause the
    receiving link endpoint to reset.

    Reported-by: Amar Nv
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

20 Nov, 2016

1 commit


30 Oct, 2016

1 commit

  • In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
    criteria") we tried to fix a problem with the initial synchronization
    of broadcast link acknowledge values. Unfortunately that solution is
    not sufficient to solve the issue.

    We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
    non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
    initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
    broadcast acknowledge numbers, leading to premature opening of the
    broadcast link. When the bypassed packets finally arrive, they are
    inadvertently accepted, and the already correctly initialized
    acknowledge number in the broadcast receive link is overwritten by
    the invalid (zero) value of the said packets. After this the broadcast
    link goes stale.

    We now fix this by marking the packets where we know the acknowledge
    value is or may be invalid, and then ignoring the acks from those.

    To this purpose, we claim an unused bit in the header to indicate that
    the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
    synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
    packets, plus those LINK_PROTOCOL packets sent out before the broadcast
    links are fully synchronized.

    This minor protocol update is fully backwards compatible.

    Reported-by: John Thompson
    Tested-by: John Thompson
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

14 Oct, 2016

1 commit


13 Sep, 2016

2 commits


03 Sep, 2016

3 commits

  • Because of the risk of an excessive number of NACK messages and
    retransissions, receivers have until now abstained from sending
    broadcast NACKS directly upon detection of a packet sequence number
    gap. We have instead relied on such gaps being detected by link
    protocol STATE message exchange, something that by necessity delays
    such detection and subsequent retransmissions.

    With the introduction of unicast NACK transmission and rate control
    of retransmissions we can now remove this limitation. We now allow
    receiving nodes to send NACKS immediately, while coordinating the
    permission to do so among the nodes in order to avoid NACK storms.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As cluster sizes grow, so does the amount of identical or overlapping
    broadcast NACKs generated by the packet receivers. This often leads to
    'NACK crunches' resulting in huge numbers of redundant retransmissions
    of the same packet ranges.

    In this commit, we introduce rate control of broadcast retransmissions,
    so that a retransmitted range cannot be retransmitted again until after
    at least 10 ms. This reduces the frequency of duplicate, redundant
    retransmissions by an order of magnitude, while having a significant
    positive impact on overall throughput and scalability.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • When we send broadcasts in clusters of more 70-80 nodes, we sometimes
    see the broadcast link resetting because of an excessive number of
    retransmissions. This is caused by a combination of two factors:

    1) A 'NACK crunch", where loss of broadcast packets is discovered
    and NACK'ed by several nodes simultaneously, leading to multiple
    redundant broadcast retransmissions.

    2) The fact that the NACKS as such also are sent as broadcast, leading
    to excessive load and packet loss on the transmitting switch/bridge.

    This commit deals with the latter problem, by moving sending of
    broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
    to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
    bits in word 8 of the said message for this purpose, and introduce a
    new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
    backwards compatible.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

02 Sep, 2016

1 commit

  • In a dual bearer configuration, if the second tipc link becomes
    active while the first link still has pending nametable "bulk"
    updates, it randomly leads to reset of the second link.

    When a link is established, the function named_distribute(),
    fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
    with NAME_DISTRIBUTOR message for each PUBLICATION.
    However, the function named_distribute() allocates the buffer by
    increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
    This consumes the space allocated for TUNNEL_PROTOCOL.

    When establishing the second link, the link shall tunnel all the
    messages in the first link queue including the "bulk" update.
    As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
    the link mtu the transmission fails (-EMSGSIZE).

    Thus, the synch point based on the message count of the tunnel
    packets is never reached leading to link timeout.

    In this commit, we adjust the size of name distributor message so that
    they can be tunnelled.

    Reviewed-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

30 Aug, 2016

1 commit


27 Aug, 2016

7 commits

  • When using replicast a UDP bearer can have an arbitrary amount of
    remote ip addresses associated with it. This means we cannot simply
    add all remote ip addresses to an existing bearer data message as it
    might fill the message, leaving us with a truncated message that we
    can't safely resume. To handle this we introduce the new netlink
    command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be
    called when the bearer data message has the
    TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than
    one remote ip (replicast).

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Add UDP bearer options to netlink bearer get message. This is used by
    the tipc user space tool to display UDP options.

    The UDP bearer information is passed using either a sockaddr_in or
    sockaddr_in6 structs. This means the user space receiver should
    intermediately store the retrieved data in a large enough struct
    (sockaddr_strage) before casting to the proper IP version type.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Automatically learn UDP remote IP addresses of communicating peers by
    looking at the source IP address of incoming TIPC link configuration
    messages (neighbor discovery).

    This makes configuration slightly easier and removes the problematic
    scenario where a node receives directly addressed neighbor discovery
    messages sent using replicast which the node cannot "reply" to using
    mutlicast, leaving the link FSM in a limbo state.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • This patch introduces UDP replicast. A concept where we emulate
    multicast by sending multiple unicast messages to configured peers.

    The purpose of replicast is mainly to be able to use TIPC in cloud
    environments where IP multicast is disabled. Using replicas to unicast
    multicast messages is costly as we have to copy each skb and send the
    copies individually.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Add a function to check if a tipc UDP media address is a multicast
    address or not. This is a purely cosmetic change.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Split the UDP send function into two. One callback that prepares the
    skb and one transmit function that sends the skb. This will come in
    handy in later patches, when we introduce UDP replicast.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • Split the UDP netlink parse function so that it only parses one
    netlink attribute at the time. This makes the parse function more
    generic and allow future UDP API functions to use it for parsing.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     

26 Aug, 2016

1 commit


24 Aug, 2016

1 commit


19 Aug, 2016

3 commits

  • Add TIPC_NL_PEER_REMOVE netlink command. This command can remove
    an offline peer node from the internal data structures.

    This will be supported by the tipc user space tool in iproute2.

    Signed-off-by: Richard Alpe
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • When a link is attempted woken up after congestion, it uses a different,
    more generous criteria than when it was originally declared congested.
    This has the effect that the link, and the sending process, sometimes
    will be woken up unnecessarily, just to immediately return to congestion
    when it turns out there is not not enough space in its send queue to
    host the pending message. This is a waste of CPU cycles.

    We now change the function link_prepare_wakeup() to use exactly the same
    criteria as tipc_link_xmit(). However, since we are now excluding the
    window limit from the wakeup calculation, and the current backlog limit
    for the lowest level is too small to house even a single maximum-size
    message, we have to expand this limit. We do this by evaluating an
    alternative, minimum value during the setting of the importance limits.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • In commit 5b7066c3dd24 ("tipc: stricter filtering of packets in bearer
    layer") we introduced a method of filtering out messages while a bearer
    is being reset, to avoid that links may be re-created and come back in
    working state while we are still in the process of shutting them down.

    This solution works well, but is limited to only work with L2 media, which
    is insufficient with the increasing use of UDP as carrier media.

    We now replace this solution with a more generic one, by introducing a
    new flag "up" in the generic struct tipc_bearer. This field will be set
    and reset at the same locations as with the previous solution, while
    the packet filtering is moved to the generic code for the sending side.
    On the receiving side, the filtering is still done in media specific
    code, but now including the UDP bearer.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

16 Aug, 2016

1 commit

  • tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
    call tipc_node_xmit_skb() on it.

    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
    RIP: 0010:[] [] tipc_node_xmit_skb+0x6b/0x140
    RSP: 0018:ffff8800595bfce8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
    RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
    RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
    R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
    R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
    FS: 00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
    DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
    Stack:
    0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
    ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
    0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
    Call Trace:
    [] tipc_shutdown+0x553/0x880
    [] SyS_shutdown+0x14b/0x170
    [] do_syscall_64+0x19c/0x410
    [] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
    RIP [] tipc_node_xmit_skb+0x6b/0x140
    RSP
    ---[ end trace 57b0484e351e71f1 ]---

    I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
    userspace is equipped to handle that. Anyway, this is better than a GPF
    and looks somewhat consistent with other tipc_msg_create() callers.

    Signed-off-by: Vegard Nossum
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Vegard Nossum
     

11 Aug, 2016

1 commit

  • In commit cf6f7e1d5109 ("tipc: dump monitor attributes"),
    I dereferenced a pointer before checking if its valid.
    This is reported by static check Smatch as:
    net/tipc/monitor.c:733 tipc_nl_add_monitor_peer()
    warn: variable dereferenced before check 'mon' (see line 731)

    In this commit, we check for a valid monitor before proceeding
    with any other operation.

    Fixes: cf6f7e1d5109 ("tipc: dump monitor attributes")
    Reported-by: Dan Carpenter
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

31 Jul, 2016

1 commit

  • In the error handling case of nla_nest_start() failed read_unlock_bh()
    is called to unlock a lock that had not been taken yet. sparse warns
    about the context imbalance as the following:

    net/tipc/monitor.c:799:23: warning:
    context imbalance in '__tipc_nl_add_monitor' - different lock contexts for basic block

    Fixes: cf6f7e1d5109 ('tipc: dump monitor attributes')
    Signed-off-by: Wei Yongjun
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Wei Yongjun
     

27 Jul, 2016

5 commits


24 Jul, 2016

1 commit


12 Jul, 2016

3 commits

  • In test situations with many nodes and a heavily stressed system we have
    observed that the transmission broadcast link may fail due to an
    excessive number of retransmissions of the same packet. In such
    situations we need to reset all unicast links to all peers, in order to
    reset and re-synchronize the broadcast link.

    In this commit, we add a new function tipc_bearer_reset_all() to be used
    in such situations. The function scans across all bearers and resets all
    their pertaining links.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • After a new receiver peer has been added to the broadcast transmission
    link, we allow immediate transmission of new broadcast packets, trusting
    that the new peer will not accept the packets until it has received the
    previously sent unicast broadcast initialiation message. In the same
    way, the sender must not accept any acknowledges until it has itself
    received the broadcast initialization from the peer, as well as
    confirmation of the reception of its own initialization message.

    Furthermore, when a receiver peer goes down, the sender has to produce
    the missing acknowledges from the lost peer locally, in order ensure
    correct release of the buffers that were expected to be acknowledged by
    the said peer.

    In a highly stressed system we have observed that contact with a peer
    may come up and be lost before the above mentioned broadcast initial-
    ization and confirmation have been received. This leads to the locally
    produced acknowledges being rejected, and the non-acknowledged buffers
    to linger in the broadcast link transmission queue until it fills up
    and the link goes into permanent congestion.

    In this commit, we remedy this by temporarily setting the corresponding
    broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
    state to true before we issue the local acknowledges. This ensures that
    those acknowledges will always be accepted. The mentioned state values
    are restored immediately afterwards when the link is reset.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • At first contact between two nodes, an endpoint might sometimes have
    time to send out a LINK_PROTOCOL/STATE packet before it has received
    the broadcast initialization packet from the peer, i.e., before it has
    received a valid broadcast packet number to add to the 'bc_ack' field
    of the protocol message.

    This means that the peer endpoint will receive a protocol packet with an
    invalid broadcast acknowledge value of 0. Under unlucky circumstances
    this may lead to the original, already received acknowledge value being
    overwritten, so that the whole broadcast link goes stale after a while.

    We fix this by delaying the setting of the link field 'bc_peer_is_up'
    until we know that the peer really has received our own broadcast
    initialization message. The latter is always sent out as the first
    unicast message on a link, and always with seqeunce number 1. Because
    of this, we only need to look for a non-zero unicast acknowledge value
    in the arriving STATE messages, and once that is confirmed we know we
    are safe and can set the mentioned field. Before this moment, we must
    ignore all broadcast acknowledges from the peer.

    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

07 Jul, 2016

1 commit