07 May, 2013

2 commits


04 May, 2013

3 commits

  • When sending packets, TIPC bearers use skb_clone() before writing their
    hardware header. This will however NOT copy the data buffer.
    So when the same packet is sent over multiple bearers (to reach multiple
    nodes), the same socket buffer data will be treated by multiple
    tipc_media drivers which will write their own hardware header through
    dev_hard_header().
    Most of the time this is not a problem, because by the time the
    packet is processed by the second media, it has already been sent over
    the first one. However, when the first transmission is delayed (e.g.
    because of insufficient bandwidth or through a shaper), the next bearer
    will overwrite the hardware header, resulting in the packet being sent:
    a) with the wrong source address, when bearers of the same type,
    e.g. ethernet, are involved
    b) with a completely corrupt header, or even dropped, when bearers of
    different types are involved.

    So when the same socket buffer is to be sent multiple times, send a
    pskb_copy() instead (from the second instance on), and release it
    afterwards (the bearer will skb_clone() it anyway).

    Signed-off-by: Gerlando Falauto
    Signed-off-by: David S. Miller

    Gerlando Falauto
     
  • Signed-off-by: Gerlando Falauto
    Signed-off-by: David S. Miller

    Gerlando Falauto
     
  • Signed-off-by: Gerlando Falauto
    Signed-off-by: David S. Miller

    Gerlando Falauto
     

18 Apr, 2013

4 commits

  • Add InfiniBand media type based on the ethernet media type.

    The only real difference is that in case of InfiniBand, we need the entire
    20 bytes of space reserved for media addresses, so the TIPC media type ID is
    not explicitly stored in the packet payload.

    Sample output of tipc-config:

    # tipc-config -v -addr -netid -nt=all -p -m -b -n -ls

    node address:
    current network id: 4711
    Type Lower Upper Port Identity Publication Scope
    0 167776257 167776257 1855512578 cluster
    167776260 167776260 1216454658 zone
    1 1 1 1216479236 node
    Ports:
    1216479235: bound to {1,1}
    1216454657: bound to {0,167776260}
    Media:
    eth
    ib
    Bearers:
    ib:ib0
    Nodes known:
    : up
    Link
    Window:20 packets
    RX packets:0 fragments:0/0 bundles:0/0
    TX packets:0 fragments:0/0 bundles:0/0
    RX naks:0 defs:0 dups:0
    TX naks:0 acks:0 dups:0
    Congestion bearer:0 link:0 Send queue max:0 avg:0

    Link
    ACTIVE MTU:2044 Priority:10 Tolerance:1500 ms Window:50 packets
    RX packets:80 fragments:0/0 bundles:0/0
    TX packets:40 fragments:0/0 bundles:0/0
    TX profile sample:22 packets average:54 octets
    0-64:100% -256:0% -1024:0% -4096:0% -16384:0% -32768:0% -66000:0%
    RX states:410 probes:213 naks:0 defs:0 dups:0
    TX states:410 probes:197 naks:0 acks:0 dups:0
    Congestion bearer:0 link:0 Send queue max:1 avg:0

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • The skb->protocol field is used by packet classifiers and for AF_PACKET
    cooked format, TIPC needs to set it properly.

    Fixes packet classification and ethertype of 0x0000 in cooked captures:

    Out 20:c9:d0:43:12:d9 ethertype Unknown (0x0000), length 56:
    0x0000: 5b50 0028 0000 30d4 0100 1000 0100 1001 [P.(..0.........
    0x0010: 0000 03e8 0000 0001 20c9 d043 12d9 0000 ...........C....
    0x0020: 0000 0000 0000 0000 ........

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Some network protocols, like InfiniBand, don't have a fixed broadcast
    address but one that depends on the configuration. Move the bcast_addr
    to struct tipc_bearer and initialize it with the broadcast address of
    the network device when the bearer is enabled.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

08 Apr, 2013

2 commits

  • Conflicts:
    drivers/nfc/microread/mei.c
    net/netfilter/nfnetlink_queue_core.c

    Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
    some cleanups are going to go on-top.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The code in set_orig_addr() does not initialize all of the members of
    struct sockaddr_tipc when filling the sockaddr info -- namely the union
    is only partly filled. This will make recv_msg() and recv_stream() --
    the only users of this function -- leak kernel stack memory as the
    msg_name member is a local variable in net/socket.c.

    Additionally to that both recv_msg() and recv_stream() fail to update
    the msg_namelen member to 0 while otherwise returning with 0, i.e.
    "success". This is the case for, e.g., non-blocking sockets. This will
    lead to a 128 byte kernel stack leak in net/socket.c.

    Fix the first issue by initializing the memory of the union with
    memset(0). Fix the second one by setting msg_namelen to 0 early as it
    will be updated later if we're going to fill the msg_name member.

    Cc: Jon Maloy
    Cc: Allan Stephens
    Signed-off-by: Mathias Krause
    Signed-off-by: David S. Miller

    Mathias Krause
     

29 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Feb, 2013

1 commit

  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

19 Feb, 2013

1 commit


16 Feb, 2013

4 commits

  • As the number of iovecs in a send request is already limited within
    UIO_MAXIOV(i.e. 1024) in __sys_sendmsg(), it's unnecessary to check it
    again in TIPC stack.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Change overload control to be purely byte-based, using
    sk->sk_rmem_alloc as byte counter, and compare it to a calculated
    upper limit for the socket receive queue.

    For all connection messages, irrespective of message importance,
    the overload limit is set to a constant value (i.e, 67MB). This
    limit should normally never be reached because of the lower
    limit used by the flow control algorithm, and is there only
    as a last resort in case a faulty peer doesn't respect the send
    window limit.

    For datagram messages, message importance is taken into account
    when calculating the overload limit. The calculation is based
    on sk->sk_rcvbuf, and is hence configurable via the socket option
    SO_RCVBUF.

    Cc: Neil Horman
    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • The tipc function discard_rx_queue() is just a duplicated
    implementation of __skb_queue_purge(). Remove the former
    and directly invoke __skb_queue_purge().

    In doing so, the underscores convey to the code reader, more
    information about the current locking state that is assumed.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • After commit 3c294cb3 "tipc: remove the bearer congestion mechanism",
    we try to grab the broadcast bearer lock when sending multicast
    messages over the broadcast link. This will cause an oops because
    the lock is never initialized. This is an old bug, but the lock
    was never actually used before commit 3c294cb3, so that why it was
    not visible until now. The oops will look something like:

    BUG: spinlock bad magic on CPU#2, daemon/147
    lock: bcast_bearer+0x48/0xffffffffffffd19a [tipc],
    .magic: 00000000, .owner: /-1, .owner_cpu: 0
    Pid: 147, comm: daemon Not tainted 3.8.0-rc3+ #206
    Call Trace:
    spin_dump+0x8a/0x8f
    spin_bug+0x21/0x26
    do_raw_spin_lock+0x114/0x150
    _raw_spin_lock_bh+0x19/0x20
    tipc_bearer_blocked+0x1f/0x40 [tipc]
    tipc_link_send_buf+0x82/0x280 [tipc]
    ? __alloc_skb+0x9f/0x2b0
    tipc_bclink_send_msg+0x77/0xa0 [tipc]
    tipc_multicast+0x11b/0x1b0 [tipc]
    send_msg+0x225/0x530 [tipc]
    sock_sendmsg+0xca/0xe0

    The above can be triggered by running the multicast demo program.

    Signed-off-by: Erik Hugne
    Signed-off-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Erik Hugne
     

12 Jan, 2013

1 commit

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: Jon Maloy
    CC: Allan Stephens
    CC: "David S. Miller"
    Signed-off-by: Kees Cook
    Acked-by: David S. Miller

    Kees Cook
     

08 Dec, 2012

8 commits

  • In TIPC's accept() routine, there is a large block of code relating
    to initialization of a new socket, all within an if condition checking
    if the allocation succeeded.

    Here, we simply flip the check of the if, so that the main execution
    path stays at the same indentation level, which improves readability.
    If the allocation fails, we jump to an already existing exit label.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • TIPC accept() call grabs the socket lock on a newly allocated
    socket while holding the socket lock on an old socket. But lockdep
    worries that this might be a recursive lock attempt:

    [ INFO: possible recursive locking detected ]
    ---------------------------------------------
    kworker/u:0/6 is trying to acquire lock:
    (sk_lock-AF_TIPC){+.+.+.}, at: [] accept+0x15c/0x310 [tipc]

    but task is already holding lock:
    (sk_lock-AF_TIPC){+.+.+.}, at: [] accept+0x28/0x310 [tipc]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(sk_lock-AF_TIPC);
    lock(sk_lock-AF_TIPC);

    *** DEADLOCK ***

    May be due to missing lock nesting notation
    [...]

    Tell lockdep that this locking is safe by using lock_sock_nested().
    This is similar to what was done in commit 5131a184a3458d9 for
    SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").

    Also note that this is isn't something that is seen normally,
    as it was uncovered with some experimental work-in-progress
    code not yet ready for mainline. So no need for stable
    backports or similar of this commit.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • As connection setup is now completed asynchronously in BH context,
    in the function filter_connect(), the corresponding code in recv_msg()
    becomes redundant.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • TIPC has so far only supported blocking connect(), meaning that a call
    to connect() doesn't return until either the connection is fully
    established, or an error occurs. This has proved insufficient for many
    users, so we now introduce non-blocking connect(), analogous to how
    this is done in TCP and other protocols.

    With this feature, if a connection cannot be established instantly,
    connect() will return the error code "-EINPROGRESS".
    If the user later calls connect() again, he will either have the
    return code "-EALREADY" or "-EISCONN", depending on whether the
    connection has been established or not.

    The user must have explicitly set the socket to be non-blocking
    (SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
    for some reason they had set this already (the socket would anyway
    remain blocking in current TIPC) this change should be completely
    backwards compatible.

    It is also now possible to call select() or poll() to wait for the
    completion of a connection.

    An effect of the above is that the actual completion of a connection
    may now be performed asynchronously, independent of the calls from
    user space. Therefore, we now execute this code in BH context, in
    the function filter_rcv(), which is executed upon reception of
    messages in the socket.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    [PG: minor refactoring for improved connect/disconnect function names]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Handling of connection-related message reception is currently scattered
    around at different places in the code. This makes it harder to verify
    that things are handled correctly in all possible scenarios.
    So we consolidate the existing processing of connection-oriented
    message reception in a single routine. In the process, we convert the
    chain of if/else into a switch/case for improved readability.

    A cast on the socket_state in the switch is needed to avoid compile
    warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
    ‘4294967295’ not in enumerated type". This happens because existing
    tipc code pseudo extends the default linux socket state values with:

    #define SS_LISTENING -1 /* socket is listening */
    #define SS_READY -2 /* socket is connectionless */

    It may make sense to add these as _positive_ values to the existing
    socket state enum list someday, vs. these already existing defines.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    [PG: add cast to fix warning; remove returns from middle of switch]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Currently we have tipc_disconnect and tipc_disconnect_port. It is
    not clear from the names alone, what they do or how they differ.
    It turns out that tipc_disconnect just deals with the port locking
    and then calls tipc_disconnect_port which does all the work.

    If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
    then we will be following typical linux convention, where:

    __tipc_disconnect: "raw" function that does all the work.

    tipc_disconnect: wrapper that deals with locking and then calls
    the real core __tipc_disconnect function

    With this, the difference is immediately evident, and locking
    violations are more apt to be spotted by chance while working on,
    or even just while reading the code.

    On the connect side of things, we currently only have the single
    "tipc_connect2port" function. It does both the locking at enter/exit,
    and the core of the work. Pending changes will make it desireable to
    have the connect be a two part locking wrapper + worker function,
    just like the disconnect is already.

    Here, we make the connect look just like the updated disconnect case,
    for the above reason, and for consistency. In the process, we also
    get rid of the "2port" suffix that was on the original name, since
    it adds no descriptive value.

    On close examination, one might notice that the above connect
    changes implicitly move the call to tipc_link_get_max_pkt() to be
    within the scope of tipc_port_lock() protected region; when it was
    not previously. We don't see any issues with this, and it is in
    keeping with __tipc_connect doing the work and tipc_connect just
    handling the locking.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • The sk_recv_queue upper limit for connectionless sockets has empirically
    turned out to be too low. When we double the current limit we get much
    fewer rejected messages and no noticable negative side-effects.

    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Jon Maloy
     
  • As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
    global atomic counter for the sum of sk_recv_queue sizes across all
    tipc sockets. When incremented, the counter is compared to an upper
    threshold value, and if this is reached, the message is rejected
    with error code TIPC_OVERLOAD.

    This check was originally meant to protect the node against
    buffer exhaustion and general CPU overload. However, all experience
    indicates that the feature not only is redundant on Linux, but even
    harmful. Users run into the limit very often, causing disturbances
    for their applications, while removing it seems to have no negative
    effects at all. We have also seen that overall performance is
    boosted significantly when this bottleneck is removed.

    Furthermore, we don't see any other network protocols maintaining
    such a mechanism, something strengthening our conviction that this
    control can be eliminated.

    As a result, the atomic variable tipc_queue_size is now unused
    and so it can be deleted. There is a getsockopt call that used
    to allow reading it; we retain that but just return zero for
    maximum compatibility.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Cc: Neil Horman
    [PG: phase out tipc_queue_size as pointed out by Neil Horman]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     

07 Dec, 2012

1 commit

  • Each link instance has a periodic job checking if there is a stale
    ongoing message reassembly associated to the link. If no new
    fragment has been received during the last 4*[link_tolerance] period,
    it is assumed the missing fragment will never arrive. As a consequence,
    the reassembly buffer is discarded, and a gap in the message sequence
    occurs.

    This assumption is wrong. After we abandoned our ambition to develop
    packet routing for multi-cluster networks, only single-hop packet
    transfer remains as an option. For those, all packets are guaranteed
    to be delivered in sequence to the defragmentation layer. Any failure
    to achieve sequenced delivery will eventually lead to link reset, and
    the reassembly buffer will be flushed anyway.

    So we just remove this periodic check, which is now obsolete.

    Signed-off-by: Erik Hugne
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    [PG: also delete get/inc_timer count, since they are now unused]
    Signed-off-by: Paul Gortmaker

    Erik Hugne
     

23 Nov, 2012

3 commits

  • There used to be a time when TIPC had lots of Kconfig knobs the
    end user could alter, but they have all been made automatic or
    obsolete, with the exception of CONFIG_TIPC_PORTS. This
    previously existing set of options was all hidden under the
    TIPC_ADVANCED setting, which does not exist in any code, but
    only in Kconfig scope.

    Having this now, just to hide the one remaining "advanced"
    option no longer makes sense. Remove it. Also get rid of the
    ifdeffery in the TIPC code that allowed for TIPC_PORTS to be
    possibly undefined.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • As the variable:node is currently defined to u32 type, it is
    unnecessary to cast its type to u32 again when using it.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Upon establishing a first link between two nodes, there is
    currently a risk that the two endpoints will disagree on exactly
    which sequence number reception and acknowleding of broadcast
    packets should start.

    The following scenarios may happen:

    1: Node A sends an ACTIVATE message to B, telling it to start acking
    packets from sequence number N.
    2: Node A sends out broadcast N, but does not expect an acknowledge
    from B, since B is not yet in its broadcast receiver's list.
    3: Node A receives ACK for N from all nodes except B, and releases
    packet N.
    4: Node B receives the ACTIVATE, activates its link endpoint, and
    stores the value N as sequence number of first expected packet.
    5: Node B sends a NAME_DISTR message to A.
    6: Node A receives the NAME_DISTR message, and activates its endpoint.
    At this moment B is added to A's broadcast receiver's set.
    Node A also sets sequence number 0 as the first broadcast packet
    to be received from B.
    7: Node A sends broadcast N+1.
    8: B receives N+1, determines there is a gap in the sequence, since
    it is expecting N, and sends a NACK for N back to A.
    9: Node A has already released N, so no retransmission is possible.
    The broadcast link in direction A->B is stale.

    In addition to, or instead of, 7-9 above, the following may happen:

    10: Node B sends broadcast M > 0 to A.
    11: Node A receives M, falsely decides there must be a gap, since
    it is expecting packet 0, and asks for retransmission of packets
    [0,M-1].
    12: Node B has already released these packets, so the broadcast
    link is stale in direction B->A.

    We solve this problem by introducing a new unicast message type,
    BCAST_PROTOCOL/STATE, to convey the sequence number of the next
    sent broadcast packet to the other endpoint, at exactly the moment
    that endpoint is added to the own node's broadcast receivers list,
    and before any other unicast messages are permitted to be sent.

    Furthermore, we don't allow any node to start receiving and
    processing broadcast packets until this new synchronization
    message has been received.

    To maintain backwards compatibility, we still open up for
    broadcast reception if we receive a NAME_DISTR message without
    any preceding broadcast sync message. In this case, we must
    assume that the other end has an older code version, and will
    never send out the new synchronization message. Hence, for mixed
    old and new nodes, the issue arising in 7-12 of the above may
    happen with the same probability as before.

    Signed-off-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Jon Maloy
     

22 Nov, 2012

6 commits

  • Rename the "supported" flag in bclink structure to "recv_permitted"
    to better reflect what it is used for. When this flag is set for a
    given node, we are permitted to receive and acknowledge broadcast
    messages from that node. Convert it to a bool at the same time,
    since it is not used to store any numerical values.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • The "supportable" flag in bclink structure is a compatibility flag
    indicating whether a peer node is capable of receiving TIPC broadcast
    messages. However, all TIPC versions since tipc-1.5, and after the
    inclusion in the upstream Linux kernel in 2006, support this capability.
    It is highly unlikely that anybody is still using such an old
    version of TIPC, let alone that they want to mix it with TIPC-2.0
    nodes. Therefore, we now remove the "supportable" flag.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Currently at the TIPC bearer layer there is the following congestion
    mechanism:

    Once sending packets has failed via that bearer, the bearer will be
    flagged as being in congested state at once. During bearer congestion,
    all packets arriving at link will be queued on the link's outgoing
    buffer. When we detect that the state of bearer congestion has
    relaxed (e.g. some packets are received from the bearer) we will try
    our best to push all packets in the link's outgoing buffer until the
    buffer is empty, or until the bearer is congested again.

    However, in fact the TIPC bearer never receives any feedback from the
    device layer whether a send was successful or not, so it must always
    assume it was successful. Therefore, the bearer congestion mechanism
    as it exists currently is of no value.

    But the bearer blocking state is still useful for us. For example,
    when the physical media goes down/up, we need to change the state of
    the links bound to the bearer. So the code maintaing the state
    information is not removed.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • When a socket is shut down, we should wake up all thread sleeping on
    it, instead of just one of them. Otherwise, when several threads are
    polling the same socket, and one of them does shutdown(), the
    remaining threads may end up sleeping forever.

    Also, to align socket usage with common practice in other stacks, we
    use one of the common socket callback handlers, sk_state_change(),
    to wake up pending users. This is similar to the usage in e.g.
    inet_shutdown(). [net/ipv4/af_inet.c].

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • If an implied connect is attempted on a nonblocking STREAM/SEQPACKET
    socket during link congestion, the connect message will be discarded
    and sendmsg will return EAGAIN. This is normal behavior, and the
    application is expected to poll the socket until POLLOUT is set,
    after which the connection attempt can be retried.
    However, the POLLOUT flag is never set for unconnected sockets and
    poll() always returns a zero mask. The application is then left without
    a trigger for when it can make another attempt at sending the message.

    The solution is to check if we're polling on an unconnected socket
    and set the POLLOUT flag if the TIPC port owned by this socket
    is not congested. The TIPC ports waiting on a specific link will be
    marked as 'not congested' when the link congestion have abated.

    Signed-off-by: Erik Hugne
    Signed-off-by: Paul Gortmaker

    Erik Hugne
     
  • When an application blocks at poll/select on a TIPC socket
    while requesting a specific event mask, both the filter_rcv() and
    wakeupdispatch() case will wake it up unconditionally whenever
    the state changes (i.e an incoming message arrives, or congestion
    has subsided). No mask is used.

    To avoid this, we populate sk->sk_data_ready and sk->sk_write_space
    with tipc_data_ready and tipc_write_space respectively, which makes
    tipc more in alignment with the rest of the networking code. These
    pass the exact set of possible events to the waker in fs/select.c
    hence avoiding waking up blocked processes unnecessarily.

    In doing so, we uncover another issue -- that there needs to be a
    memory barrier in these poll/receive callbacks, otherwise we are
    subject to the the same race as documented above wq_has_sleeper()
    [in commit a57de0b4 "net: adding memory barrier to the poll and
    receive callbacks"]. So we need to replace poll_wait() with
    sock_poll_wait() and use rcu protection for the sk->sk_wq pointer
    in these two new functions.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     

04 Nov, 2012

1 commit

  • If tasklet_disable() is called before related tasklet handled,
    tasklet_kill will never be finished. tasklet_kill is enough.

    Signed-off-by: Xiaotian Feng
    Cc: Jon Maloy
    Cc: Allan Stephens
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Cc: tipc-discussion@lists.sourceforge.net
    Signed-off-by: David S. Miller

    Xiaotian Feng
     

05 Oct, 2012

1 commit

  • When large buffers are sent over connected TIPC sockets, it
    is likely that the sk_backlog will be filled up on the
    receiver side, but the TIPC flow control mechanism is happily
    unaware of this since that is based on message count.

    The sender will receive a TIPC_ERR_OVERLOAD message when this occurs
    and drop it's side of the connection, leaving it stale on
    the receiver end.

    By increasing the sk_rcvbuf to a 'worst case' value, we avoid the
    overload caused by a full backlog queue and the flow control
    will work properly.

    This worst case value is the max TIPC message size times
    the flow control window, multiplied by two because a sender
    will transmit up to double the window size before a port is marked
    congested.
    We multiply this by 2 to account for the sk_buff and other overheads.

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne