08 Dec, 2012

8 commits

  • In TIPC's accept() routine, there is a large block of code relating
    to initialization of a new socket, all within an if condition checking
    if the allocation succeeded.

    Here, we simply flip the check of the if, so that the main execution
    path stays at the same indentation level, which improves readability.
    If the allocation fails, we jump to an already existing exit label.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • TIPC accept() call grabs the socket lock on a newly allocated
    socket while holding the socket lock on an old socket. But lockdep
    worries that this might be a recursive lock attempt:

    [ INFO: possible recursive locking detected ]
    ---------------------------------------------
    kworker/u:0/6 is trying to acquire lock:
    (sk_lock-AF_TIPC){+.+.+.}, at: [] accept+0x15c/0x310 [tipc]

    but task is already holding lock:
    (sk_lock-AF_TIPC){+.+.+.}, at: [] accept+0x28/0x310 [tipc]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(sk_lock-AF_TIPC);
    lock(sk_lock-AF_TIPC);

    *** DEADLOCK ***

    May be due to missing lock nesting notation
    [...]

    Tell lockdep that this locking is safe by using lock_sock_nested().
    This is similar to what was done in commit 5131a184a3458d9 for
    SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").

    Also note that this is isn't something that is seen normally,
    as it was uncovered with some experimental work-in-progress
    code not yet ready for mainline. So no need for stable
    backports or similar of this commit.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • As connection setup is now completed asynchronously in BH context,
    in the function filter_connect(), the corresponding code in recv_msg()
    becomes redundant.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • TIPC has so far only supported blocking connect(), meaning that a call
    to connect() doesn't return until either the connection is fully
    established, or an error occurs. This has proved insufficient for many
    users, so we now introduce non-blocking connect(), analogous to how
    this is done in TCP and other protocols.

    With this feature, if a connection cannot be established instantly,
    connect() will return the error code "-EINPROGRESS".
    If the user later calls connect() again, he will either have the
    return code "-EALREADY" or "-EISCONN", depending on whether the
    connection has been established or not.

    The user must have explicitly set the socket to be non-blocking
    (SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
    for some reason they had set this already (the socket would anyway
    remain blocking in current TIPC) this change should be completely
    backwards compatible.

    It is also now possible to call select() or poll() to wait for the
    completion of a connection.

    An effect of the above is that the actual completion of a connection
    may now be performed asynchronously, independent of the calls from
    user space. Therefore, we now execute this code in BH context, in
    the function filter_rcv(), which is executed upon reception of
    messages in the socket.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    [PG: minor refactoring for improved connect/disconnect function names]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Handling of connection-related message reception is currently scattered
    around at different places in the code. This makes it harder to verify
    that things are handled correctly in all possible scenarios.
    So we consolidate the existing processing of connection-oriented
    message reception in a single routine. In the process, we convert the
    chain of if/else into a switch/case for improved readability.

    A cast on the socket_state in the switch is needed to avoid compile
    warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
    ‘4294967295’ not in enumerated type". This happens because existing
    tipc code pseudo extends the default linux socket state values with:

    #define SS_LISTENING -1 /* socket is listening */
    #define SS_READY -2 /* socket is connectionless */

    It may make sense to add these as _positive_ values to the existing
    socket state enum list someday, vs. these already existing defines.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    [PG: add cast to fix warning; remove returns from middle of switch]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • Currently we have tipc_disconnect and tipc_disconnect_port. It is
    not clear from the names alone, what they do or how they differ.
    It turns out that tipc_disconnect just deals with the port locking
    and then calls tipc_disconnect_port which does all the work.

    If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
    then we will be following typical linux convention, where:

    __tipc_disconnect: "raw" function that does all the work.

    tipc_disconnect: wrapper that deals with locking and then calls
    the real core __tipc_disconnect function

    With this, the difference is immediately evident, and locking
    violations are more apt to be spotted by chance while working on,
    or even just while reading the code.

    On the connect side of things, we currently only have the single
    "tipc_connect2port" function. It does both the locking at enter/exit,
    and the core of the work. Pending changes will make it desireable to
    have the connect be a two part locking wrapper + worker function,
    just like the disconnect is already.

    Here, we make the connect look just like the updated disconnect case,
    for the above reason, and for consistency. In the process, we also
    get rid of the "2port" suffix that was on the original name, since
    it adds no descriptive value.

    On close examination, one might notice that the above connect
    changes implicitly move the call to tipc_link_get_max_pkt() to be
    within the scope of tipc_port_lock() protected region; when it was
    not previously. We don't see any issues with this, and it is in
    keeping with __tipc_connect doing the work and tipc_connect just
    handling the locking.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • The sk_recv_queue upper limit for connectionless sockets has empirically
    turned out to be too low. When we double the current limit we get much
    fewer rejected messages and no noticable negative side-effects.

    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Jon Maloy
     
  • As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
    global atomic counter for the sum of sk_recv_queue sizes across all
    tipc sockets. When incremented, the counter is compared to an upper
    threshold value, and if this is reached, the message is rejected
    with error code TIPC_OVERLOAD.

    This check was originally meant to protect the node against
    buffer exhaustion and general CPU overload. However, all experience
    indicates that the feature not only is redundant on Linux, but even
    harmful. Users run into the limit very often, causing disturbances
    for their applications, while removing it seems to have no negative
    effects at all. We have also seen that overall performance is
    boosted significantly when this bottleneck is removed.

    Furthermore, we don't see any other network protocols maintaining
    such a mechanism, something strengthening our conviction that this
    control can be eliminated.

    As a result, the atomic variable tipc_queue_size is now unused
    and so it can be deleted. There is a getsockopt call that used
    to allow reading it; we retain that but just return zero for
    maximum compatibility.

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Cc: Neil Horman
    [PG: phase out tipc_queue_size as pointed out by Neil Horman]
    Signed-off-by: Paul Gortmaker

    Ying Xue
     

22 Nov, 2012

3 commits

  • When a socket is shut down, we should wake up all thread sleeping on
    it, instead of just one of them. Otherwise, when several threads are
    polling the same socket, and one of them does shutdown(), the
    remaining threads may end up sleeping forever.

    Also, to align socket usage with common practice in other stacks, we
    use one of the common socket callback handlers, sk_state_change(),
    to wake up pending users. This is similar to the usage in e.g.
    inet_shutdown(). [net/ipv4/af_inet.c].

    Signed-off-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: Paul Gortmaker

    Ying Xue
     
  • If an implied connect is attempted on a nonblocking STREAM/SEQPACKET
    socket during link congestion, the connect message will be discarded
    and sendmsg will return EAGAIN. This is normal behavior, and the
    application is expected to poll the socket until POLLOUT is set,
    after which the connection attempt can be retried.
    However, the POLLOUT flag is never set for unconnected sockets and
    poll() always returns a zero mask. The application is then left without
    a trigger for when it can make another attempt at sending the message.

    The solution is to check if we're polling on an unconnected socket
    and set the POLLOUT flag if the TIPC port owned by this socket
    is not congested. The TIPC ports waiting on a specific link will be
    marked as 'not congested' when the link congestion have abated.

    Signed-off-by: Erik Hugne
    Signed-off-by: Paul Gortmaker

    Erik Hugne
     
  • When an application blocks at poll/select on a TIPC socket
    while requesting a specific event mask, both the filter_rcv() and
    wakeupdispatch() case will wake it up unconditionally whenever
    the state changes (i.e an incoming message arrives, or congestion
    has subsided). No mask is used.

    To avoid this, we populate sk->sk_data_ready and sk->sk_write_space
    with tipc_data_ready and tipc_write_space respectively, which makes
    tipc more in alignment with the rest of the networking code. These
    pass the exact set of possible events to the waker in fs/select.c
    hence avoiding waking up blocked processes unnecessarily.

    In doing so, we uncover another issue -- that there needs to be a
    memory barrier in these poll/receive callbacks, otherwise we are
    subject to the the same race as documented above wq_has_sleeper()
    [in commit a57de0b4 "net: adding memory barrier to the poll and
    receive callbacks"]. So we need to replace poll_wait() with
    sock_poll_wait() and use rcu protection for the sk->sk_wq pointer
    in these two new functions.

    Signed-off-by: Ying Xue
    Signed-off-by: Paul Gortmaker

    Ying Xue
     

05 Oct, 2012

1 commit

  • When large buffers are sent over connected TIPC sockets, it
    is likely that the sk_backlog will be filled up on the
    receiver side, but the TIPC flow control mechanism is happily
    unaware of this since that is based on message count.

    The sender will receive a TIPC_ERR_OVERLOAD message when this occurs
    and drop it's side of the connection, leaving it stale on
    the receiver end.

    By increasing the sk_rcvbuf to a 'worst case' value, we avoid the
    overload caused by a full backlog queue and the flow control
    will work properly.

    This worst case value is the max TIPC message size times
    the flow control window, multiplied by two because a sender
    will transmit up to double the window size before a port is marked
    congested.
    We multiply this by 2 to account for the sk_buff and other overheads.

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne
     

14 Jul, 2012

1 commit

  • All messages should go directly to the kernel log. The TIPC
    specific error, warning, info and debug trace macro's are
    removed and all references replaced with pr_err, pr_warn,
    pr_info and pr_debug.

    Commonly used sub-strings are explicitly declared as a const
    char to reduce .text size.

    Note that this means the debug messages (changed to pr_debug),
    are now enabled through dynamic debugging, instead of a TIPC
    specific Kconfig option (TIPC_DEBUG). The latter will be
    phased out completely

    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    [PG: use pr_fmt as suggested by Joe Perches ]
    Signed-off-by: Paul Gortmaker

    Erik Hugne
     

11 Jul, 2012

1 commit


04 Jun, 2012

1 commit

  • Adding casts of objects to the same type is unnecessary
    and confusing for a human reader.

    For example, this cast:

    int y;
    int *p = (int *)&y;

    I used the coccinelle script below to find and remove these
    unnecessary casts. I manually removed the conversions this
    script produces of casts with __force and __user.

    @@
    type T;
    T *p;
    @@

    - (T *)p
    + p

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

01 May, 2012

1 commit

  • Some of the comment blocks are floating in limbo between two
    functions, or between blocks of code. Delete the extra line
    feeds between any comment and its associated following block
    of code, to be consistent with the majority of the rest of
    the kernel. Also delete trailing newlines at EOF and fix
    a couple trivial typos in existing comments.

    This is a 100% cosmetic change with no runtime impact. We get
    rid of over 500 lines of non-code, and being blank line deletes,
    they won't even show up as noise in git blame.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

27 Apr, 2012

1 commit

  • Adds check to ensure TIPC sockets reject incoming payload messages
    that have an unrecognized message type.

    Remove the old open question about whether TIPC_ERR_NO_PORT is
    the proper return value. It is appropriate here since there are
    valid instances where another node can make use of the reply,
    and at this point in time the host is already broadcasting TIPC
    data, so there are no real security concerns.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

24 Apr, 2012

1 commit

  • sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
    memory limit. We need to make this limit a parameter for TCP use.

    No functional change expected in this patch, all callers still using the
    old sk_rcvbuf limit.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Maciej Żenczykowski
    Cc: Yuchung Cheng
    Cc: Ilpo Järvinen
    Cc: Rick Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Apr, 2012

1 commit

  • Revises routines that deal with connections between two ports on
    the same node to ensure the connection is not impacted if the node's
    network address is changed in mid-operation. The routines now treat
    the default node address of as an alias for "this node" in
    the following situations:

    1) Incoming messages destined to a connected port now handle the alias
    properly when validating that the message was sent by the expected
    peer port, ensuring that the message will be accepted regardless of
    whether it specifies the node's old network address or it's current one.

    2) The code which completes connection establishment now handles the
    alias properly when determining if the peer port is on the same node
    as the connected port.

    An added benefit of addressing issue 1) is that some peer port
    validation code has been relocated to TIPC's socket subsystem, which
    means that validation is no longer done twice when a message is
    sent to a non-socket port (such as TIPC's configuration service or
    network topology service).

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

16 Apr, 2012

1 commit


25 Feb, 2012

2 commits

  • Gets rid of two inlined routines that simply call existing sk_buff
    manipulation routines, since there is no longer any extra processing
    done by the helper routines.

    Note that these changes are essentially cosmetic in nature, and have
    no impact on the actual operation of TIPC.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     
  • Streamlines the logic that prevents an application from binding a
    reserved TIPC name type to a port by moving the check to the code
    that handles a socket bind() operation. This allows internal TIPC
    subsystems to bind a reserved name without having to set an atomic
    flag to gain permission to use such a name. (This simplification is
    now possible due to the elimination of support for TIPC's native API.)

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

28 Dec, 2011

1 commit


01 Nov, 2011

1 commit


18 Sep, 2011

2 commits


01 Sep, 2011

1 commit

  • Saves a socket's TIPC_CONN_TIMEOUT socket option value in its original
    form (milliseconds), rather than jiffies. This ensures that the exact
    value set using setsockopt() is always returned by getsockopt(), without
    being subject to rounding issues introduced by a ms->jiffies->ms
    conversion sequence.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

25 Jun, 2011

1 commit


11 May, 2011

2 commits

  • Rework TIPC's message sending routines to take advantage of the total
    amount of data value passed to it by the kernel socket infrastructure.
    This change eliminates the need for TIPC to compute the size of outgoing
    messages itself, as well as the check for an oversize message in
    tipc_msg_build(). In addition, this change warrants an explanation:

    - res = send_packet(NULL, sock, &my_msg, 0);
    + res = send_packet(NULL, sock, &my_msg, bytes_to_send);

    Previously, the final argument to send_packet() was ignored (since the
    amount of data being sent was recalculated by a lower-level routine)
    and we could just pass in a dummy value (0). Now that the
    recalculation is being eliminated, the argument value being passed to
    send_packet() is significant and we have to supply the actual amount
    of data we want to send.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     
  • Adds checks to TIPC's socket send routines to promptly detect and
    abort attempts to send more than 66,000 bytes in a single TIPC
    message or more than 2**31-1 bytes in a single TIPC byte stream request.
    In addition, this ensures that the number of iovecs in a send request
    does not exceed the limits of a standard integer variable.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

14 Mar, 2011

2 commits

  • Adds support for the SO_RCVTIMEO socket option to TIPC's socket
    receive routines.

    Thanks go out to Raj Hegde for his contribution
    to the development and testing this enhancement.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     
  • Enhances TIPC's socket receive routines to support iovec structures
    containing more than a single entry. This change leverages existing
    sk_buff routines to do most of the work; the only significant change
    to TIPC itself is that an sk_buff now records how much data has been
    already consumed as an numeric offset, rather than as a pointer to
    the first unread data byte.

    Signed-off-by: Allan Stephens
    Signed-off-by: Paul Gortmaker

    Allan Stephens
     

24 Feb, 2011

2 commits


02 Jan, 2011

6 commits