03 May, 2017

3 commits

  • Pull networking updates from David Miller:
    "Here are some highlights from the 2065 networking commits that
    happened this development cycle:

    1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

    2) Add a generic XDP driver, so that anyone can test XDP even if they
    lack a networking device whose driver has explicit XDP support
    (me).

    3) Sparc64 now has an eBPF JIT too (me)

    4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
    Starovoitov)

    5) Make netfilter network namespace teardown less expensive (Florian
    Westphal)

    6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

    7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

    8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

    9) Multiqueue support in stmmac driver (Joao Pinto)

    10) Remove TCP timewait recycling, it never really could possibly work
    well in the real world and timestamp randomization really zaps any
    hint of usability this feature had (Soheil Hassas Yeganeh)

    11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
    Aleksandrov)

    12) Add socket busy poll support to epoll (Sridhar Samudrala)

    13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
    and several others)

    14) IPSEC hw offload infrastructure (Steffen Klassert)"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
    tipc: refactor function tipc_sk_recv_stream()
    tipc: refactor function tipc_sk_recvmsg()
    net: thunderx: Optimize page recycling for XDP
    net: thunderx: Support for XDP header adjustment
    net: thunderx: Add support for XDP_TX
    net: thunderx: Add support for XDP_DROP
    net: thunderx: Add basic XDP support
    net: thunderx: Cleanup receive buffer allocation
    net: thunderx: Optimize CQE_TX handling
    net: thunderx: Optimize RBDR descriptor handling
    net: thunderx: Support for page recycling
    ipx: call ipxitf_put() in ioctl error path
    net: sched: add helpers to handle extended actions
    qed*: Fix issues in the ptp filter config implementation.
    qede: Fix concurrency issue in PTP Tx path processing.
    stmmac: Add support for SIMATIC IOT2000 platform
    net: hns: fix ethtool_get_strings overflow in hns driver
    tcp: fix wraparound issue in tcp_lp
    bpf, arm64: fix jit branch offset related to ldimm64
    bpf, arm64: implement jiting of BPF_XADD
    ...

    Linus Torvalds
     
  • We try to make this function more readable by improving variable names
    and comments, using more stack variables, and doing some smaller changes
    to the logic. We also rename the function to make it consistent with
    naming conventions used elsewhere in the code.

    Reviewed-by: Parthasarathy Bhuvaragan
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • We try to make this function more readable by improving variable names
    and comments, plus some minor changes to the logic.

    Reviewed-by: Parthasarathy Bhuvaragan
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

29 Apr, 2017

3 commits

  • When a socket is shutting down, we notify the peer node about the
    connection termination by reusing an incoming message if possible.
    If the last received message was a connection acknowledgment
    message, we reverse this message and set the error code to
    TIPC_ERR_NO_PORT and send it to peer.

    In tipc_sk_proto_rcv(), we never check for message errors while
    processing the connection acknowledgment or probe messages. Thus
    this message performs the usual flow control accounting and leaves
    the session hanging.

    In this commit, we terminate the connection when we receive such
    error messages.

    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Until now, the checks for sockets in CONNECTING state were based on
    the assumption that the incoming message was always from the
    peer's accepted data socket.

    However, when an application using a non-blocking socket sends an
    implicit connect, this socket, which is in CONNECTING state, can
    receive error messages from the peer's listening socket. As we
    discard these messages, the application socket hangs due to
    inactivity.
    In addition to this, there are other places where we process errors
    but do not notify the user.

    In this commit, we process such incoming error messages and notify
    our users about them using sk_state_change().

    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • In filter_connect, we use waitqueue_active() to check for any
    connections to wake up. But waitqueue_active() is missing memory
    barriers while accessing the critical sections, leading to
    inconsistent results.

    In this commit, we replace this with an SMP safe wq_has_sleeper()
    using the generic socket callback sk_data_ready().

    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

27 Apr, 2017

1 commit


25 Apr, 2017

3 commits

  • Until now in tipc_recv_stream(), we update the received
    unacknowledged bytes based on a stack variable and not based on the
    actual message size.
    If the user buffer passed to tipc_recv_stream() is smaller than the
    received skb, the size variable on the stack differs from the actual
    message size in the skb. This leads to a flow control accounting
    error causing permanent congestion.

    In this commit, we fix this accounting error by always using the
    size of the incoming message.
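    The accounting difference can be sketched in userspace as follows.
    This is a simplified model, not the kernel code: the struct and
    function names are hypothetical, and it assumes the mismatch shows up
    when one message is consumed by several reads into a small buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive-side state; names are illustrative, not TIPC's. */
struct rx_state {
    size_t unacked;          /* bytes consumed but not yet acknowledged */
};

struct rx_msg {
    size_t size;             /* full size of the queued message */
    size_t offset;           /* how much the user has already read */
};

/* Read up to buf_len bytes of the message; returns the bytes copied.
 * When the message is fully consumed it is accounted for, either by the
 * size of the last chunk (old behaviour) or by the message size (fix). */
static size_t recv_stream(struct rx_state *rx, struct rx_msg *m,
                          size_t buf_len, int use_msg_size)
{
    size_t remaining = m->size - m->offset;
    size_t copied = remaining < buf_len ? remaining : buf_len;

    m->offset += copied;
    if (m->offset == m->size)
        rx->unacked += use_msg_size ? m->size : copied;
    return copied;
}
```

    With a 100-byte message read through a 60-byte buffer, the old
    variant credits only the final 40 bytes, so the sender's window never
    fully reopens; the fixed variant credits all 100.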

    Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Until now in tipc_send_stream(), we return -1 when the socket
    encounters link congestion even if the socket had successfully
    sent partial data. This is incorrect, as the application resends
    the same partial data, leading to data corruption at the
    receiver's end.

    In this commit, we return the partially sent bytes as the return
    value at link congestion.
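    A minimal userspace sketch of that return convention, assuming a
    hypothetical link that accepts a fixed number of chunks before it
    reports congestion. The names link_sim, link_xmit() and send_stream()
    are illustrative only and are not the TIPC functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical link that can take free_slots more chunks. */
struct link_sim {
    int free_slots;
};

/* Returns 0 when the chunk was accepted, -1 on congestion. */
static int link_xmit(struct link_sim *l)
{
    if (l->free_slots == 0)
        return -1;
    l->free_slots--;
    return 0;
}

/* Send len bytes in pieces of at most chunk bytes. On congestion, report
 * what was already sent so the caller resumes after it; fail with -1
 * only when nothing at all went out. */
static long send_stream(struct link_sim *l, size_t len, size_t chunk)
{
    size_t sent = 0;

    assert(chunk > 0);
    while (sent < len) {
        size_t n = len - sent < chunk ? len - sent : chunk;

        if (link_xmit(l) < 0)
            return sent ? (long)sent : -1;
        sent += n;
    }
    return (long)sent;
}
```

    A caller that gets back 200 for a 250-byte request knows to continue
    from byte 200 instead of resending from the start.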

    Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
    Signed-off-by: Parthasarathy Bhuvaragan
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Function nlmsg_new() will return a NULL pointer if there is not enough
    memory, and its return value should be checked before it is used.
    However, in function tipc_nl_node_get_monitor(), the validation of the
    return value of function nlmsg_new() is missed. This patch fixes the
    bug.
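    The pattern being restored is the usual check-before-use on an
    allocation. A userspace sketch, where msg_new() and get_monitor() are
    hypothetical stand-ins and the simulate_oom flag exists only to make
    the failure path testable:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator that can be told to fail. */
static void *msg_new(size_t size, int simulate_oom)
{
    return simulate_oom ? NULL : malloc(size);
}

/* Build a reply message; bail out before touching a NULL buffer. */
static int get_monitor(size_t size, int simulate_oom)
{
    void *msg = msg_new(size, simulate_oom);

    if (!msg)
        return -ENOMEM;

    memset(msg, 0, size);    /* safe: msg is known to be valid here */
    free(msg);
    return 0;
}
```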

    Signed-off-by: Pan Bian
    Signed-off-by: David S. Miller

    Pan Bian
     

14 Apr, 2017

2 commits

  • This is an add-on to the previous patch that passes the extended ACK
    structure where it's already available by existing genl_info or extack
    function arguments.

    This was done with this spatch (with some manual adjustment of
    indentation):

    @@
    expression A, B, C, D, E;
    identifier fn, info;
    @@
    fn(..., struct genl_info *info, ...) {
    ...
    -nlmsg_parse(A, B, C, D, E, NULL)
    +nlmsg_parse(A, B, C, D, E, info->extack)
    ...
    }

    @@
    expression A, B, C, D, E;
    identifier fn, info;
    @@
    fn(..., struct genl_info *info, ...) {
    extack)
    ...>
    }

    @@
    expression A, B, C, D, E;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    @@
    expression A, B, C, D, E;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    @@
    expression A, B, C, D, E;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {
    ...
    -nlmsg_parse(A, B, C, D, E, NULL)
    +nlmsg_parse(A, B, C, D, E, extack)
    ...
    }

    @@
    expression A, B, C, D;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    @@
    expression A, B, C, D;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    @@
    expression A, B, C, D;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    @@
    expression A, B, C;
    identifier fn, extack;
    @@
    fn(..., struct netlink_ext_ack *extack, ...) {

    }

    Signed-off-by: Johannes Berg
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Pass the new extended ACK reporting struct to all of the generic
    netlink parsing functions. For now, pass NULL in almost all callers
    (except for some in the core.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

30 Mar, 2017

2 commits


29 Mar, 2017

2 commits

  • When a new subscription object is inserted into name_seq->subscriptions
    list, it's under name_seq->lock protection; when a subscription is
    deleted from the list, it's also under the same lock protection;
    similarly, when accessing a subscription by going through subscriptions
    list, the entire process is also protected by the name_seq->lock.

    Therefore, if subscription refcount is increased before it's inserted
    into subscriptions list, and its refcount is decreased after it's
    deleted from the list, it will be unnecessary to hold refcount at all
    before accessing subscription object which is obtained by going through
    subscriptions list under name_seq->lock protection.

    Signed-off-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • After a subscription object is created, it's inserted into its
    subscriber's subscrp_list under subscriber lock protection.
    Similarly, before it's destroyed, it should first be removed from
    its subscriber->subscrp_list. Since the subscription list is
    accessed with subscriber lock, all the subscriptions are valid
    during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
    remove subscription get/put and the extra subscriber unlock/lock.

    After this change, the subscriptions refcount cleanup is very simple
    and does not access any lock.

    Acked-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Ying Xue
     

23 Mar, 2017

1 commit

  • Until now, tipc_nametbl_unsubscribe() is called at subscriptions
    reference count cleanup. Usually the subscriptions cleanup is
    called at subscription timeout or at subscription cancel or at
    subscriber delete.

    We have ignored the possibility of this being called from other
    locations, which causes deadlock as we try to grab the
    tn->nametbl_lock while holding it already.

    CPU1:                              CPU2:
    ----------                         ----------------
    tipc_nametbl_publish
    spin_lock_bh(&tn->nametbl_lock)
    tipc_nametbl_insert_publ
    tipc_nameseq_insert_publ
    tipc_subscrp_report_overlap
    tipc_subscrp_get
    tipc_subscrp_send_event
                                       tipc_close_conn
                                       tipc_subscrb_release_cb
                                       tipc_subscrb_delete
                                       tipc_subscrp_put
    tipc_subscrp_put
    tipc_subscrp_kref_release
    tipc_nametbl_unsubscribe
    spin_lock_bh(&tn->nametbl_lock)
    <>

    CPU1:                              CPU2:
    ----------                         ----------------
    tipc_nametbl_stop
    spin_lock_bh(&tn->nametbl_lock)
    tipc_purge_publications
    tipc_nameseq_remove_publ
    tipc_subscrp_report_overlap
    tipc_subscrp_get
    tipc_subscrp_send_event
                                       tipc_close_conn
                                       tipc_subscrb_release_cb
                                       tipc_subscrb_delete
                                       tipc_subscrp_put
    tipc_subscrp_put
    tipc_subscrp_kref_release
    tipc_nametbl_unsubscribe
    spin_lock_bh(&tn->nametbl_lock)
    <>

    In this commit, we advance the calling of tipc_nametbl_unsubscribe()
    from the refcount cleanup to the intended callers.

    Fixes: d094c4d5f5c7 ("tipc: add subscription refcount to avoid invalid delete")
    Reported-by: John Thompson
    Acked-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Ying Xue
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_tcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

02 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • In the function tipc_rcv() we initialize a couple of stack variables
    from the message header before that same header has been validated.
    In rare cases when the arriving header is non-linear, the validation
    function itself may linearize the buffer by calling pskb_may_pull(),
    while the wrongly initialized stack fields are not updated accordingly.

    We fix this in this commit.
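    The ordering issue can be sketched in userspace. In this simplified
    model, validate() may replace the data pointer (standing in for
    linearization), so header fields must be read only afterwards. The
    struct, the 4-byte minimum header and the field layout are all
    assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    unsigned char *data;
    size_t len;
};

/* Validate the header. Like linearization, this may move the payload to
 * a new allocation, invalidating any pointer taken earlier. */
static int validate(struct buf *b)
{
    unsigned char *copy;

    if (b->len < 4)
        return 0;
    copy = malloc(b->len);
    if (!copy)
        return 0;
    memcpy(copy, b->data, b->len);
    free(b->data);
    b->data = copy;
    return 1;
}

/* Correct order: validate first, then derive values from the header. */
static int rcv(struct buf *b, unsigned *user)
{
    const unsigned char *hdr;

    if (!validate(b))
        return -1;
    hdr = b->data;           /* taken after validation, so never stale */
    *user = hdr[0] & 0xf;
    return 0;
}
```

    Reading hdr before calling validate() would leave it pointing at the
    freed buffer, which is the class of bug described above.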

    Reported-by: Matthew Wong
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

18 Feb, 2017

1 commit

  • There are two problems with the function tipc_sk_reinit. Firstly
    it's doing a manual walk over an rhashtable. This is broken as
    an rhashtable can be resized and if you manually walk over it
    during a resize then you may miss entries.

    Secondly it's missing memory barriers as previously the code used
    spinlocks which provide the barriers implicitly.

    This patch fixes both problems.

    Fixes: 07f6c4bc048a ("tipc: convert tipc reference table to...")
    Signed-off-by: Herbert Xu
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Herbert Xu
     

28 Jan, 2017

1 commit


26 Jan, 2017

1 commit


25 Jan, 2017

6 commits

  • In tipc_server_stop(), we iterate over the connections with limiting
    factor as server's idr_in_use. We ignore the fact that this variable
    is decremented in tipc_close_conn(), leading to premature exit.

    In this commit, we iterate until we have no connections left.
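    The premature exit is easy to reproduce in a toy model where closing
    a connection decrements the same counter that bounds the loop. This
    is a sketch with a hypothetical fixed-size table, not the idr-based
    kernel code.

```c
#include <assert.h>

#define MAX_CONN 8

struct server_sim {
    int in_use;              /* number of live connections */
    int conn[MAX_CONN];      /* 1 when the slot holds a connection */
};

static void close_conn(struct server_sim *s, int id)
{
    if (s->conn[id]) {
        s->conn[id] = 0;
        s->in_use--;         /* the loop bound shrinks here */
    }
}

/* Old behaviour: the bound moves towards id, so the loop stops early.
 * Returns the number of connections left open. */
static int stop_buggy(struct server_sim *s)
{
    for (int id = 0; id < s->in_use; id++)
        close_conn(s, id);
    return s->in_use;
}

/* Fixed behaviour: keep scanning ids until nothing is left. */
static int stop_fixed(struct server_sim *s)
{
    for (int id = 0; s->in_use > 0 && id < MAX_CONN; id++)
        close_conn(s, id);
    return s->in_use;
}
```

    With four connections in slots 0 to 3, the first variant closes two
    and leaves two behind; the second closes all four.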

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Tested-by: John Thompson
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • In tipc_conn_sendmsg(), we first queue the request to the outqueue
    followed by the connection state check. If the connection is not
    connected, we should not queue this message.

    In this commit, we reject the messages if the connection state is
    not CF_CONNECTED.
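    A sketch of the reordered check, with the state test placed before
    the message is queued. The conn_sim struct is hypothetical, the flag
    is modelled as a plain bit mask, and the -EPIPE error code is an
    assumption, not taken from the commit text.

```c
#include <assert.h>
#include <errno.h>

#define CF_CONNECTED 0x1u    /* modelled as a bit mask for this sketch */

struct conn_sim {
    unsigned flags;
    int queued;              /* messages sitting in the out queue */
};

/* Reject the message up front instead of queueing it on a dead
 * connection and noticing afterwards. */
static int conn_sendmsg(struct conn_sim *c)
{
    if (!(c->flags & CF_CONNECTED))
        return -EPIPE;       /* assumed error code */

    c->queued++;
    return 0;
}
```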

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Tested-by: John Thompson
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Commit 333f796235a527 ("tipc: fix a race condition leading to
    subscriber refcnt bug") reveals a soft lockup while acquiring
    nametbl_lock.

    Before commit 333f796235a527, we call tipc_conn_shutdown() from
    tipc_close_conn() in the context of tipc_topsrv_stop(). In that
    context, we are allowed to grab the nametbl_lock.

    Commit 333f796235a527 moved tipc_conn_release (renamed from
    tipc_conn_shutdown) to the connection refcount cleanup. This allows
    either tipc_nametbl_withdraw() or tipc_topsrv_stop() to perform the
    cleanup.

    Since tipc_exit_net() first calls tipc_topsrv_stop() and then
    tipc_nametbl_withdraw(), the chances increase for the latter to
    perform the connection cleanup.

    The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
    when it performs the tipc_conn_kref_release() as it tries to grab
    nametbl_lock again while holding it already.
    tipc_nametbl_withdraw() grabs nametbl_lock
    tipc_nametbl_remove_publ()
    tipc_subscrp_report_overlap()
    tipc_subscrp_send_event()
    tipc_conn_sendmsg()
    << if (con->flags != CF_CONNECTED) we do conn_put(),
    triggering the cleanup as refcount=0. >>
    tipc_conn_kref_release
    tipc_sock_release
    tipc_conn_release
    tipc_subscrb_delete
    tipc_subscrp_delete
    tipc_nametbl_unsubscribe << Soft Lockup >>

    The previous changes in this series fix the race conditions fixed
    by commit 333f796235a527. Hence we can now revert the commit.

    Fixes: 333f796235a52727 ("tipc: fix a race condition leading to subscriber refcnt bug")
    Reported-and-Tested-by: John Thompson
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Until now, the generic server framework maintains the connection
    IDs per subscriber in the server's conn_idr. At tipc_close_conn, we
    remove the connection ID from the server list, but the connection is
    valid until we call the refcount cleanup. Hence we have a window
    where the server allocates the same connection to a new subscriber,
    leading to an inconsistent reference count. We have another refcount
    warning when we grab the refcount in tipc_conn_lookup() for
    connections with the CF_CONNECTED flag not set. This usually occurs
    at shutdown when we stop the topology server and withdraw the
    TIPC_CFG_SRV publication, thereby triggering a withdraw message to
    subscribers.

    In this commit, we:
    1. remove the connection from the server list at refcount cleanup.
    2. grab the refcount for a connection only if CF_CONNECTED is set.

    Tested-by: John Thompson
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Until now, the subscribers keep track of the subscriptions using
    reference count at subscriber level. At subscription cancel or
    subscriber delete, we delete the subscription only if the timer
    was pending for the subscription. This approach is incorrect as:
    1. del_timer() is not SMP safe: if on CPU0 the check for a pending
    timer returns true, CPU1 might still schedule the timer callback,
    thereby deleting the subscription. Thus when CPU0 is scheduled,
    it deletes an invalid subscription.
    2. We export tipc_subscrp_report_overlap(), which accesses the
    subscription pointer multiple times. Meanwhile the subscription
    timer can expire thereby freeing the subscription and we might
    continue to access the subscription pointer leading to memory
    violations.

    In this commit, we introduce subscription refcount to avoid deleting
    an invalid subscription.
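    The idea can be sketched with a plain counter: every path that holds
    a pointer to the subscription also holds a reference, and the object
    is freed only when the last reference is dropped. This is a
    single-threaded illustration with hypothetical names; kernel code
    would use an atomic reference count rather than an int.

```c
#include <assert.h>
#include <stdlib.h>

struct sub_sim {
    int refcnt;
    int *freed;              /* set to 1 when the object is released */
};

static struct sub_sim *sub_create(int *freed)
{
    struct sub_sim *s = malloc(sizeof(*s));

    if (!s)
        return NULL;
    s->refcnt = 1;           /* reference owned by the creator */
    s->freed = freed;
    *freed = 0;
    return s;
}

static void sub_get(struct sub_sim *s)
{
    s->refcnt++;
}

/* Drop one reference; free the object when it was the last one. */
static void sub_put(struct sub_sim *s)
{
    assert(s->refcnt > 0);
    if (--s->refcnt == 0) {
        *s->freed = 1;
        free(s);
    }
}
```

    If the overlap report takes a reference before using the pointer, a
    timer expiry in between only drops the owner's reference and the
    object stays valid until the report path calls sub_put().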

    Reported-and-Tested-by: John Thompson
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • We trigger a soft lockup as we grab nametbl_lock twice if the node
    has a pending node up/down or link up/down event while:
    - we process an incoming named message in tipc_named_rcv() and
    perform a tipc_update_nametbl().
    - we have pending backlog items in the name distributor queue
    during a nametable update using tipc_nametbl_publish() or
    tipc_nametbl_withdraw().

    The associated call chains are:
    tipc_named_rcv() Grabs nametbl_lock
    tipc_update_nametbl() (publish/withdraw)
    tipc_node_subscribe()/unsubscribe()
    tipc_node_write_unlock()
    << lockup occurs if an outstanding node/link event
    exists, as we grab nametbl_lock again >>

    tipc_nametbl_withdraw() Grab nametbl_lock
    tipc_named_process_backlog()
    tipc_update_nametbl()
    << rest as above >>

    The function tipc_node_write_unlock(), in addition to releasing the
    lock processes the outstanding node/link up/down events. To do this,
    we need to grab the nametbl_lock again leading to the lockup.

    In this commit we fix the soft lockup by introducing a fast variant of
    node_unlock(), where we just release the lock. We adapt the
    node_subscribe()/node_unsubscribe() to use the fast variants.

    Reported-and-Tested-by: John Thompson
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

21 Jan, 2017

4 commits

  • If the bearer carrying multicast messages supports broadcast, those
    messages will be sent to all cluster nodes, irrespective of whether
    these nodes host any actual destinations socket or not. This is clearly
    wasteful if the cluster is large and there are only a few real
    destinations for the message being sent.

    In this commit we extend the eligibility of the newly introduced
    "replicast" transmit option. We now make it possible for a user to
    select which method he wants to be used, either as a mandatory setting
    via setsockopt(), or as a relative setting where we let the broadcast
    layer decide which method to use based on the ratio between cluster
    size and the message's actual number of destination nodes.

    In the latter case, a sending socket must stick to a previously
    selected method until it enters an idle period of at least 5 seconds.
    This eliminates the risk of message reordering caused by method change,
    i.e., when changes to cluster size or number of destinations would
    otherwise mandate a new method to be used.
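    The relative setting and the idle rule could look roughly like this.
    It is a sketch under stated assumptions: the percentage threshold and
    all names are hypothetical, and only the 5-second idle period comes
    from the description above.

```c
#include <assert.h>

enum mc_method { MC_BROADCAST, MC_REPLICAST };

#define MC_IDLE_SECS 5       /* idle period required before switching */

struct mc_state {
    enum mc_method method;   /* method currently in use */
    long last_sent;          /* time of the previous send, in seconds */
    int used;                /* 0 until the first message is sent */
};

/* Prefer replicast when destinations are a small share of the cluster.
 * ratio_pct is an assumed tunable, not a value from the commit. */
static enum mc_method mc_pick(unsigned dests, unsigned cluster,
                              unsigned ratio_pct)
{
    return dests * 100 <= cluster * ratio_pct ? MC_REPLICAST
                                              : MC_BROADCAST;
}

/* Re-evaluate the method only on first use or after an idle period, so
 * that back-to-back messages never change method and get reordered. */
static enum mc_method mc_select(struct mc_state *st, unsigned dests,
                                unsigned cluster, unsigned ratio_pct,
                                long now)
{
    if (!st->used || now - st->last_sent >= MC_IDLE_SECS)
        st->method = mc_pick(dests, cluster, ratio_pct);
    st->used = 1;
    st->last_sent = now;
    return st->method;
}
```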

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • TIPC multicast messages are currently carried over a reliable
    'broadcast link', making use of the underlying media's ability to
    transport packets as L2 broadcast or IP multicast to all nodes in
    the cluster.

    When the used bearer is lacking that ability, we can instead emulate
    the broadcast service by replicating and sending the packets over as
    many unicast links as needed to reach all identified destinations.
    We now introduce a new TIPC link-level 'replicast' service that does
    this.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a further preparation for the upcoming 'replicast' functionality,
    we add some necessary structs and functions for looking up and returning
    a list of all nodes that host destinations for a given multicast message.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • As a preparation for the 'replicast' functionality we are going to
    introduce in the next commits, we need the broadcast base structure to
    store whether bearer broadcast is available at all from the currently
    used bearer or bearers.

    We do this by adding a new function tipc_bearer_bcast_support() to
    the bearer layer, and letting the bearer selection function in
    bcast.c use this to give a new boolean field, 'bcast_support' the
    appropriate value.

    Reviewed-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

18 Jan, 2017

1 commit


17 Jan, 2017

1 commit

  • Until now, we allocate memory always with GFP_ATOMIC flag.
    When the system is under memory pressure and a user tries to send,
    the send fails due to low memory. However, the user application
    can wait for free memory if we allocate it using GFP_KERNEL flag.

    In this commit, we allocate memory with GFP_KERNEL for all user
    allocations.

    Reported-by: Rune Torgersen
    Acked-by: Jon Maloy
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

04 Jan, 2017

3 commits

  • The socket code currently handles link congestion by either blocking
    and trying to send again when the congestion has abated, or just
    returning to the user with -EAGAIN and letting him retry later.

    This mechanism is prone to starvation, because the wakeup algorithm is
    non-atomic. During the time the link issues a wakeup signal, until the
    socket wakes up and re-attempts sending, other senders may have come
    in between and occupied the free buffer space in the link. This in turn
    may lead to a socket having to make many send attempts before it is
    successful. In extremely loaded systems we have observed latency times
    of several seconds before a low-priority socket is able to send out a
    message.

    In this commit, we simplify this mechanism and reduce the risk of the
    described scenario happening. When a message is sent via a
    congested link, we now let it be added to the link's backlog queue
    anyway, thus permitting an oversubscription of one message per source
    socket. We still create a wakeup item and return an error code, hence
    instructing the sender to block or stop sending. Only when enough space
    has been freed up in the link's backlog queue do we issue a wakeup event
    that allows the sender to continue with the next message, if any.

    The fact that a socket now can consider a message sent even when the
    link returns a congestion code means that the sending socket code can
    be simplified. Also, since this is a good opportunity to get rid of the
    obsolete 'mtu change' condition in the three socket send functions, we
    now choose to refactor those functions completely.

    Signed-off-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • During multicast reception we currently use a simple linked list with
    push/pop semantics to store port numbers.

    We now see a need for a more generic list for storing values of type
    u32. We therefore make some modifications to this list, while replacing
    the prefix 'tipc_plist_' with 'u32_'. We also add a couple of new
    functions which will come to use in the next commits.
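    A generic push/pop list of u32 values can be sketched in a few lines
    of userspace C. The function names follow the 'u32_' prefix mentioned
    above but are otherwise illustrative, as is the convention of
    returning 0 when popping from an empty list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct u32_item {
    struct u32_item *next;
    uint32_t value;
};

/* Push a value on the front of the list; returns 0 on allocation
 * failure, 1 on success. */
static int u32_push(struct u32_item **head, uint32_t value)
{
    struct u32_item *item = malloc(sizeof(*item));

    if (!item)
        return 0;
    item->value = value;
    item->next = *head;
    *head = item;
    return 1;
}

/* Pop the most recently pushed value; returns 0 for an empty list. */
static uint32_t u32_pop(struct u32_item **head)
{
    struct u32_item *item = *head;
    uint32_t value;

    if (!item)
        return 0;
    value = item->value;
    *head = item->next;
    free(item);
    return value;
}
```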

    Acked-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The functions tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() are very
    similar. The latter function is also called from two locations, and
    there will be more in the coming commits, which will all need to test on
    different conditions.

    Instead of making yet another duplicate of the function, we now
    introduce a new macro tipc_wait_for_cond() where the wakeup condition
    can be stated as an argument to the call. This macro replaces all
    current and future uses of the two functions, which can now be
    eliminated.
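    The benefit of passing the condition as a macro argument is that the
    expression is re-evaluated on every pass of the wait loop. A
    userspace sketch, where the macro signature, the step argument (a
    stand-in for sleeping) and the tick budget are all assumptions and do
    not mirror the kernel macro:

```c
#include <assert.h>
#include <errno.h>

/* Loop until cond becomes true or the budget of steps is exhausted.
 * rc is set to 0 on success and -EAGAIN on timeout. cond is an
 * expression, so each caller can wait on a different condition. */
#define wait_for_cond(rc, budget, step, cond)        \
    do {                                             \
        long left_ = (budget);                       \
        (rc) = 0;                                    \
        while (!(cond)) {                            \
            if (left_-- <= 0) {                      \
                (rc) = -EAGAIN;                      \
                break;                               \
            }                                        \
            step;                                    \
        }                                            \
    } while (0)

/* Hypothetical socket whose congestion clears after some ticks. */
struct sock_sim {
    int congested_for;
};

static void tick(struct sock_sim *s)
{
    if (s->congested_for > 0)
        s->congested_for--;
}

static int wait_not_congested(struct sock_sim *s, long budget)
{
    int rc;

    wait_for_cond(rc, budget, tick(s), s->congested_for == 0);
    return rc;
}
```

    A second caller can reuse the same macro with a different expression,
    for example a test on queue length, without another copy of the loop.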

    Acked-by: Parthasarathy Bhuvaragan
    Acked-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

24 Dec, 2016

1 commit

  • In commit 6f00089c7372 ("tipc: remove SS_DISCONNECTING state") the
    check for socket type is in the wrong place, causing a closing socket
    to always send out a FIN message even when the socket was never
    connected. This is normally harmless, since the destination node for
    such messages most often is zero, and the message will be dropped, but
    it is still a wrong and confusing behavior.

    We fix this in this commit.

    Reviewed-by: Parthasarathy Bhuvaragan
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

17 Dec, 2016

1 commit

  • Pull vfs updates from Al Viro:

    - more ->d_init() stuff (work.dcache)

    - pathname resolution cleanups (work.namei)

    - a few missing iov_iter primitives - copy_from_iter_full() and
    friends. Either copy the full requested amount, advance the iterator
    and return true, or fail, return false and do _not_ advance the
    iterator. Quite a few open-coded callers converted (and became more
    readable and harder to fuck up that way) (work.iov_iter)

    - several assorted patches, the big one being logfs removal
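    The all-or-nothing contract described for the ..._full() primitives
    can be modelled in userspace like this. The iter_sim struct and
    iter_copy_full() are hypothetical and only mimic the stated
    semantics: copy the whole amount and advance, or copy nothing and
    leave the iterator where it was.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct iter_sim {
    const unsigned char *data;
    size_t pos;              /* current read position */
    size_t len;              /* total bytes available */
};

/* Copy exactly n bytes and advance, or fail without advancing. */
static bool iter_copy_full(void *dst, size_t n, struct iter_sim *it)
{
    if (it->len - it->pos < n)
        return false;        /* short source: iterator is untouched */

    memcpy(dst, it->data + it->pos, n);
    it->pos += n;
    return true;
}
```

    Callers then need a single boolean check instead of comparing a
    returned byte count against the requested size.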

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    logfs: remove from tree
    vfs: fix put_compat_statfs64() does not handle errors
    namei: fold should_follow_link() with the step into not-followed link
    namei: pass both WALK_GET and WALK_MORE to should_follow_link()
    namei: invert WALK_PUT logics
    namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
    namei: saner calling conventions for mountpoint_last()
    namei.c: get rid of user_path_parent()
    switch getfrag callbacks to ..._full() primitives
    make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
    [iov_iter] new primitives - copy_from_iter_full() and friends
    don't open-code file_inode()
    ceph: switch to use of ->d_init()
    ceph: unify dentry_operations instances
    lustre: switch to use of ->d_init()

    Linus Torvalds