12 Jun, 2014

1 commit

  • The connection struct with nodeid 0 is the listening socket,
    not a connection to another node. The sctp resend function
    was not checking that the nodeid was valid (non-zero), so it
    would mistakenly get and resend on the listening connection
    when nodeid was zero.

    Signed-off-by: Lidong Zhong
    Signed-off-by: David Teigland

    Lidong Zhong
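
The guard described above can be sketched in userspace C. This is only an illustration: `struct conn`, `conn_table`, and `lookup_for_resend` are hypothetical names, not the actual lowcomms code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical connection record; slot 0 stands for the listening socket. */
struct conn {
    int nodeid;
    int in_use;
};

struct conn conn_table[MAX_NODES];

/* Return the peer connection to resend on, or NULL if nodeid is not a peer. */
struct conn *lookup_for_resend(int nodeid)
{
    /* nodeid 0 is the listening socket, never a connection to another node. */
    if (nodeid <= 0 || nodeid >= MAX_NODES)
        return NULL;
    if (!conn_table[nodeid].in_use)
        return NULL;
    return &conn_table[nodeid];
}
```

With this check in place, a resend request carrying nodeid 0 is rejected instead of being routed to the listening socket.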
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access potentially
    reads freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses
    the length argument. And since nobody cared about its value,
    lots of call sites pass in arbitrary values such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
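
A small userspace model of the resulting callback shape, assuming hypothetical types `struct buf` and `struct sock_model` (these are not kernel structures). The point is only that the notification takes no length, so nothing reads the buffer after it has been queued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf {
    size_t len;
};

/* Hypothetical stand-in for a socket with a receive queue of depth one. */
struct sock_model {
    struct buf *queued;
    int ready_calls;
    void (*data_ready)(struct sock_model *sk); /* no length argument */
};

void count_ready(struct sock_model *sk)
{
    sk->ready_calls++;
}

void queue_and_notify(struct sock_model *sk, struct buf *b)
{
    sk->queued = b;     /* from here on a consumer may free b */
    sk->data_ready(sk); /* so the notification must not touch b->len */
}
```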
     

26 Jan, 2014

1 commit

  • Pull networking updates from David Miller:

    1) BPF debugger and asm tool by Daniel Borkmann.

    2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

    3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

    4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match. From Ben Hutchings.

    5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

    6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

    7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

    8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

    9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

    10) Fix ipv6 router reachability probing, from Jiri Benc.

    11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

    12) Support tunneling in GRO layer, from Jerry Chu.

    13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

    14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI. From Atzm Watanabe.

    15) New "Heavy Hitter" qdisc, from Terry Lam.

    16) Significantly improve the IPSEC support in pktgen, from Fan Du.

    17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
    Herbert.

    18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

    19) Allow openvswitch to use mmap'd netlink, from Thomas Graf.

    20) Key TCP metrics blobs also by source address, not just destination
    address. From Christoph Paasch.

    21) Support 10G in generic phylib. From Andy Fleming.

    22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided. From Tom Herbert.

    The wireless and netfilter folks have been busy little bees too.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
    net/cxgb4: Fix referencing freed adapter
    ipv6: reallocate addrconf router for ipv6 address when lo device up
    fib_frontend: fix possible NULL pointer dereference
    rtnetlink: remove IFLA_BOND_SLAVE definition
    rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
    qlcnic: update version to 5.3.55
    qlcnic: Enhance logic to calculate msix vectors.
    qlcnic: Refactor interrupt coalescing code for all adapters.
    qlcnic: Update poll controller code path
    qlcnic: Interrupt code cleanup
    qlcnic: Enhance Tx timeout debugging.
    qlcnic: Use bool for rx_mac_learn.
    bonding: fix u64 division
    rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
    sfc: Use the correct maximum TX DMA ring size for SFC9100
    Add Shradha Shah as the sfc driver maintainer.
    net/vxlan: Share RX skb de-marking and checksum checks with ovs
    tulip: cleanup by using ARRAY_SIZE()
    ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
    net/cxgb4: Don't retrieve stats during recovery
    ...

    Linus Torvalds
     

22 Jan, 2014

1 commit


16 Dec, 2013

1 commit

  • The recovery time for a failed node was taking a long
    time because the failed node could not perform the full
    shutdown process. Removing the linger time speeds this
    up. The dlm does not care what happens to messages to
    or from the failed node.

    Signed-off-by: Dongmao Zhang
    Signed-off-by: David Teigland

    Dongmao Zhang
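
The message does not say which call the patch uses. On an ordinary userspace socket, one standard way to drop the linger time is `SO_LINGER` with a zero timeout; the sketch below shows that, under the assumption that this is the effect intended. `set_zero_linger` and `get_linger` are hypothetical helper names.

```c
#include <assert.h>
#include <sys/socket.h>

/* Enable SO_LINGER with a zero timeout so close() does not wait
 * for unsent data to be delivered. */
int set_zero_linger(int fd)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Read the current linger setting back, mainly for checking. */
int get_linger(int fd, struct linger *out)
{
    socklen_t len = sizeof(*out);

    return getsockopt(fd, SOL_SOCKET, SO_LINGER, out, &len);
}
```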
     

19 Jun, 2013

1 commit


15 Jun, 2013

6 commits

  • For TCP we disable Nagle and I cannot think of why it would be needed
    for SCTP. When disabled it seems to improve dlm_lock operations like it
    does for TCP.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
     
  • Currently if a SCTP send fails, we lose the data we were trying
    to send because the writequeue_entry is released when we do the send.
    When this happens other nodes will then hang waiting for a reply.

    This adds support for SCTP to retry the send operation.

    I also removed the retry limit for SCTP use, because we want
    to make sure we try every path during init time and for longer
    failures we want to continually retry in case paths come back up
    while trying other paths. We will do this until userspace tells us
    to stop.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
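
The retry idea above can be sketched as a queue whose head entry is unlinked only after its send succeeds. All names here (`struct wq`, `flush_queue`, `flaky_send`) are hypothetical, and the transport is a stub that fails a configurable number of times.

```c
#include <assert.h>
#include <stddef.h>

struct wq_entry {
    const char *data;
    size_t len;
    struct wq_entry *next;
};

struct wq {
    struct wq_entry *head;
};

/* Transport hook: returns 0 on success, -1 on failure. */
typedef int (*send_fn)(const char *data, size_t len);

/* Demo transport that fails while flaky_failures_left is positive. */
int flaky_failures_left;

int flaky_send(const char *data, size_t len)
{
    (void)data;
    (void)len;
    if (flaky_failures_left > 0) {
        flaky_failures_left--;
        return -1;
    }
    return 0;
}

/* Send entries in order. Stop at the first failure and leave that
 * entry at the head so a later call retries it. Returns entries sent. */
int flush_queue(struct wq *q, send_fn send)
{
    int sent = 0;

    while (q->head) {
        struct wq_entry *e = q->head;

        if (send(e->data, e->len) != 0)
            break;          /* keep e queued; data is not lost */
        q->head = e->next;  /* unlink only after a successful send */
        sent++;
    }
    return sent;
}
```

Because there is no retry limit in this sketch, the caller decides when to stop calling `flush_queue`, which mirrors the "until userspace tells us to stop" policy in the message.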
     
  • Currently, if we cannot create an association to the first IP addr
    that is added to DLM, the SCTP init assoc code will just retry
    the same IP. This patch adds a simple failover scheme where we
    will try one of the addresses that was passed into DLM.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
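
The message does not say how the next address is picked. A minimal sketch, assuming simple round-robin over the configured addresses (`struct addr_list` and `next_addr_index` are hypothetical):

```c
#include <assert.h>

struct addr_list {
    int count;   /* number of addresses configured for the node */
    int current; /* index of the address tried most recently */
};

/* Advance to the next configured address, wrapping around.
 * Returns the index to try next, or -1 if none are configured. */
int next_addr_index(struct addr_list *al)
{
    if (al->count <= 0)
        return -1;
    al->current = (al->current + 1) % al->count;
    return al->current;
}
```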
     
  • We should be testing and clearing the init pending bit because later,
    when sctp_init_assoc is called again, it will check that the bit is
    not set and then set it.

    We do not want to touch CF_CONNECT_PENDING here because we will queue
    swork and process_send_sockets will then call the connect_action function.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
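
The test-and-clear / test-and-set pairing described above can be modelled in userspace with C11 atomics. `INIT_PENDING_BIT` and both helpers are hypothetical stand-ins for the kernel bit operations, shown only to make the ordering concrete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define INIT_PENDING_BIT 4u /* hypothetical bit number */

/* Atomically set bit nr; return whether it was already set. */
bool test_and_set_flag(atomic_ulong *flags, unsigned nr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Atomically clear bit nr; return whether it was set before. */
bool test_and_clear_flag(atomic_ulong *flags, unsigned nr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}
```

If the failure path forgets the clear, the next init attempt sees the bit still set and backs off, which is the stall the commit avoids.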
     
  • sctp_assoc was not getting set so later lookups failed.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
     
  • We were clearing the base con's init pending flags, but the
    con for the node was the one with the pending bit set.

    Signed-off-by: Mike Christie
    Signed-off-by: David Teigland

    Mike Christie
     

10 Apr, 2013

1 commit

  • This patch introduces an UAPI header for the SCTP protocol,
    so that we can facilitate the maintenance and development of
    user land applications or libraries, in particular in terms
    of header synchronization.

    To not break compatibility, some fragments from lksctp-tools'
    netinet/sctp.h have been carefully included, while taking care
    that neither kernel nor user land breaks, so both compile fine
    with this change (for lksctp-tools I tested with the old
    netinet/sctp.h header and with a newly adapted one that includes
    the uapi sctp header). lksctp-tools smoke test run through
    successfully as well in both cases.

    Suggested-by: Neil Horman
    Cc: Neil Horman
    Cc: Vlad Yasevich
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones, which look like this:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
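
To make the three-argument form concrete, here is a userspace sketch of a singly linked hash-list iterator in which the entry pointer doubles as the cursor. The names (`struct hnode`, `for_each_hentry`, `entry_or_null`) are hypothetical and the macros are simplified; they are not the definitions from linux/list.h. `__typeof__` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

struct hnode {
    struct hnode *next;
};

struct hhead {
    struct hnode *first;
};

/* Map a node pointer back to its containing entry, or NULL at the end. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument iterator: no separate node cursor is needed. */
#define for_each_hentry(pos, head, member)                                \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);  \
         pos;                                                             \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
    int value;
    struct hnode link;
};

int sum_items(struct hhead *head)
{
    struct item *it;
    int sum = 0;

    for_each_hentry(it, head, link)
        sum += it->value;
    return sum;
}
```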
     

02 Nov, 2012

1 commit


13 Aug, 2012

1 commit


10 Aug, 2012

2 commits


09 Aug, 2012

1 commit

  • A deadlock sometimes occurs between dlm_controld closing
    a lowcomms connection through configfs and dlm_send looking
    up the address for a new connection in configfs.

    dlm_controld does a configfs rmdir which calls
    dlm_lowcomms_close which waits for dlm_send to
    cancel work on the workqueues.

    The dlm_send workqueue thread has called
    tcp_connect_to_sock which calls dlm_nodeid_to_addr
    which does a configfs lookup and blocks on a lock
    held by dlm_controld in the rmdir path.

    The solution here is to save the node addresses within
    the lowcomms code so that the lowcomms workqueue does
    not need to step through configfs to get a node address.

    dlm_controld:
    wait_for_completion+0x1d/0x20
    __cancel_work_timer+0x1b3/0x1e0
    cancel_work_sync+0x10/0x20
    dlm_lowcomms_close+0x4c/0xb0 [dlm]
    drop_comm+0x22/0x60 [dlm]
    client_drop_item+0x26/0x50 [configfs]
    configfs_rmdir+0x180/0x230 [configfs]
    vfs_rmdir+0xbd/0xf0
    do_rmdir+0x103/0x120
    sys_rmdir+0x16/0x20

    dlm_send:
    mutex_lock+0x2b/0x50
    get_comm+0x34/0x140 [dlm]
    dlm_nodeid_to_addr+0x18/0xd0 [dlm]
    tcp_connect_to_sock+0xf4/0x2d0 [dlm]
    process_send_sockets+0x1d2/0x260 [dlm]
    worker_thread+0x170/0x2a0

    Signed-off-by: David Teigland

    David Teigland
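
A userspace sketch of the fix's idea: keep node addresses in a small table guarded by its own lock, so the sending path never has to take the configuration lock. The table layout and the names `cache_add_addr` and `cache_nodeid_to_addr` are assumptions for illustration, not the lowcomms implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_NODES 16
#define ADDR_LEN  64

struct node_addr {
    int used;
    int nodeid;
    char addr[ADDR_LEN];
};

static struct node_addr addr_cache[MAX_NODES];
static pthread_mutex_t addr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the configuration path when a node address is added. */
int cache_add_addr(int nodeid, const char *addr)
{
    int i, ret = -1;

    if (strlen(addr) >= ADDR_LEN)
        return -1;
    pthread_mutex_lock(&addr_lock);
    for (i = 0; i < MAX_NODES; i++) {
        if (!addr_cache[i].used) {
            addr_cache[i].used = 1;
            addr_cache[i].nodeid = nodeid;
            strcpy(addr_cache[i].addr, addr);
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&addr_lock);
    return ret;
}

/* Called from the send path; touches only the local cache lock. */
int cache_nodeid_to_addr(int nodeid, char *out, size_t outlen)
{
    int i, ret = -1;

    pthread_mutex_lock(&addr_lock);
    for (i = 0; i < MAX_NODES; i++) {
        if (addr_cache[i].used && addr_cache[i].nodeid == nodeid &&
            strlen(addr_cache[i].addr) < outlen) {
            strcpy(out, addr_cache[i].addr);
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&addr_lock);
    return ret;
}
```

The cache lock is held only for the copy, so the lookup cannot block on whatever the removal path is waiting for.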
     

27 Apr, 2012

1 commit

  • During lowcomms shutdown, a new connection could possibly
    be created, and attempt to use a workqueue that's been
    destroyed. Similarly, during startup, a new connection
    could attempt to use a workqueue that's not been set up
    yet. Add a global variable to indicate when new connections
    are allowed.

    Based on patch by: Christine Caulfield

    Reported-by: dann frazier
    Reviewed-by: dann frazier
    Signed-off-by: David Teigland

    David Teigland
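
A minimal sketch of the gating flag, with hypothetical names (`allow_conn`, `comms_start`, `comms_stop`, `new_conn`). A real implementation would also need to order the flag update against workqueue setup and teardown, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONNS 4

struct conn {
    int nodeid;
};

static bool allow_conn; /* false until startup completes */
static struct conn slots[MAX_CONNS];

void comms_start(void)
{
    /* ... workqueues would be created here, before the flag is set ... */
    allow_conn = true;
}

void comms_stop(void)
{
    allow_conn = false;
    /* ... workqueues would be destroyed here, after the flag is cleared ... */
}

/* Refuse to create a connection while comms are not running. */
struct conn *new_conn(int nodeid)
{
    if (!allow_conn)
        return NULL;
    if (nodeid < 0 || nodeid >= MAX_CONNS)
        return NULL;
    slots[nodeid].nodeid = nodeid;
    return &slots[nodeid];
}
```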
     

22 Mar, 2012

1 commit


21 Mar, 2012

1 commit


09 Mar, 2012

1 commit


23 Nov, 2011

1 commit


07 Jul, 2011

1 commit


31 Mar, 2011

1 commit


11 Mar, 2011

1 commit


12 Feb, 2011

1 commit

  • The recent commit to use cmwq for send and recv threads
    dcce240ead802d42b1e45ad2fcb2ed4a399cb255 introduced problems,
    apparently due to multiple workqueue threads. Single threads
    make the problems go away, so return to that until we fully
    understand the concurrency issues with multiple threads.

    Signed-off-by: David Teigland

    David Teigland
     

14 Dec, 2010

1 commit


13 Nov, 2010

3 commits

  • Calling cond_resched() after every send can unnecessarily
    degrade performance. Go back to an old method of scheduling
    after 25 messages.

    Signed-off-by: Bob Peterson
    Signed-off-by: David Teigland

    Bob Peterson
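
The batching described above, sketched in userspace with `sched_yield()` standing in for `cond_resched()`. `YIELD_EVERY` and `send_all` are hypothetical names; the function returns the number of yields so the behaviour can be checked.

```c
#include <assert.h>
#include <sched.h>

#define YIELD_EVERY 25

/* Send `pending` messages, yielding the CPU once per YIELD_EVERY
 * messages instead of after every send. Returns the yield count. */
int send_all(int pending)
{
    int count = 0, yields = 0;

    while (pending-- > 0) {
        /* ... one message would be transmitted here ... */
        if (++count >= YIELD_EVERY) {
            sched_yield(); /* userspace stand-in for cond_resched() */
            yields++;
            count = 0;
        }
    }
    return yields;
}
```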
     
  • Nagling doesn't help and can sometimes hurt dlm comms.

    Signed-off-by: David Teigland

    David Teigland
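
For reference, turning Nagle off on a userspace TCP socket is done with `TCP_NODELAY`. The helper names below are hypothetical, and the kernel code sets the option through in-kernel interfaces rather than this call.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle so small messages are sent without coalescing delay. */
int disable_nagle(int fd)
{
    int one = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int nagle_disabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```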
     
  • So far as I can tell, there is no reason to use a single-threaded
    send workqueue for dlm, since it may need to send to several sockets
    concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
    any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
    latency sensitive (and to avoid a priority inversion wrt GFS2's
    glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
    that (even though with current cluster infrastructure, it doesn't
    make sense as the node will most likely land up ejected from the
    cluster) in the future.

    Signed-off-by: Steven Whitehouse
    Cc: Tejun Heo
    Signed-off-by: David Teigland

    Steven Whitehouse
     

12 Nov, 2010

1 commit

  • In the normal regime where an application uses non-blocking I/O
    writes on a socket, they will handle -EAGAIN and use poll() to
    wait for send space.

    They don't actually sleep on the socket I/O write.

    But kernel level RPC layers that do socket I/O operations directly
    and key off of -EAGAIN on the write() to "try again later" don't
    use poll(), they instead have their own sleeping mechanism and
    rely upon ->sk_write_space() to trigger the wakeup.

    So they do effectively sleep on the write(), but this mechanism
    alone does not let the socket layers know what's going on.

    Therefore they must emulate what would have happened, otherwise
    TCP cannot possibly see that the connection is application window
    size limited.

    Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
    bumping ->sk_write_pending as needed when we hit the send buffer
    limits.

    This should make TCP send buffer size auto-tuning and the
    ->sk_write_space() callback invocations actually happen.

    Signed-off-by: David S. Miller
    Signed-off-by: David Teigland

    David Miller
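
The "normal regime" mentioned at the start of the message, where an application handles EAGAIN and waits in poll() for send space, looks roughly like this in userspace. `write_all_nonblock` is a hypothetical helper; it illustrates the application side only, not the kernel-side change.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send len bytes without blocking in send(); on EAGAIN, wait in
 * poll() until the socket reports send space. Returns len or -1. */
ssize_t write_all_nonblock(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = send(fd, buf + done, len - done, MSG_DONTWAIT);

        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };

            /* Sleeping in poll() is what lets the socket layer see
             * that the writer is waiting for space. */
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;
    }
    return (ssize_t)done;
}
```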
     

06 Aug, 2010

1 commit

  • hlist_for_each_entry binds its first argument to a non-null value, and thus
    any null test on the value of that argument is superfluous.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    iterator I;
    expression x,E,E1,E2;
    statement S,S1,S2;
    @@

    I(x,...) { }
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: David Teigland

    Julia Lawall
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

01 Dec, 2009

1 commit

  • Replace all GFP_KERNEL and ls_allocation with GFP_NOFS.
    ls_allocation would be GFP_KERNEL for userland lockspaces
    and GFP_NOFS for file system lockspaces.

    It was discovered that any lockspaces on the system can
    affect all others by triggering memory reclaim in the
    file system which could in turn call back into the dlm
    to acquire locks, deadlocking dlm threads that were
    shared by all lockspaces, like dlm_recv.

    Signed-off-by: David Teigland

    David Teigland
     

01 Oct, 2009

2 commits

  • The code to set up sctp sockets was not using the sockfd_lookup()
    and sockfd_put() routines to translate an fd to a socket. The
    direct fget and fput calls were resulting in error messages from
    alloc_fd().

    Also clean up two log messages and remove a third, related to
    setting up sctp associations.

    Signed-off-by: David Teigland

    David Teigland
     
  • The recently added dlm_lowcomms_connect_node() from
    391fbdc5d527149578490db2f1619951d91f3561 does not work
    when using SCTP instead of TCP. The sctp connection code
    has nothing to do without data to send. Check for no data
    in the sctp connection code and do nothing instead of
    triggering a BUG. Also have connect_node() do nothing
    when the protocol is sctp.

    Signed-off-by: David Teigland

    David Teigland
     

25 Aug, 2009

2 commits

  • Using kernel_sendpage() is cleaner and safer than following
    sock->ops ourselves.

    Signed-off-by: Paolo Bonzini
    Signed-off-by: David Teigland

    Paolo Bonzini
     
  • Closing a connection to a node can create problems if there are
    outstanding messages for that node. The problems include dlm_send
    spinning attempting to reconnect, or BUG from tcp_connect_to_sock()
    attempting to use a partially closed connection.

    To cleanly close a connection, we now first attempt to send any pending
    messages, cancel any remaining workqueue work, and flag the connection
    as closed to avoid reconnect attempts.

    Signed-off-by: Lars Marowsky-Bree
    Signed-off-by: Christine Caulfield
    Signed-off-by: David Teigland

    Lars Marowsky-Bree
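
The close sequence above, reduced to a toy state machine. Every name here is hypothetical and the "send" is just a counter; the sketch only shows the order of the three steps and how the closed flag blocks reconnects.

```c
#include <assert.h>

struct conn {
    int pending;     /* messages still queued for this node */
    int sent;        /* messages handed to the transport */
    int work_queued; /* nonzero while deferred work is scheduled */
    int closed;      /* set once closed; blocks reconnect attempts */
};

void conn_close(struct conn *c)
{
    /* 1. Try to send whatever is still pending. */
    while (c->pending > 0) {
        c->pending--;
        c->sent++;
    }
    /* 2. Cancel any remaining deferred work. */
    c->work_queued = 0;
    /* 3. Flag the connection so nothing tries to reconnect. */
    c->closed = 1;
}

/* Returns 0 if a reconnect may proceed, -1 if the connection is closed. */
int conn_reconnect(struct conn *c)
{
    return c->closed ? -1 : 0;
}
```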