30 Nov, 2005

2 commits

  • This patch marks various variables const in net/; the goal is to
    move them to the .rodata section so that they can't false-share
    cachelines with things that get written to, as well as potentially
    helping gcc a bit with optimisations. (These were found using a gcc
    patch that warns about such variables.)
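
    For illustration (a minimal sketch, not taken from the patch; the
    table name is hypothetical), the change pattern is simply:

        /* before: lives in .data and may share a cacheline with
         * frequently written variables */
        static int example_table[4] = { 1, 2, 4, 8 };

        /* after: lives in .rodata, so it cannot false-share with
         * anything that gets written to */
        static const int example_table[4] = { 1, 2, 4, 8 };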

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
     
  • The tcp_ehash hash table gets too big on systems with really big memory.
    It is worse on systems with pages larger than 4KB. It wastes memory that
    could be better used. It also makes the netstat command slow because reading
    /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.

    The default value should not be larger for larger page sizes. It seems
    that the effect of page size is an unintended error dating back a long
    time. I also wonder if the default value really should be a larger
    fraction of memory for systems with more memory. While systems with
    really big ram can afford more space for hash tables, it is not clear to
    me that they benefit from increasing the allocation ratio for this table.

    The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
    mm/page_alloc.c:alloc_large_system_hash.

    tcp_init calls alloc_large_system_hash passing parameters-
    bucketsize=sizeof(struct tcp_ehash_bucket)
    numentries=thash_entries
    scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
    limit=0

    On i386, PAGE_SHIFT is 12 for a page size of 4K
    On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K

    The num_physpages test above makes the allocation take a larger fraction
    of the total memory on systems with larger memory. The threshold size
    for a i386 system is 512MB. For an ia64 system with 16KB pages the
    threshold is 2GB.

    For smaller memory systems-
    On i386, scale = (27 - 12) = 15
    On ia64, scale = (27 - 14) = 13
    For larger memory systems-
    On i386, scale = (25 - 12) = 13
    On ia64, scale = (25 - 14) = 11

    For the rest of this discussion, I'll just track the larger memory case.

    The default behavior has numentries=thash_entries=0, so the allocated
    size is determined by either scale or by the default limit of 1/16 of
    total memory.

    In alloc_large_system_hash-
    | numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
    | numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
    | numentries >>= 20 - PAGE_SHIFT;
    | numentries <<= 20 - PAGE_SHIFT;
    makes numentries the number of pages of total memory, rounded up to
    a whole megabyte.

    Then-
    | if (scale > PAGE_SHIFT)
    |         numentries >>= (scale - PAGE_SHIFT);
    | else
    |         numentries <<= (PAGE_SHIFT - scale);

    On i386, numentries >>= (13 - 12), so numentries is 1/8192 of
    bytes of total memory.
    On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
    bytes of total memory.

    Finally-
    | size = bucketsize << log2qty;

    bucketsize is 16, so size is 16 times numentries, rounded
    down to a power of two.

    On i386, size is 1/512 of bytes of total memory.
    On ia64, size is 1/128 of bytes of total memory.

    For smaller systems the results are
    On i386, size is 1/2048 of bytes of total memory.
    On ia64, size is 1/512 of bytes of total memory.

    The large page effect can be removed simply by replacing the use of
    PAGE_SHIFT with a constant of 12 in the calls to
    alloc_large_system_hash. That makes them more like the other uses of
    that function from fs/inode.c and fs/dcache.c.
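
    As a sketch (the call site is paraphrased from the 2.6.x
    net/ipv4/tcp.c of the time, so argument names and order here are
    assumptions), the change would look like:

    | tcp_ehash = (struct tcp_ehash_bucket *)
    |         alloc_large_system_hash("TCP established",
    |                                 sizeof(struct tcp_ehash_bucket),
    |                                 thash_entries,
    |                                 (num_physpages >= 128 * 1024) ?
    |                                         (25 - 12) : (27 - 12),
    |                                 HASH_HIGHMEM,
    |                                 &tcp_ehash_size,
    |                                 NULL,
    |                                 0);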

    Signed-off-by: David S. Miller

    Mike Stroyan
     

11 Nov, 2005

2 commits

  • Minor spelling fixes for TCP code.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • This is an updated version of the RFC3465 ABC patch originally
    for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting
    bytes ack'd rather than packets when updating congestion control.

    The original ABC described in the RFC applied to a Reno-style
    algorithm. For advanced congestion control there is little
    change after leaving slow start.
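
    In Reno terms the core of byte counting looks roughly like this (a
    sketch of the idea, not the patch itself; the tcp_sock fields are
    taken from the 2.6.x sources, and bytes_in_ack is hypothetical):

        /* accumulate acknowledged bytes (Appropriate Byte Counting) */
        tp->bytes_acked += bytes_in_ack;

        /* open cwnd once per full MSS worth of acked data, rather
         * than once per ACK received */
        while (tp->bytes_acked >= tp->mss_cache) {
                tp->bytes_acked -= tp->mss_cache;
                if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                        tp->snd_cwnd++;
        }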

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

06 Nov, 2005

1 commit

  • This patch randomizes the port selected on bind() for connections,
    to make attacks that rely on guessing the local port harder. It
    should also be faster in most cases because there is no need for a
    global lock.
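
    The selection is, in essence (a sketch of the idea, not the patch;
    the range values are the usual ip_local_port_range defaults, and
    net_random() is the kernel PRNG of the era):

        /* probe for a free port starting from a random offset in the
         * ephemeral range, instead of from one shared rover counter */
        int low = 32768, high = 61000;
        int remaining = (high - low) + 1;
        int rover = low + net_random() % remaining;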

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Arnaldo Carvalho de Melo

    Stephen Hemminger
     

02 Sep, 2005

2 commits

  • I've finally found a potential cause of the sk_forward_alloc underflows
    that people have been reporting sporadically.

    When tcp_sendmsg tacks extra bits onto an existing TCP_PAGE we don't
    check sk_forward_alloc, even though a large amount of time may have
    elapsed since we allocated the page. In the meantime someone could've
    come along and liberated packets and reclaimed sk_forward_alloc memory.

    This patch makes tcp_sendmsg check sk_forward_alloc every time as we
    do in do_tcp_sendpages.
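
    In sketch form (paraphrasing the 2.6.x page-reuse path in
    tcp_sendmsg; the helper is the one introduced by the companion
    patch below):

        /* re-verify the forward-alloc quota before copying more data
         * into the cached page, exactly as do_tcp_sendpages does */
        if (!sk_stream_wmem_schedule(sk, copy))
                goto wait_for_memory;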

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch introduces sk_stream_wmem_schedule as a short-hand for
    the sk_forward_alloc checking on egress.
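
    In essence the helper reads (paraphrased from the include/net/sock.h
    of the time):

        static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
        {
                /* fast path: the quota already covers the request;
                 * otherwise fall back to the full accounting */
                return size <= sk->sk_forward_alloc ||
                       sk_stream_mem_schedule(sk, size, 0);
        }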

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

30 Aug, 2005

17 commits

  • This patch puts mostly-read-only data in the right section
    (read_mostly), to help sharing of these data between CPUs without
    memory ping-pongs.

    On one of my production machines, tcp_statistics was sitting in a
    heavily modified cache line, so *every* SNMP update had to force a
    reload.
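
    The annotation itself is just (a generic illustration with a
    hypothetical variable; __read_mostly places the object in the
    .data.read_mostly section):

        /* heavily read, rarely written: keep it away from hot written
         * data so its cacheline stays clean and shared across CPUs */
        static int sysctl_tcp_example __read_mostly = 1;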

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(),
    minimal renaming/moving done in this changeset to ease review.

    Most of it is just changes of struct tcp_sock * to struct sock * parameters.

    With this we move to a state closer to two interesting goals:

    1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, to be
    used by any INET transport protocol that has a struct inet_hashinfo
    and is derived from struct inet_connection_sock. This keeps the
    userspace API: older tools just will not display DCCP sockets, while
    newer versions of the tools can support DCCP.

    2. INET generic transport pluggable Congestion Avoidance infrastructure, using
    the current TCP CA infrastructure with DCCP.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • That groups all of the tables and variables associated with the TCP
    timewait scheduling/recycling/killing code, which can now be isolated
    from the TCP-specific code and used by other transport protocols,
    such as DCCP.

    Next changeset will move this code to net/ipv4/inet_timewait_sock.c

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This also improves reqsk_queue_prune and renames it to
    inet_csk_reqsk_queue_prune, as it deals with both inet_connection_sock
    and inet_request_sock objects, not just with request_sock ones, thus
    belonging to inet_request_sock.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • With this we're very close to getting all of the current TCP
    refactorings in my dccp-2.6 tree merged. The next changeset will
    export some functions needed by the current DCCP code, and then
    dccp-2.6.git will be born!

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This also moved inet_iif from tcp to inet_hashtables.h, as it is
    needed by the inet_lookup callers. Perhaps this needs a bit of
    polishing, but for now it seems fine.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Completing the previous changeset, this also generalises tcp_v4_synq_add,
    renaming it to inet_csk_reqsk_queue_hash_add, already being used in the
    DCCP tree, which I plan to merge RSN.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This creates struct inet_connection_sock, moving members out of struct
    tcp_sock that are shareable with other INET connection oriented
    protocols, such as DCCP, that in my private tree already uses most of
    these members.

    The functions that operate on these members were renamed, using an
    inet_csk_ prefix while not yet being moved to a new file, so as to
    ease the review of these changes.
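
    The layering follows the usual INET "struct as base class" pattern;
    a sketch with a couple of illustrative fields (the real struct
    carries many more):

        struct inet_connection_sock {
                /* inet_sock has to be the first member, so one cast
                 * walks the sock -> inet_sock -> inet_connection_sock
                 * hierarchy */
                struct inet_sock icsk_inet;
                unsigned long    icsk_timeout;
                __u8             icsk_retransmits;
        };

        static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
        {
                return (struct inet_connection_sock *)sk;
        }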

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This paves the way to generalise the rest of the sock ID lookup
    routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
    kernels (where IPv6 is always built as a module):

    [root@qemu ~]# grep tw_sock /proc/slabinfo
    tw_sock_TCPv6      0      0    128   31    1
    tw_sock_TCP        0      0     96   41    1
    [root@qemu ~]#

    Now if a protocol wants to use the TIME_WAIT generic infrastructure
    it only has to set the sk_prot->twsk_obj_size field to the size of
    its inet_timewait_sock derived sock, and proto_register will create
    sk_prot->twsk_slab. For now this is only for INET sockets, but we
    can introduce timewait_sock later if some non-INET transport
    protocol wants to use this stuff.
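
    For example (a sketch; the minisock type in the initializer stands
    for "this protocol's inet_timewait_sock derived sock"):

        struct proto tcp_prot = {
                .name          = "TCP",
                /* size of the TIME_WAIT minisock; proto_register()
                 * creates ->twsk_slab ("tw_sock_TCP") from it */
                .twsk_obj_size = sizeof(struct tcp_timewait_sock),
        };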

    Next changesets will take advantage of this new infrastructure to
    generalise even more TCP code.

    [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
    /tmp/before.size: 188646 11764 5068 205478 322a6 net/ipv4/built-in.o
    /tmp/after.size: 188144 11764 5068 204976 320b0 net/ipv4/built-in.o
    [acme@toy net-2.6.14]$

    Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
    (qemu host)).

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Also expose all of the tcp_hashinfo members, i.e. kill those
    tcp_ehash, etc. macros. This will more clearly expose the already
    generic functions and some that need just a bit of work to become
    generic, as we'll see in the upcoming changesets.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This required moving tcp_bucket_cachep to inet_hashinfo.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This should really be in inet_connection_sock, but I'm leaving it
    for a later optimization, when some more fields common to INET
    transport protocols, now in tcp_sk or inet_sk, will be chunked out
    into inet_connection_sock. For now it's better to concentrate on
    getting the changes in the core merged, to leave the DCCP tree with
    only DCCP-specific code.

    Next changesets will take advantage of this move to generalise things
    like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the latter
    receive an inet_hashinfo parameter, and even __tcp_tw_hashdance, etc.,
    in the future, when tcp_tw_bucket gets transformed into the struct
    timewait_sock hierarchy.

    tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets
    moved to sk_prot.

    A cascade of incremental changes will ultimately make the tcp_lookup
    functions be fully generic.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This is to break down the complexity of the series of patches,
    making it very clear that this one just does:

    1. Renames tcp_ prefixed hashtable functions and data structures that
    were already mostly generic to inet_, to share them with DCCP and
    other INET transport protocols.

    2. Removes unused functions (__tb_head & tb_head)

    3. Removes some leftover prototypes in the headers (tcp_bucket_unlock &
    tcp_v4_build_header)

    Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can
    make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port
    and __tcp_put_port generic, and get others like tcp_destroy_sock closer to
    generic (tcp_orphan_count will go to sk->sk_prot to allow this).

    Eventually most of these functions will be used passing the transport protocol
    inet_hashinfo structure.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Remove the "list" member of struct sk_buff, as it is entirely
    redundant. All SKB list removal callers know which list the
    SKB is on, so storing this in sk_buff does nothing other than
    taking up some space.

    Two tricky bits were SCTP, which I took care of, and two ATM
    drivers which Francois Romieu fixed
    up.

    Signed-off-by: David S. Miller
    Signed-off-by: Francois Romieu

    David S. Miller
     

24 Aug, 2005

1 commit

  • The intention of the TCP_NAGLE_PUSH bit is to force pushing of the
    existing send queue when TCP_CORK or TCP_NODELAY state changes via
    setsockopt().

    But it's easy to create a situation where the bit never
    clears. For example, if the send queue starts empty:

    1) set TCP_NODELAY
    2) clear TCP_NODELAY
    3) set TCP_CORK
    4) do small write()

    The current code will leave TCP_NAGLE_PUSH set after that
    sequence. Unconditionally clearing the bit when new data
    is added via skb_entail() solves the problem.
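
    The fix is a one-liner in spirit (sketched here; skb_entail is the
    net/ipv4/tcp.c helper that queues new data):

        /* any new data invalidates a pending forced push */
        tp->nonagle &= ~TCP_NAGLE_PUSH;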

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Jul, 2005

1 commit

  • This is part of the grand scheme to eliminate the qlen
    member of skb_queue_head, and subsequently remove the
    'list' member of sk_buff.

    Most users of skb_queue_len() want to know if the queue is empty or
    not, and that's trivially done with skb_queue_empty(), which doesn't
    use the skb_queue_head->qlen member and instead uses the list itself
    as the emptiness test.
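
    The helper is essentially (per the include/linux/skbuff.h of the
    era):

        static inline int skb_queue_empty(const struct sk_buff_head *list)
        {
                /* the circular list is empty when the head points back
                 * at itself */
                return list->next == (struct sk_buff *)list;
        }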

    Signed-off-by: David S. Miller

    David S. Miller
     

06 Jul, 2005

3 commits

  • Make TSO segment transmit size decisions at send time not earlier.

    The basic scheme is that we try to build as large a TSO frame as
    possible when pulling in the user data, but the size of the TSO frame
    output to the card is determined at transmit time.

    This is guided by tp->xmit_size_goal. It is always set to a multiple
    of MSS and tells sendmsg/sendpage how large an SKB to try and build.

    Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
    necessary and conditions warrant. These routines can also decide to
    "defer" in order to wait for more ACKs to arrive and thus allow larger
    TSO frames to be emitted.

    A general observation is that TSO elongates the pipe, thus requiring a
    larger congestion window and larger buffering especially at the sender
    side. Therefore, it is important that applications 1) get a large
    enough socket send buffer (this is accomplished by our dynamic send
    buffer expansion code) and 2) do large enough writes.
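
    The goal's invariant, sketched (upper_bound is a hypothetical
    stand-in for the device/TSO frame limit; mss_now is the current MSS):

        /* keep the sendmsg/sendpage build target a whole number of
         * MSS-sized segments; transmit time chops it further if the
         * congestion window or deferral logic says so */
        xmit_size_goal = upper_bound - (upper_bound % mss_now);
        tp->xmit_size_goal = xmit_size_goal;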

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Only put user data purely to pages when doing TSO.

    The extra page allocations cause two problems:

    1) They add the overhead of the page allocations themselves.
    2) They make us do small user copies when we get to the end
    of the TCP socket cache page.

    It is still beneficial to purely use pages for TSO,
    so we will do it for that case.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The ideal and most optimal layout for an SKB when doing
    scatter-gather is to put all the headers at skb->data, and
    all the user data in the page array.

    This makes SKB splitting and combining extremely simple,
    especially before a packet goes onto the wire the first
    time.

    So, when sk_stream_alloc_pskb() is given a zero size, make
    sure there is no skb_tailroom(). This is achieved by applying
    SKB_DATA_ALIGN() to the header length used here.

    Next, make select_size() in TCP output segmentation use a
    length of zero when NETIF_F_SG is true on the outgoing
    interface.
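
    A sketch of select_size() as described (paraphrased, not the exact
    hunk):

        static inline int select_size(struct sock *sk, struct tcp_sock *tp)
        {
                int tmp = tp->mss_cache;

                /* scatter-gather capable route: allocate header-only
                 * skbs with no tailroom, so all user data goes into
                 * the page array */
                if (sk->sk_route_caps & NETIF_F_SG)
                        tmp = 0;
                return tmp;
        }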

    Signed-off-by: David S. Miller

    David S. Miller
     

19 Jun, 2005

5 commits

  • When enabled, this should disable UCOPY prequeueing altogether, but
    it does not, due to a missing test.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This chunks out the accept_queue and tcp_listen_opt code and moves
    them to net/core/request_sock.c and include/net/request_sock.h, to
    make it useful for other transport protocols, DCCP being the first one
    to use it.

    Next patches will rename tcp_listen_opt to accept_sock and remove the
    inline tcp functions that just call a reqsk_queue_ function.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Ok, this one just renames some stuff to have a better namespace and
    to disassociate it from TCP:

    struct open_request -> struct request_sock
    tcp_openreq_alloc -> reqsk_alloc
    tcp_openreq_free -> reqsk_free
    tcp_openreq_fastfree -> __reqsk_free

    With this most of the infrastructure closely resembles a struct
    sock methods subset.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Kept this first changeset minimal, without changing existing names,
    to ease peer review.

    Basically tcp_openreq_alloc now receives the or_calltable, which in
    turn has two new members:

    ->slab, which replaces tcp_openreq_cachep
    ->obj_size, which gives the size of the openreq descendant for
    a specific protocol

    The protocol specific fields in struct open_request were moved to a
    class hierarchy, with the things that are common to all connection
    oriented PF_INET protocols in struct inet_request_sock, the TCP ones
    in tcp_request_sock, that is an inet_request_sock, that is an
    open_request.
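
    The hierarchy, sketched with a couple of illustrative fields (the
    real structs carry more):

        struct inet_request_sock {
                struct open_request req;       /* base class; first member */
                u32 loc_addr;                  /* fields common to all */
                u32 rmt_addr;                  /* PF_INET protocols */
        };

        struct tcp_request_sock {
                struct inet_request_sock req;  /* TCP-specific fields follow */
                u32 rcv_isn;
                u32 snt_isn;
        };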

    I.e. this uses the same approach used for the struct sock class
    hierarchy, with sk_prot indicating if the protocol wants to use the
    open_request infrastructure by filling in sk_prot->rsk_prot with an
    or_calltable.

    Results? Performance is improved and TCP v4 now uses only 64 bytes per
    open request minisock, down from 96 without this patch :-)

    Next changeset will rename some of the structs, fields and functions
    mentioned above; struct or_calltable is way unclear, better to name it
    struct request_sock_ops, s/struct open_request/struct request_sock/g,
    etc.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
