13 Jul, 2005

1 commit

  • Revert the nf_reset change that caused so much trouble; drop conntrack
    references manually before packets are queued to packet sockets.

    Signed-off-by: Phil Oester
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Phil Oester
     

12 Jul, 2005

3 commits

  • Move the protocol specific config options out to the specific protocols.
    With this change net/Kconfig now starts to become readable and serve as a
    good basis for further re-structuring.

    The menu structure is left almost intact, except that indentation is
    fixed in most cases. Most visible are the INET changes, where several
    "depends on INET" are replaced with a single ifdef INET / endif pair.

    Several new files were created to accomplish this change; they are
    small, but they serve the purpose of distributing the config options
    out to where they belong.

    Signed-off-by: Sam Ravnborg
    Signed-off-by: David S. Miller

    Sam Ravnborg
     
  • In some cases, we may be generating packets with a source address that
    qualifies as martian. This can happen when we're in the middle of setting
    up the network, and netfilter decides to reject a packet with an RST.
    The IPv4 routing code would try to print a warning and oops, because
    locally generated packets do not have a valid skb->mac.raw pointer
    at this point.
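
    The needed guard is of this shape (a sketch modeled on the link-layer
    header dump in ip_handle_martian_source(); the actual fix may differ
    in detail):

        /* only dump the link-layer header if one is actually present;
         * locally generated packets have no valid skb->mac.raw yet */
        if (dev->hard_header_len && skb->mac.raw) {
                int i;
                unsigned char *p = skb->mac.raw;

                printk(KERN_WARNING "ll header: ");
                for (i = 0; i < dev->hard_header_len; i++, p++)
                        printk("%02x%s", *p,
                               i == dev->hard_header_len - 1 ? "\n" : ":");
        }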

    Signed-off-by: David S. Miller

    Olaf Kirch
     
  • An addition to the last ipvs changes that move
    update_defense_level/si_meminfo to keventd:

    - ip_vs_random_dropentry now runs in process context and should use _bh
    locks to protect from softirqs

    - update_defense_level still needs _bh locks after si_meminfo is called,
    for the same purpose
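
    A minimal sketch of the locking pattern (defense_lock is a
    hypothetical name for whatever lock guards the shared state):

        /* hypothetical lock guarding state shared with softirq context */
        static spinlock_t defense_lock = SPIN_LOCK_UNLOCKED;

        /* process context (keventd): the _bh variants keep softirqs on
         * this CPU from running while the lock is held */
        spin_lock_bh(&defense_lock);
        /* ... update the defense level / drop-entry state ... */
        spin_unlock_bh(&defense_lock);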

    Signed-off-by: Julian Anastasov
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Julian Anastasov
     

09 Jul, 2005

8 commits

  • This patch fixes the multicast group matching for IP_DROP_MEMBERSHIP,
    similar to the IP_ADD_MEMBERSHIP fix in a prior patch. Groups are
    identified by <group address, interface>, and including the interface
    address in the match will fail if a leave-group is done by address
    when the join was done by index, or if different addresses on the
    same interface are used in the join and leave.

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     
  • 1) Adds (INCLUDE, empty)/leave-group equivalence to the full-state
    multicast source filter APIs (IPv4 and IPv6)

    2) Fixes an incorrect errno in the IPv6 leave-group (ENOENT should be
    EADDRNOTAVAIL)
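
    For illustration, with this change an empty INCLUDE-mode filter set
    through the full-state API behaves as a leave-group (a fragment; the
    group address is hypothetical):

        /* fragment: assumes <netinet/in.h>, <arpa/inet.h>, <string.h>;
         * fd is an already-open UDP socket */
        struct ip_msfilter imsf;

        memset(&imsf, 0, sizeof(imsf));
        imsf.imsf_multiaddr.s_addr = inet_addr("239.1.2.3");
        imsf.imsf_interface.s_addr = htonl(INADDR_ANY);
        imsf.imsf_fmode  = MCAST_INCLUDE;
        imsf.imsf_numsrc = 0;

        /* (INCLUDE, empty): now equivalent to IP_DROP_MEMBERSHIP */
        setsockopt(fd, IPPROTO_IP, IP_MSFILTER, &imsf, sizeof(imsf));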

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     
  • 1) In the full-state API, when imsf_numsrc == 0
    the errno should be "0", but EADDRNOTAVAIL is returned

    2) For an illegal filter mode change
    the errno should be EINVAL, but EADDRNOTAVAIL is returned

    3) Trying to use an any-source option without a prior IP_ADD_MEMBERSHIP
    should return EINVAL, but EADDRNOTAVAIL is returned

    4) Adds comments for the less obvious error return values

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     
  • 1) Changes IP_ADD_SOURCE_MEMBERSHIP and MCAST_JOIN_SOURCE_GROUP to ignore
    EADDRINUSE errors on a "courtesy join" -- for these, it does not
    matter whether a prior membership exists.

    2) Adds "leave group" equivalence for (INCLUDE, empty) filters in the
    delta-based API. Without this, mixing delta-based API calls that
    end in an (INCLUDE, empty) filter would not allow a subsequent
    regular IP_ADD_MEMBERSHIP. It also frees the socket buffer memory
    that is no longer needed for the multicast group record and source
    filter.
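
    For illustration, a source-specific join now succeeds even if the
    group was already joined (a fragment; the addresses are hypothetical):

        /* fragment: assumes <netinet/in.h>, <arpa/inet.h>, <string.h>;
         * fd is an already-open UDP socket */
        struct ip_mreq_source mreqs;

        memset(&mreqs, 0, sizeof(mreqs));
        mreqs.imr_multiaddr.s_addr  = inet_addr("239.1.2.3");
        mreqs.imr_sourceaddr.s_addr = inet_addr("192.0.2.1");
        mreqs.imr_interface.s_addr  = htonl(INADDR_ANY);

        /* the implicit "courtesy join" now ignores EADDRINUSE, so this
         * succeeds whether or not the group was already joined */
        setsockopt(fd, IPPROTO_IP, IP_ADD_SOURCE_MEMBERSHIP,
                   &mreqs, sizeof(mreqs));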

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     
  • This patch corrects a few problems with the IP_ADD_MEMBERSHIP
    socket option:

    1) The existing code makes an attempt at reference counting joins when
    using the ip_mreqn/imr_ifindex interface. Joining the same group
    on the same socket is an error, whatever the API. This leads to
    unexpected results when mixing ip_mreqn by index with ip_mreqn by
    address, ip_mreq, or other APIs. For example, ip_mreq followed by
    ip_mreqn of the same group will "work" while the same two reversed
    will not.
    Fixed to always return EADDRINUSE on a duplicate join and
    removed the (now unused) reference count in ip_mc_socklist.

    2) The group-search list in ip_mc_join_group() is comparing a full
    ip_mreqn structure and all of it must match for it to find the
    group. This doesn't correctly match a group that was joined with
    ip_mreq or ip_mreqn with an address (with or without an index). It
    also doesn't match groups that are joined by different addresses on
    the same interface. All of these are the same multicast group,
    which is identified by group address and interface index.
    Fixed the check to correctly match groups so we don't get
    duplicate group entries on the ip_mc_socklist.

    3) The old code allocates a multicast address before searching for
    duplicates, requiring it to be freed in various error cases. This
    patch moves the allocation until after the search and the
    igmp_max_memberships check, so there is never a need to allocate and
    then free an entry.
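
    To illustrate 1) above, the following two joins now name the same
    <group, interface> and the second correctly fails with EADDRINUSE
    (a fragment; the address and index are hypothetical and assumed to
    refer to the same interface):

        /* fragment: assumes <netinet/in.h>, <arpa/inet.h>, <string.h>;
         * fd is an already-open UDP socket */
        struct ip_mreq  mreq;
        struct ip_mreqn mreqn;

        memset(&mreq, 0, sizeof(mreq));
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.2.3");
        mreq.imr_interface.s_addr = inet_addr("192.0.2.10");

        memset(&mreqn, 0, sizeof(mreqn));
        mreqn.imr_multiaddr.s_addr = inet_addr("239.1.2.3");
        mreqn.imr_ifindex = 2;  /* same interface, this time by index */

        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        /* duplicate join of the same group: now fails with EADDRINUSE */
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreqn, sizeof(mreqn));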

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     
  • This was the full intention of the original code.

    Signed-off-by: David S. Miller

    Alexey Kuznetsov
     
  • From: Victor Fusco

    Fix the sparse warning "implicit cast to nocast type"

    Signed-off-by: Victor Fusco
    Signed-off-by: Domen Puncer
    Signed-off-by: David S. Miller

    Victor Fusco
     
  • This is part of the grand scheme to eliminate the qlen
    member of skb_queue_head, and subsequently remove the
    'list' member of sk_buff.

    Most users of skb_queue_len() only want to know whether the queue is
    empty or not, and that is trivially done with skb_queue_empty(),
    which doesn't use the skb_queue_head->qlen member and instead uses
    the queue list's emptiness as the test.
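
    A typical conversion looks like this (illustrative):

        /* before: consults skb_queue_head->qlen */
        if (skb_queue_len(&sk->sk_receive_queue) == 0)
                return;

        /* after: tests the list head's emptiness directly */
        if (skb_queue_empty(&sk->sk_receive_queue))
                return;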

    Signed-off-by: David S. Miller

    David S. Miller
     

06 Jul, 2005

24 commits

  • Congestion window recover after loss depends upon the fact
    that if we have a full MSS sized frame at the head of the
    send queue, we will send it. TSO deferral can defeat the
    ACK clocking necessary to exit cleanly from recovery.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Make TSO segment transmit size decisions at send time not earlier.

    The basic scheme is that we try to build as large a TSO frame as
    possible when pulling in the user data, but the size of the TSO frame
    output to the card is determined at transmit time.

    This is guided by tp->xmit_size_goal. It is always set to a multiple
    of MSS and tells sendmsg/sendpage how large an SKB to try and build.

    Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
    necessary and conditions warrant. These routines can also decide to
    "defer" in order to wait for more ACKs to arrive and thus allow larger
    TSO frames to be emitted.

    A general observation is that TSO elongates the pipe, thus requiring a
    larger congestion window and larger buffering especially at the sender
    side. Therefore, it is important that applications 1) get a large
    enough socket send buffer (this is accomplished by our dynamic send
    buffer expansion code) 2) do large enough writes.
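
    A simplified sketch of how the goal is derived (tso_ok and
    header_overhead are illustrative stand-ins, not the real kernel
    names):

        u32 goal = mss_now;             /* non-TSO: one MSS at a time */

        if (tso_ok) {
                goal = 65535 - header_overhead; /* largest TSO frame */
                goal -= goal % mss_now;         /* round down to an MSS multiple */
        }
        tp->xmit_size_goal = goal;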

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This makes it easier to understand, and allows easier
    tweaking of the heuristic later on.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In tcp_clean_rtx_queue(), if the TSO packet is not even partially
    acked, do not waste time calling tcp_tso_acked().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Everything stated there is out of date. tcp_trim_head()
    does adjust the available socket send buffer space and
    skb->truesize now.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Only put user data purely into pages when doing TSO.

    The extra page allocations cause two problems:

    1) They add the overhead of the page allocations themselves.
    2) They make us do small user copies when we get to the end
    of the TCP socket cache page.

    It is still beneficial to purely use pages for TSO,
    so we will do it for that case.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • tcp_snd_test() is run for every packet output by a single
    call to tcp_write_xmit(), but this is not necessary.

    For one, the congestion window space needs to be calculated
    only once, and can then be used throughout the duration of
    the loop.

    This cleanup also makes experimenting with different TSO
    packetization schemes much easier.
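
    Roughly, the sending loop changes shape like this (a sketch; the
    helper names follow the new code's intent, with tp and mss_now as in
    tcp_write_xmit(), but details are simplified):

        struct sk_buff *skb = sk->sk_send_head;
        unsigned int cwnd_quota;

        if (!skb)
                return 0;

        /* the congestion window space is computed once, up front,
         * and then consumed as packets are emitted */
        cwnd_quota = tcp_cwnd_test(tp, skb);

        while (cwnd_quota && tcp_snd_wnd_test(tp, skb, mss_now)) {
                if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
                        break;
                update_send_head(sk, tp, skb);
                cwnd_quota -= tcp_skb_pcount(skb);
                skb = sk->sk_send_head;
                if (!skb)
                        break;
        }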

    Signed-off-by: David S. Miller

    David S. Miller
     
  • tcp_snd_test() does several different things, use inline
    functions to express this more clearly.

    1) It initializes the TSO count of SKB, if necessary.
    2) It performs the Nagle test.
    3) It makes sure the congestion window is adhered to.
    4) It makes sure SKB fits into the send window.

    This cleanup also sets things up so that things like the number of
    packets allowed by the congestion window do not need to be calculated
    multiple times by packet sending loops such as tcp_write_xmit().
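
    A sketch of the decomposition (the helper names mirror the four steps
    above; the real signatures may differ):

        static unsigned int tcp_snd_test(struct sock *sk, struct sk_buff *skb,
                                         unsigned int cur_mss, int nonagle)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                tcp_init_tso_segs(sk, skb);             /* 1) TSO count */

                if (!tcp_nagle_test(tp, skb, cur_mss, nonagle))
                        return 0;                       /* 2) Nagle */
                if (!tcp_cwnd_test(tp, skb))
                        return 0;                       /* 3) congestion window */

                return tcp_snd_wnd_test(tp, skb, cur_mss); /* 4) send window */
        }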

    Signed-off-by: David S. Miller

    David S. Miller
     
  • 'nonagle' should be passed to the tcp_snd_test() function
    as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the
    tail of the write_queue. This is because Nagle does not
    apply to such frames since we cannot possibly tack more
    data onto them.

    However, while doing this __tcp_push_pending_frames() makes
    all of the packets in the write_queue use this modified
    'nonagle' value.

    Fix the bug and simplify this function by just calling
    tcp_write_xmit() directly if sk_send_head is non-NULL.

    As a result, we can now make tcp_data_snd_check() just call
    tcp_push_pending_frames() instead of the specialized
    __tcp_data_snd_check().
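
    The per-SKB Nagle choice can then be sketched as (simplified):

        /* Nagle only applies to the tail of the write queue; earlier
         * SKBs cannot have more data tacked on, so force them out */
        if (tcp_snd_test(sk, skb, cur_mss,
                         tcp_skb_is_last(sk, skb) ? nonagle
                                                  : TCP_NAGLE_PUSH)) {
                /* ... transmit skb ... */
        }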

    Signed-off-by: David S. Miller

    David S. Miller
     
  • tcp_write_xmit() uses tcp_current_mss(), but some of its callers,
    namely __tcp_push_pending_frames(), already have this value
    available.

    While we're here, fix the "cur_mss" argument to be "unsigned int"
    instead of plain "unsigned".

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Put the main basic block of work at the top-level of
    tabbing, and mark the TCP_CLOSE test with unlikely().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The tcp_cwnd_validate() function should only be invoked
    if we actually send some frames, yet __tcp_push_pending_frames()
    will always invoke it. tcp_write_xmit() does the call for us,
    so the call here can simply be removed.

    Also, tcp_write_xmit() can be marked static.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When we add any new packet to the TCP socket write queue,
    we must call skb_header_release() on it in order for the
    TSO sharing checks in the drivers to work.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • It reimplements portions of tcp_snd_check(), so if
    we move it to tcp_output.c we can consolidate its
    logic much more easily in a later change.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This just moves the code into tcp_output.c, no code logic changes are
    made by this patch.

    Using this as a baseline, we can begin to untangle the mess of
    comparisons for the Nagle test et al. We will also be able to reduce
    all of the redundant computation that occurs when outputting data
    packets.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • On each packet output, we call tcp_dec_quickack_mode()
    if the ACK flag is set. It drops tp->ack.quick until
    it hits zero, at which time we deflate the ATO value.

    When doing TSO, we are emitting multiple packets with
    the ACK flag set, so we should decrement tp->ack.quick
    by that many segments.

    Note that, unlike this case, tcp_enter_cwr() should not
    take tcp_skb_pcount(skb) into consideration. That
    function readjusts tp->snd_cwnd just once, and moves
    into the TCP_CA_CWR state.
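
    A sketch of the adjusted decrement, modeled on the 2.6.12-era helper
    (treat the details as illustrative):

        static inline void tcp_dec_quickack_mode(struct sock *sk,
                                                 unsigned int pkts)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                if (tp->ack.quick) {
                        if (pkts >= tp->ack.quick) {
                                tp->ack.quick = 0;
                                /* leaving quickack mode deflates the ATO */
                                tp->ack.ato = TCP_ATO_MIN;
                        } else
                                tp->ack.quick -= pkts;
                }
        }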

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The ideal and most optimal layout for an SKB when doing
    scatter-gather is to put all the headers at skb->data, and
    all the user data in the page array.

    This makes SKB splitting and combining extremely simple,
    especially before a packet goes onto the wire the first
    time.

    So, when sk_stream_alloc_pskb() is given a zero size, make
    sure there is no skb_tailroom(). This is achieved by applying
    SKB_DATA_ALIGN() to the header length used here.

    Next, make select_size() in TCP output segmentation use a
    length of zero when NETIF_F_SG is true on the outgoing
    interface.
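
    A sketch of the resulting select_size() logic (simplified):

        static inline int select_size(struct sock *sk, struct tcp_sock *tp)
        {
                int tmp = tp->mss_cache;

                /* scatter-gather: put all user data into the page array,
                 * so ask for zero bytes of skb tailroom */
                if (sk->sk_route_caps & NETIF_F_SG)
                        tmp = 0;

                return tmp;
        }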

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Below is a patch to preallocate memory when resizing the trie
    (inflate/halve). If preallocation fails, we simply skip the resize of
    this tnode for this time.

    The oops we got when killing bgpd (with full routing) is now gone.
    Patrick's memory patch is also used.
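
    The approach, in sketch form (tnode_alloc() and the surrounding names
    are illustrative):

        struct tnode *new;

        /* preallocate the resized node before touching the old one */
        new = tnode_alloc(new_child_count);
        if (!new)
                return old;     /* preallocation failed: skip this resize */

        /* safe now: copy the entries into 'new' and install it */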

    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Robert Olsson
     
  • - rt_check_expire() fixes (an overflow occurred if the size of the
    hash was >= 65536)

    reminder of the bugfix:

    The rt_check_expire() has a serious problem on machines with large
    route caches, and a standard HZ value of 1000.

    With default values, i.e. ip_rt_gc_interval = 60*HZ = 60000, the
    loop count:

    for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;

    overflows (t is a 31-bit value) as soon as rt_hash_log is >= 16
    (65536 slots in the route cache hash table).

    In this case, rt_check_expire() does nothing at all
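
    The overflow can be avoided by doing the arithmetic in 64 bits, along
    these lines (a sketch of the approach):

        unsigned int goal;
        u64 mult = (u64)ip_rt_gc_interval << rt_hash_log;

        do_div(mult, ip_rt_gc_timeout);
        goal = (unsigned int)mult;      /* buckets to scan in this run */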

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • - rt hash table allocated using alloc_large_system_hash() function.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • - Locking abstraction
    - Spinlocks moved out of the rt hash table: 50% less memory used by
    the rt hash table; it's a win even on UP.
    - Sizing of the spinlock table depends on NR_CPUS
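
    The layout can be sketched as follows (sizes simplified; a real table
    would also be initialized with spin_lock_init() at boot):

        #if NR_CPUS >= 32
        #define RT_HASH_LOCK_SZ 4096
        #else
        #define RT_HASH_LOCK_SZ 256
        #endif

        static spinlock_t rt_hash_locks[RT_HASH_LOCK_SZ];
        #define rt_hash_lock_addr(slot) \
                (&rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)])

        /* many buckets share one lock instead of every bucket
         * embedding its own spinlock */
        spin_lock_bh(rt_hash_lock_addr(hash));
        /* ... walk or modify rt_hash_table[hash].chain ... */
        spin_unlock_bh(rt_hash_lock_addr(hash));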

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Inflating a node a couple of times makes it exceed the 128k kmalloc limit.
    Use __get_free_pages for allocations > PAGE_SIZE, as in fib_hash.
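
    The allocation pattern is roughly (the helper name is illustrative):

        static void *tnode_alloc_mem(size_t size)
        {
                if (size <= PAGE_SIZE)
                        return kmalloc(size, GFP_KERNEL);

                /* past the kmalloc limit: allocate whole pages instead */
                return (void *)__get_free_pages(GFP_KERNEL,
                                                get_order(size));
        }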

    Signed-off-by: Patrick McHardy
    Acked-by: Robert Olsson
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Makes IPv4 ip_rcv registration happen last in af_inet.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

29 Jun, 2005

4 commits

  • In 2.6.12 we started dropping the conntrack reference when a packet
    leaves the IP layer. This broke connection tracking on a bridge,
    because bridge-netfilter defers calling some NF_IP_* hooks to the bridge
    layer for locally generated packets going out a bridge, where the
    conntrack reference is no longer available. This patch keeps the
    reference in this case as a temporary solution; long term, we will
    remove the deferred hook calling. No attempt is made to drop the
    reference in the bridge code when it is no longer needed, since tc
    actions could already have sent the packet anywhere.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • In an SMP system, it is possible for a connection timer to expire,
    calling ip_vs_conn_expire while the connection table is being flushed,
    before ct_write_lock_bh is acquired.

    Since the list iterator loop in ip_vs_conn_flush releases and re-acquires
    the spinlock (even though it doesn't re-enable softirqs), it is possible
    for the expiration function to modify the connection list while it is
    being traversed in ip_vs_conn_flush.

    The result is that the next pointer gets set to NULL, and subsequently
    dereferenced, resulting in an oops.

    Signed-off-by: Neil Horman
    Acked-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Neil Horman
     
  • This should help speed up the insertion... but the resize is more
    crucial and complex, and needs some thinking.

    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Robert Olsson
     
  • I think there is a small bug in ipconfig.c in case IPCONFIG_DHCP is set
    and DHCP is used.

    When a DHCPOFFER is received, the IP address is kept until we get a
    DHCPACK. If no ACK is received, ic_dynamic() returns negatively, but
    leaves the offered IP address in ic_myaddr.

    This makes the main loop in ip_auto_config() break and use the
    possibly incomplete configuration.

    Not sure if it's the best way to do it, but the following trivial
    patch corrects this.
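
    The shape of the fix (a sketch; ic_got_reply, ic_myaddr and NONE are
    names from ipconfig.c, but the exact patch may differ):

        /* in ic_dynamic(): no DHCPACK ever arrived, so do not leave
         * the offered address behind for ip_auto_config() to use */
        if (!ic_got_reply) {
                ic_myaddr = NONE;
                return -1;
        }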

    Signed-off-by: Maxime Bizon
    Signed-off-by: David S. Miller

    Maxime Bizon