28 Oct, 2005

2 commits

  • Linus Torvalds
     
  • This bug is responsible for causing the infamous "Treason uncloaked"
    messages that have been popping up everywhere since the printk was added.
    They have usually been blamed on foreign operating systems. However,
    some of those reports implicate Linux, since both systems are running
    Linux or the TCP connection is going across the loopback interface.

    In fact, there really is a bug in the Linux TCP header prediction code
    that's been there since at least 2.1.8. This bug was tracked down with
    help from Dale Blount.

    The effect of this bug ranges from harmless "Treason uncloaked"
    messages to hung/aborted TCP connections. The details of the bug
    and fix are as follows (a small user-space sketch follows below).

    When snd_wnd is updated, we only update pred_flags if
    tcp_fast_path_check succeeds. When it fails (for example,
    when our rcvbuf is used up), we will leave pred_flags with
    an out-of-date snd_wnd value.

    When the out-of-date pred_flags happens to match the next incoming
    packet, we will again hit the fast path and use the current snd_wnd,
    which will be wrong.

    In the case of the treason messages, it just happens that the snd_wnd
    cached in pred_flags is zero while tp->snd_wnd is non-zero. Therefore
    when a zero-window packet comes in we incorrectly conclude that the
    window is non-zero.

    In fact if the peer continues to send us zero-window pure ACKs we
    will continue making the same mistake. It's only when the peer
    transmits a zero-window packet with data attached that we get a
    chance to snap out of it. This is what triggers the treason
    message at the next retransmit timeout.

    Signed-off-by: Herbert Xu
    Signed-off-by: Arnaldo Carvalho de Melo

    Herbert Xu
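    A user-space sketch of the mechanism described above (this is not
    kernel code: the layout of the prediction word is simplified and the
    helper is a made-up stand-in for __tcp_fast_path_on()). It shows how a
    window value cached in pred_flags can go stale and let a zero-window
    ACK take the fast path while tp->snd_wnd still claims an open window:

        #include <stdio.h>
        #include <stdint.h>

        #define FLAG_ACK 0x00100000u

        /* Pack header length, the ACK flag and the *expected* window into
         * one prediction word (simplified layout). */
        static uint32_t pack_pred_flags(unsigned int hdr_words, uint32_t wnd)
        {
            return (hdr_words << 26) | FLAG_ACK | (wnd & 0xffff);
        }

        int main(void)
        {
            /* pred_flags was last refreshed while the peer's window was 0. */
            uint32_t pred_flags = pack_pred_flags(5, 0);

            /* snd_wnd has since been updated, but tcp_fast_path_check()
             * failed (e.g. rcvbuf used up), so pred_flags was left stale. */
            uint32_t snd_wnd = 32768;

            /* A zero-window pure ACK arrives: it matches the stale
             * pred_flags, so the fast path is taken and we keep believing
             * the (wrong) non-zero snd_wnd. */
            uint32_t incoming = pack_pred_flags(5, 0);

            if (incoming == pred_flags)
                printf("fast path hit, window assumed to be %u\n",
                       (unsigned int)snd_wnd);
            return 0;
        }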
     

26 Oct, 2005

5 commits

  • Fix setting of the broadcast address when the netmask is set via
    SIOCSIFNETMASK in Linux 2.6. The code wanted the old value of
    ifa->ifa_mask but used it after it had already been overwritten with
    the new value.

    Signed-off-by: David Engel
    Signed-off-by: Arnaldo Carvalho de Melo

    David Engel
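    A user-space sketch of the ordering issue (the struct and helper here
    are simplified stand-ins, not the actual devinet.c code): the default
    broadcast address has to be recognised against the old mask before
    ifa_mask is overwritten with the new one:

        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>

        struct fake_ifa {
            uint32_t ifa_address;    /* all fields in network byte order */
            uint32_t ifa_mask;
            uint32_t ifa_broadcast;
        };

        static void set_netmask(struct fake_ifa *ifa, uint32_t new_mask)
        {
            uint32_t old_mask = ifa->ifa_mask;   /* save BEFORE overwriting */

            ifa->ifa_mask = new_mask;

            /* Only rewrite the broadcast if it was the default derived from
             * the old mask; the bug was reading ifa->ifa_mask here after it
             * already held the new value. */
            if (ifa->ifa_broadcast == (ifa->ifa_address | ~old_mask))
                ifa->ifa_broadcast = ifa->ifa_address | ~new_mask;
        }

        int main(void)
        {
            struct fake_ifa ifa = {
                .ifa_address   = inet_addr("192.168.1.10"),
                .ifa_mask      = inet_addr("255.255.255.0"),
                .ifa_broadcast = inet_addr("192.168.1.255"),
            };
            struct in_addr b;

            set_netmask(&ifa, inet_addr("255.255.0.0"));
            b.s_addr = ifa.ifa_broadcast;
            printf("broadcast is now %s\n", inet_ntoa(b)); /* 192.168.255.255 */
            return 0;
        }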
     
  • skb_prev is assigned from skb, which cannot be NULL. This patch removes the
    unnecessary NULL check.

    Signed-off-by: Jayachandran C.
    Acked-by: James Morris
    Signed-off-by: Arnaldo Carvalho de Melo

    Jayachandran C
     
  • This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c.
    As this dereference directly follows a list_for_each_entry_rcu line, we
    have already taken a read barrier with respect to getting an entry from
    the list.

    This read barrier guarantees that all values read out of fa are valid.
    In particular, the contents of the structure pointed to by fa->fa_info
    are initialised before fa->fa_info is actually set (see fn_trie_insert);
    the setting of fa->fa_info itself is further separated by a write
    barrier from the insertion of fa into the list.

    Therefore by taking a read barrier after obtaining fa from the list
    (which is given by list_for_each_entry_rcu), we can be sure that
    fa->fa_info contains a valid pointer, as well as the fact that the
    data pointed to by fa->fa_info is itself valid.

    Signed-off-by: Herbert Xu
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller
    Signed-off-by: Arnaldo Carvalho de Melo

    Herbert Xu
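    A simplified kernel-style sketch of the pattern (not a standalone
    program; the fib_trie context is abbreviated): the dereference done
    inside list_for_each_entry_rcu() already provides the ordering needed
    to read fa->fa_info directly:

        struct fib_alias *fa;

        rcu_read_lock();
        list_for_each_entry_rcu(fa, fa_head, fa_list) {
            /* fa was fetched through the iterator's own rcu_dereference(),
             * so everything published before fa was inserted, including
             * fa->fa_info and the fib_info it points to, is already
             * visible.  A second rcu_dereference(fa->fa_info) adds nothing. */
            struct fib_info *fi = fa->fa_info;

            if (fi == NULL)
                continue;
            /* ... compare fi->fib_priority, fi->fib_protocol, etc. ... */
        }
        rcu_read_unlock();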
     
  • It's fairly simple to resize the hash table, but currently you need to
    remove and reinsert the module. That's bad (we lose connection
    state). Harald has even offered to write a daemon which sets this
    based on load.

    Signed-off-by: Rusty Russell
    Signed-off-by: Harald Welte
    Signed-off-by: Arnaldo Carvalho de Melo

    Harald Welte
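    A sketch of the general technique (names and the default size are
    illustrative, not the literal ip_conntrack patch), using the 2.6-era
    module_param_call() so the value can be written at runtime through
    /sys/module/.../parameters/hashsize:

        #include <linux/kernel.h>
        #include <linux/module.h>
        #include <linux/moduleparam.h>

        static unsigned int hashsize = 16384;

        static int set_hashsize(const char *val, struct kernel_param *kp)
        {
            unsigned int new_size = simple_strtoul(val, NULL, 0);

            if (!new_size)
                return -EINVAL;

            /* Allocate the new table, take the conntrack lock, move every
             * existing entry to its new bucket, publish the new table and
             * free the old one -- connection state survives the resize.
             * (resize_hash_table() is a placeholder, not a real function.) */
            /* err = resize_hash_table(new_size); */

            hashsize = new_size;
            return 0;
        }

        module_param_call(hashsize, set_hashsize, param_get_uint,
                          &hashsize, 0600);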
     
  • In 'net', change the explicit use of for-loops and NR_CPUS into the
    general for_each_cpu() or for_each_online_cpu() constructs, as
    appropriate. This widens the scope of potential future optimizations
    of the general constructs, and also takes advantage of the existing
    optimizations of first_cpu() and next_cpu(), which is advantageous
    when the true CPU count is much smaller than NR_CPUS.

    Signed-off-by: John Hawkes
    Signed-off-by: David S. Miller
    Signed-off-by: Arnaldo Carvalho de Melo

    John Hawkes
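    The pattern being converted looks roughly like this (variable names
    are illustrative; in the 2.6.14-era kernel for_each_cpu() walked all
    possible CPUs):

        /* before: scans every slot up to NR_CPUS, even on small machines */
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            if (!cpu_possible(cpu))
                continue;
            total += per_cpu(counter, cpu);
        }

        /* after: visits only possible (or online) CPUs and benefits from
         * the optimized first_cpu()/next_cpu() implementations */
        for_each_cpu(cpu)
            total += per_cpu(counter, cpu);

        for_each_online_cpu(cpu)
            total += per_cpu(counter, cpu);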
     

23 Oct, 2005

1 commit

  • IPVS used the NFC_IPVS_PROPERTY flag in nfcache, but now that nfcache has
    been removed, the replacement flag 'ipvs_property' still needs to be
    copied. This patch should be included in 2.6.14.

    Further comments from Harald Welte:

    Sorry, seems like the bug was introduced by me.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Harald Welte
    Signed-off-by: Arnaldo Carvalho de Melo

    Julian Anastasov
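    The kind of change involved, sketched (the function name and placement
    are simplified; the real change sits in the skb copy/clone paths):

        static void copy_ipvs_bits(struct sk_buff *new,
                                   const struct sk_buff *old)
        {
        #ifdef CONFIG_IP_VS
            /* ipvs_property replaced the NFC_IPVS_PROPERTY bit that used
             * to live in nfcache, so it must be carried over on clone/copy
             * just as nfcache used to be */
            new->ipvs_property = old->ipvs_property;
        #endif
        }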
     

21 Oct, 2005

1 commit

  • It is legitimate to call tcp_fragment with len == skb->len since
    that is done for FIN packets and the FIN flag counts as one byte.
    So we should only check for the len > skb->len case.

    Signed-off-by: Herbert Xu
    Signed-off-by: Arnaldo Carvalho de Melo

    Herbert Xu
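    In code terms the check becomes (a sketch, simplified from
    tcp_fragment()):

        /* len == skb->len is legal: a FIN occupies one unit of sequence
         * space but carries no data, so only the strictly-greater case
         * is an error */
        if (unlikely(len > skb->len))
            return -EINVAL;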
     

14 Oct, 2005

2 commits


13 Oct, 2005

1 commit


11 Oct, 2005

11 commits


09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change the
    generated code (from gcc's point of view we replaced unsigned int with
    a typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
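    The shape of the change, with an illustrative prototype (not one of
    the specific functions touched by the patch):

        typedef unsigned int __nocast gfp_t;

        /* before: the annotation is spelled out at every use */
        void *example_alloc(size_t size, unsigned int __nocast gfp_flags);

        /* after: identical sparse warnings and generated code, but the
         * intent is visible in the type name */
        void *example_alloc(size_t size, gfp_t gfp_flags);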
     

06 Oct, 2005

1 commit


05 Oct, 2005

3 commits

  • From: Randy Dunlap

    Fix implicit nocast warnings in ip_vs code:
    net/ipv4/ipvs/ip_vs_app.c:631:54: warning: implicit cast to nocast type

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     
  • Signed-off-by: Horst H. von Brand
    Signed-off-by: David S. Miller

    Horst H. von Brand
     
  • The patch below introduces special thresholds to keep the root node of the
    trie large. This gives a flatter tree at the cost of a modest memory
    increase. Overall it seems to be a gain, and this was also proposed by one
    of the authors of the paper in a recent seminar.

    Main table after loading 123 k routes.

    Aver depth: 3.30
    Max depth: 9
    Root-node size 12 bits
    Total size: 4044 kB

    With the patch:
    Aver depth: 2.78
    Max depth: 8
    Root-node size 15 bits
    Total size: 4150 kB

    An increase of 8-10% in forwarding performance was seen under an rDoS attack.

    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Robert Olsson
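    A sketch of the idea (the threshold values are illustrative, not the
    numbers from the patch): the root node gets its own pair of resize
    thresholds so it is inflated more eagerly and halved more reluctantly
    than interior nodes, which keeps it large:

        static const int inflate_threshold      = 50;
        static const int halve_threshold        = 25;
        static const int inflate_threshold_root = 15;  /* inflate sooner */
        static const int halve_threshold_root   = 10;  /* halve later    */

        /* pick the thresholds used when resizing a node */
        static void pick_thresholds(int is_root, int *inflate_thr,
                                    int *halve_thr)
        {
            *inflate_thr = is_root ? inflate_threshold_root : inflate_threshold;
            *halve_thr   = is_root ? halve_threshold_root   : halve_threshold;
        }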
     

04 Oct, 2005

5 commits

  • It's not a good idea to be smurf'able by default.
    The few people who need this can turn it on.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
    introduces __in_dev_get_rcu() to cover the second case.

    1) RCU with refcnt should use in_dev_get().
    2) RCU without refcnt should use __in_dev_get_rcu().
    3) All others must hold RTNL and use __in_dev_get_rtnl().

    There is one exception in net/ipv4/route.c which is in fact a pre-existing
    race condition. I've marked it as such so that we remember to fix it.

    This patch is based on suggestions and prior work by Suzanne Wood and
    Paul McKenney.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
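    The three cases translate into usage patterns roughly like these
    (kernel-style sketch, context abbreviated):

        /* 1) RCU with refcnt: the reference may outlive the read side */
        struct in_device *in_dev = in_dev_get(dev);
        if (in_dev) {
            /* ... may be used (and even slept on) later ... */
            in_dev_put(in_dev);
        }

        /* 2) RCU without refcnt: must stay inside the read-side section */
        rcu_read_lock();
        in_dev = __in_dev_get_rcu(dev);
        if (in_dev) {
            /* ... use in_dev before rcu_read_unlock() ... */
        }
        rcu_read_unlock();

        /* 3) Everyone else must hold the RTNL */
        ASSERT_RTNL();
        in_dev = __in_dev_get_rtnl(dev);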
     
  • Meelis Roos wrote:
    > RK> My firewall setup relies on proxyarp working. However, with 2.6.14-rc3,
    > RK> it appears to be completely broken. The firewall is 212.18.232.186,
    >
    > Same here with some kernel between 14-rc2 and 14-rc3 - no response to
    > ARP on a proxyarp gateway. Sorry, no exact revision and no more debugging
    > yet since it's a production gateway.

    The breakage is caused by the change to use the CB area for flagging
    whether a packet has been queued due to proxy_delay. This area gets
    cleared every time arp_rcv gets called. Unfortunately packets delayed
    due to proxy_delay also go through arp_rcv when they are reprocessed.

    In fact, I can't think of a reason why delayed proxy packets should go
    through netfilter again at all. So the easiest solution is to bypass
    that and go straight to arp_process.

    This is essentially what would've happened before netfilter support
    was added to ARP.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
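    A sketch of the approach (the helper name and its wiring are
    illustrative, not the literal patch): when a packet that was queued for
    proxy_delay is reprocessed, it bypasses the netfilter ARP hook and goes
    straight to arp_process(), so arp_rcv() cannot clear the skb CB area a
    second time:

        static void arp_redo_delayed(struct sk_buff *skb)
        {
            /* The packet already traversed NF_HOOK(NF_ARP, NF_ARP_IN, ...)
             * before it was queued; running arp_rcv() again would wipe the
             * CB flag marking it as a delayed proxy packet. */
            arp_process(skb);
        }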
     
  • Arnaldo and I agreed it could be applied now, because I have other
    pending patches depending on this one (thank you, Arnaldo).

    (The other important patch moves skc_refcnt into a separate cache line,
    so that SMP/NUMA performance doesn't suffer from cache-line ping-pong.)

    1) First, some performance data:
    --------------------------------

    tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established().

    The most time-critical code is:

    sk_for_each(sk, node, &head->chain) {
        if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
            goto hit; /* You sunk my battleship! */
    }

    The sk_for_each() does use prefetch() hints, but only the beginning of
    "struct sock" is prefetched.

    As INET_MATCH's first comparison uses inet_sk(__sk)->daddr, which is far
    away from the beginning of "struct sock", it has to bring a cold cache
    line into the CPU cache. Each iteration has to touch at least 2 cache
    lines.

    This can be problematic if some chains are very long.

    2) The goal
    -----------

    The idea I had is to change things so that INET_MATCH() can return
    FALSE in 99% of cases using only data already in the CPU cache,
    i.e. one cache line per iteration.

    3) Description of the patch
    ---------------------------

    Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
    filling a 32-bit hole on 64-bit platforms.

    struct sock_common {
            unsigned short          skc_family;
            volatile unsigned char  skc_state;
            unsigned char           skc_reuse;
            int                     skc_bound_dev_if;
            struct hlist_node       skc_node;
            struct hlist_node       skc_bind_node;
            atomic_t                skc_refcnt;
    +       unsigned int            skc_hash;
            struct proto            *skc_prot;
    };

    Store the full hash in this 32-bit field, not masked by (ehash_size - 1).
    Using this full hash as the first comparison done in INET_MATCH lets us
    immediately skip the element without touching a second cache line in
    case of a miss.

    Suppress the sk_hashent/tw_hashent fields, since skc_hash (aliased to
    sk_hash and tw_hash) already contains the slot number if we mask it with
    (ehash_size - 1).

    File include/net/inet_hashtables.h

    64-bit platforms:

    #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
            (((__sk)->sk_hash == (__hash)) &&                               \
             ((*((__u64 *)&(inet_sk(__sk)->daddr))) == (__cookie)) &&       \
             ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) &&        \
             (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

    32-bit platforms:

    #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
            (((__sk)->sk_hash == (__hash)) &&                               \
             (inet_sk(__sk)->daddr == (__saddr)) &&                         \
             (inet_sk(__sk)->rcv_saddr == (__daddr)) &&                     \
             (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

    - Adds a prefetch(head->chain.first) in
    __inet_lookup_established()/__tcp_v4_check_established(),
    __inet6_lookup_established()/__tcp_v6_check_established() and
    __dccp_v4_check_established() to bring the first element of the list
    into cache before the {read|write}_lock(&head->lock).

    Signed-off-by: Eric Dumazet
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • I've found the problem in general. It affects any 64-bit
    architecture. The problem occurs when you change the system time.

    Suppose that when you boot, your system clock is ahead by a day.
    This gets recorded in skb_tv_base. You then wind the clock back
    by a day. From that point onwards the offset will be negative, which
    essentially overflows the 32-bit variables it is stored in.

    In fact, why don't we just store the real time stamp in those 32-bit
    variables? After all, we're not going to overflow for quite a while
    yet.

    When we do overflow, we'll need a better solution of course.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
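    A user-space illustration of the arithmetic (not kernel code): a
    negative offset wraps when stored in an unsigned 32-bit field, while
    the absolute seconds value still fits for a long time:

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        int main(void)
        {
            time_t base = time(NULL) + 86400; /* clock was a day fast at boot */
            time_t now  = time(NULL);         /* ... and has been wound back  */

            uint32_t offset_sec = (uint32_t)(now - base); /* wraps to ~2^32  */
            uint32_t stamp_sec  = (uint32_t)now;          /* fine until 2106 */

            printf("offset stored as %u, absolute stamp %u\n",
                   (unsigned int)offset_sec, (unsigned int)stamp_sec);
            return 0;
        }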
     

30 Sep, 2005

2 commits

  • From: Alexey Kuznetsov

    Handle better the case where the sender sends full sized
    frames initially, then moves to a mode where it trickles
    out small amounts of data at a time.

    This known problem is even mentioned in the comments
    above tcp_grow_window() in tcp_input.c, specifically:

    ...
    * The scheme does not work when sender sends good segments opening
    * window and then starts to feed us spagetti. But it should work
    * in common situations. Otherwise, we have to rely on queue collapsing.
    ...

    When the sender gives full sized frames, the "struct sk_buff" overhead
    from each packet is small. So we'll advertise a larger window.
    If the sender moves to a mode where small segments are sent, this
    ratio becomes tilted to the other extreme and we start overrunning
    the socket buffer space.

    tcp_clamp_window() tries to address this, but its clamping of
    tp->window_clamp is a wee bit too aggressive for this particular case.

    Fix confirmed by Ion Badulescu.

    Signed-off-by: David S. Miller

    Alexey Kuznetsov
     
  • But retain the comment fix.

    Alexey Kuznetsov has explained the situation as follows:

    --------------------

    I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is
    not continuous: f.e. for mss=1095 it needs an initial window of 1095*4,
    but for mss=1096 it is 1096*3. We do not know exactly what mss the
    sender used for its calculations. If we advertised 1096 (and calculated
    an initial window of 3*1096), the sender could limit its mss to some
    value < 1096 and then need a window of his_mss*4 > 3*1096 to send its
    initial burst.

    See?

    So, the honest function for initial rcv_wnd derived from
    tcp_init_cwnd() is:

        init_rcv_wnd(mss) =
            min { init_cwnd(mss1)*mss1 for mss1 <= mss }

    which works out to roughly:

        if (mss < 1096)
            return mss*4;
        if (mss < 1096*2)
            return 1096*4;
        return mss*2;

    (I just scribbled a graph on a piece of paper; it is difficult to see
    or to explain without it.)

    I selected it differently, giving more window than is strictly
    required. The initial receive window must be large enough to allow a
    sender following the RFC (or just setting initial cwnd to 2) to send
    its initial burst. But besides that it is arbitrary, so I decided to
    give slack space of one segment.

    Actually, the logic was:

    If mss is low/normal (below the ~1096-byte cutoff), advertise mss*4,
    i.e. one segment of slack over the largest initial burst; for larger
    msses, fall back to the smaller multiples given by the function above.

    David S. Miller
     

29 Sep, 2005

1 commit


27 Sep, 2005

1 commit

  • When you've enabled conntrack and NAT as modules (the standard case in all
    distributions), and you've also enabled the new conntrack netlink
    interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
    This causes a huge performance penalty, since every packet then iterates
    through the NAT code, even if you don't want it.

    This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
    iptables frontend (iptable_nat.ko). Therefore, ip_conntrack_netlink.ko will
    only pull in ip_nat.ko, but not the frontend. ip_nat.ko will "only" allocate
    some resources, but not affect runtime performance.

    This separation is also a nice step in anticipation of new packet filters
    (nf-hipac, ipset, pkttables) being able to use the NAT core.

    Signed-off-by: Harald Welte
    Signed-off-by: David S. Miller

    Harald Welte
     

25 Sep, 2005

2 commits


23 Sep, 2005

1 commit

  • This patch fixes a number of bugs. It cannot reasonably be split up into
    multiple fixes, since all the bugs interact with each other and affect the
    same function:

    Bug #1:
    The event cache code cannot be called while a lock is held. Therefore, the
    call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
    moved outside of the locked section. This fixes a number of 2.6.14-rcX
    oops and deadlock reports.

    Bug #2:
    We used to call ct_add_counters() for unconfirmed connections without
    holding a lock. Since the add operations are not atomic, we could race
    with another CPU.

    Bug #3:
    ip_ct_refresh_acct() lost REFRESH events in some cases where a refresh
    (and the corresponding event) is desired but no accounting shall be
    performed. Both events and accounting implicitly depended on the skb
    parameter being non-null. We now re-introduce a non-accounting
    "ip_ct_refresh()" variant to explicitly state the desired behaviour.

    Signed-off-by: Harald Welte
    Signed-off-by: David S. Miller

    Harald Welte
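    A heavily simplified sketch of the bug #1 fix (the real function also
    handles the accounting and unconfirmed-connection cases): the event is
    recorded only after ip_conntrack_lock has been released:

        void ip_ct_refresh_acct(struct ip_conntrack *ct,
                                enum ip_conntrack_info ctinfo,
                                const struct sk_buff *skb,
                                unsigned long extra_jiffies)
        {
            int do_event = 0;

            write_lock_bh(&ip_conntrack_lock);
            /* ... update ct->timeout and the accounting counters ... */
            do_event = 1;
            write_unlock_bh(&ip_conntrack_lock);

            /* the event cache code must not run with the lock held */
            if (do_event)
                ip_conntrack_event_cache(IPCT_REFRESH, skb);
        }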