15 Nov, 2007

3 commits

  • On commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b:

    [IPV4]: Convert rt_check_expire() from softirq processing to workqueue.

    we converted rt_check_expire() from softirq to workqueue
    processing, allowing the function to perform all the work it was
    supposed to do.

    When the IP route cache is big, rt_check_expire() can take a long
    time to run. (Default settings: 20% of the hash table is scanned
    at each invocation.)

    Adding cond_resched() helps give the CPU to higher-priority tasks
    when necessary.

    Using a "if (need_resched())" test before calling "cond_resched();" is
    necessary to avoid spending too much time doing the resched check.
    (My tests gave a time reduction from 88 ms to 25 ms per
    rt_check_expire() run on my i686 test machine)
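
    A minimal sketch of the pattern (the loop body is hypothetical;
    the real function walks the route cache hash table):

        for (i = 0; i < nr_buckets; i++) {
                /* ... expire stale entries in bucket i ... */
                if (need_resched())     /* cheap flag test first       */
                        cond_resched(); /* yield only if someone waits */
        }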

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • I broke this in commit 3de96471bd7fb76406e975ef6387abe3a0698149:

    [TCP]: Wrap-safed reordering detection FRTO check

    tcp_process_frto should always see a valid frto_highmark. An
    invalid frto_highmark (zero) is very likely what ultimately
    caused a seqno compare in tcp_frto_enter_loss to do the wrong
    thing, leading to the LOST-bit leak.

    Having LOST-bit integrity ensured, as was done after commit
    23aeeec365dcf8bc87fae44c533e50d0bb4f23cc:

    [TCP] FRTO: Plug potential LOST-bit leak

    won't hurt. It may still be useful in some other, possibly
    legitimate, scenario.
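
    For reference, the wrap-safe comparison style in question is the
    kernel's before()/after() seqno helpers (quoted from memory):

        static inline int before(__u32 seq1, __u32 seq2)
        {
                /* signed subtraction keeps the compare correct
                 * across sequence-number wrap-around */
                return (__s32)(seq1 - seq2) < 0;
        }
        #define after(seq2, seq1)       before(seq1, seq2)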

    Reported by Chazarain Guillaume.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
    A NULL ptr can be returned from tcp_write_queue_head to
    cached_skb, and then assigned to skb, if packets_out was zero.
    Without this check, the system is vulnerable to carefully
    crafted ACKs, which obviously are remotely triggerable.

    Besides, there's very little that needs to be done in sacktag if
    there weren't any packets outstanding; just skipping the rest
    doesn't hurt.
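
    The shape of the guard described above (a sketch, not necessarily
    the exact hunk):

        if (!tp->packets_out)
                goto out;       /* nothing in flight: tcp_write_queue_head()
                                 * may return NULL, so skip sacktag */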

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

14 Nov, 2007

2 commits

    It might be possible that, in some extreme scenario that I just
    cannot construct in my mind right now, end_seq …

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
    Otherwise TCP might violate the packet ordering principles that
    FRTO is based on. If the conventional recovery path is chosen,
    this won't be significant at all. In practice, any small enough
    value will be sufficient to provide proper operation for FRTO,
    yet other users of snd_cwnd might benefit from a "close enough"
    value.

    FRTO's formula is now equal to what tcp_enter_cwr() uses.
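
    That formula, as tcp_enter_cwr() computes it in this era of the
    code (quoted from memory, a sketch rather than the patch hunk):

        tp->snd_cwnd = min(tp->snd_cwnd,
                           tcp_packets_in_flight(tp) + 1U);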

    FRTO used to check application-limitedness a bit differently, but
    I changed that in commit 575ee7140dabe9b9c4f66f4f867039b97e548867,
    and as a result the check for application-limitedness became
    completely non-existent.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

13 Nov, 2007

3 commits


11 Nov, 2007

9 commits

    This patch fixes a small memory leak. Default fib rules can be
    deleted by the user if the rule does not carry the
    FIB_RULE_PERMANENT flag, e.g. by

    ip rule flush

    Such a rule will not be freed, as its ref-counter starts at 2,
    and the rule clearly becomes unreachable after removal.
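
    A hypothetical illustration of the leak (names invented, this is
    not the fib_rules code): an object born with two references but
    put only once on deletion can never reach zero:

        struct rule {
                struct list_head list;
                atomic_t refcnt;        /* starts at 2: list + default ptr */
        };

        static void rule_put(struct rule *r)
        {
                if (atomic_dec_and_test(&r->refcnt))
                        kfree(r);
        }

        static void rule_del(struct rule *r)
        {
                list_del(&r->list);
                rule_put(r);    /* drops the list reference...        */
                rule_put(r);    /* ...dropping the extra initial one,
                                 * as the fix does, makes it freeable */
        }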

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Both check for the family to select an appropriate tunnel list.
    Consolidate this check and make the for() loop more readable.
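
    The consolidated selection could look roughly like this (the
    helper name is illustrative):

        static struct xfrm_tunnel **fam_handlers(unsigned short family)
        {
                return (family == AF_INET) ? &tunnel4_handlers
                                           : &tunnel64_handlers;
        }

        /* callers then iterate one list:
         * for (h = *fam_handlers(family); h; h = h->next) ... */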

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The tunnel64_protocol uses the tunnel4_protocol's err_handler and
    thus calls the tunnel4_protocol's handlers.

    This is not very good: in case of an (icmp) error, the wrong
    error handlers will be called (e.g. the ipip ones instead of the
    sit ones) and this won't be noticed at all, because the error is
    not reported.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • There are many places that get the dst entry, increase the
    __use counter and set the "lastuse" time stamp.

    Make a helper for this.
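
    The helper ended up looking roughly like this (dst_use() in
    include/net/dst.h, quoted from memory):

        static inline void dst_use(struct dst_entry *dst, unsigned long time)
        {
                dst_hold(dst);          /* take a reference       */
                dst->__use++;           /* bump the usage counter */
                dst->lastuse = time;    /* record last-used time  */
        }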

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    Both places look like

        if (err == XXX)
                goto yyy;
    done:

    while both yyy targets look like

    yyy:
        err = XXX;
        goto done;

    so it is OK to remove the above if-s.

    The yyy labels are used in other places and are not removed.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    In case we run out of memory when fragmenting, the clearing of
    FLAG_ONLY_ORIG_SACKED might get missed, which then feeds FRTO
    with false information. Move the clearing outside the skb
    processing loop so that it gets executed even if the loop
    terminates prematurely due to out-of-mem.
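
    A sketch of the hoisting (FLAG_ONLY_ORIG_SACKED and
    tcp_for_write_queue() are real names from this era of the code;
    tcp_fragment_may_fail() is invented for illustration):

        flag &= ~FLAG_ONLY_ORIG_SACKED;         /* cleared unconditionally */
        tcp_for_write_queue(skb, sk) {
                if (tcp_fragment_may_fail(sk, skb) < 0)
                        break;          /* early out-of-mem exit can no
                                         * longer leave the flag behind */
        }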

    Besides, now the core of the loop truly deals with a single
    skb only, which also enables the creation of a more
    self-contained tcp_sacktag_one later on.

    In addition, a small reorganization of the if branches was made.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
    This fixes a subtle bug, like the one with fastpath_cnt_hint,
    happening due to the way GSO and the hints interact. Because
    hints are not reset when a GSOed skb is only partially ACKed,
    there's no guarantee that the relevant part of the write queue
    is going to be processed in sacktag at all (skbs below snd_una),
    because the fastpath hint can fast-forward the entry point.

    This was also in the way of future reductions in sacktag's skb
    processing. Also, future cleanups in sacktag can be made after
    this (in 2.6.25).

    This may make the reordering update in tcp_try_undo_partial
    redundant, but I'm not too sure, so I left it there.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
    Reordering detection fails to take into account that the
    reordered skb may have a pcount larger than 1. In that case the
    lowest of its segments has the largest reordering; the old
    formula used the highest of them, which is pcount - 1 packets
    less reordered.
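
    A worked example with invented numbers: if a reordered GSO skb
    covers pcount = 4 segments at positions 10..13 while the highest
    already-SACKed position is 20, the lowest segment implies a
    reordering of 20 - 10 = 10 segments, whereas the old formula,
    based on the highest segment, would report only 20 - 13 = 7,
    i.e. pcount - 1 = 3 packets less.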

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

07 Nov, 2007

12 commits

    As done two years ago on the IP route cache table (commit
    22c047ccbc68fa8f3fa57f0e8f906479a062c426), we can avoid using one
    lock per hash bucket for the huge TCP/DCCP hash tables.

    On a typical x86_64 platform, this saves about 2MB or 4MB of RAM,
    for little performance difference. (We hit a different cache line
    for the rwlock, but then the bucket cache line has a better
    sharing factor among cpus, since we dirty it less often.) For
    netstat or ss commands that want a full scan of the hash table,
    we perform fewer memory accesses.

    Using a 'small' table of hashed rwlocks should be more than enough to
    provide correct SMP concurrency between different buckets, without
    using too much memory. Sizing of this table depends on
    num_possible_cpus() and various CONFIG settings.

    This patch provides some locking abstraction that may ease future
    work using a different model for the TCP/DCCP table.
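
    The core idea, sketched (size and names illustrative; the real
    table is sized from num_possible_cpus() and CONFIG settings, and
    each lock is initialised with rwlock_init() at setup time):

        #define EHASH_LOCKS  256        /* small power of two */
        static rwlock_t ehash_locks[EHASH_LOCKS];

        static inline rwlock_t *bucket_lock(unsigned int bucket)
        {
                /* many buckets share one lock; correctness only needs
                 * a given bucket to always map to the same lock */
                return &ehash_locks[bucket & (EHASH_LOCKS - 1)];
        }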

    Signed-off-by: Eric Dumazet
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    This patch makes the master daemon sync the connection when it is
    about to close. This lets the connections on the backup close or
    time out according to their state. Previously, the sync was
    performed only if the connection was in the ESTABLISHED state,
    which always made the connections time out in the hard-coded 3
    minutes; Andy Gospodarek's patch ([IPVS]: use proper timeout
    instead of fixed value) effectively did nothing more than
    increase this to 15 minutes (the ESTABLISHED state timeout). So
    this patch makes use of proper timeouts, since it syncs the
    connections on status changes to FIN_WAIT (2 min timeout) and
    CLOSE (10 sec timeout). However, if the backup misses CLOSE,
    hopefully it did not miss FIN_WAIT; otherwise we will just have
    to wait for the ESTABLISHED state timeout, as we would without
    this patch. This way the number of hanging connections on the
    backup is kept to a minimum, and very few of them will be left
    to time out with a long timeout.

    This is important if we want to make use of the fix for the real server
    overcommit on master/backup fail-over.

    Signed-off-by: Rumen G. Bogdanovski
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Rumen G. Bogdanovski
     
    This patch fixes the problem with node overload on director
    fail-over. Given the scenario: 2 nodes, each accepting 3
    connections at a time, and 2 directors. If director fail-over
    occurs when the nodes are fully loaded (6 connections to the
    cluster), the new director will assign another 6 connections to
    the cluster, if the same real servers exist there.

    The problem turned out to be in not binding the inherited
    connections to the real servers (destinations) on the backup
    director. Therefore "ipvsadm -l" reports 0 connections:
    root@test2:~# ipvsadm -l
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP test2.local:5999 wlc
    -> node473.local:5999 Route 1000 0 0
    -> node484.local:5999 Route 1000 0 0

    while "ipvs -lnc" is right
    root@test2:~# ipvsadm -lnc
    IPVS connection entries
    pro expire state       source             virtual            destination
    TCP 14:56  ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999 192.168.0.51:5999
    TCP 14:59  ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999 192.168.0.52:5999

    So the patch I am sending fixes the problem by binding the
    received connections to the appropriate service on the backup
    director, if it exists; otherwise the connection is handled the
    old way. So if the master and the backup directors are
    synchronized in terms of real services, there will be no problem
    with server over-committing, since new connections will not be
    created on nonexistent real services on the backup. However, if
    the service is created later on the backup, the binding will be
    performed when the next connection update is received. With this
    patch the inherited connections will show as inactive on the
    backup:

    root@test2:~# ipvsadm -l
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP test2.local:5999 wlc
    -> node473.local:5999 Route 1000 0 1
    -> node484.local:5999 Route 1000 0 1

    rumen@test2:~$ cat /proc/net/ip_vs
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP C0A800DE:176F wlc
    -> C0A80033:176F Route 1000 0 1
    -> C0A80032:176F Route 1000 0 1

    Regards,
    Rumen Bogdanovski

    Acked-by: Julian Anastasov
    Signed-off-by: Rumen G. Bogdanovski
    Signed-off-by: Simon Horman

    Rumen G. Bogdanovski
     
  • The function crypto_alloc_comp returns an errno instead of NULL
    to indicate error. So it needs to be tested with IS_ERR.
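
    The checking pattern in question (crypto_alloc_comp() and
    IS_ERR()/PTR_ERR() are the real APIs; the surrounding lines are a
    sketch):

        struct crypto_comp *tfm;

        tfm = crypto_alloc_comp("deflate", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);    /* the errno comes back encoded
                                         * in the pointer, not as NULL */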

    This is based on a patch by Vicenç Beltran Querol.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • There are places that check for CONFIG_IP_MULTIPLE_TABLES
    twice in the same file, but the internals of these #ifdefs
    can be merged.

    As a side effect - remove one ifdef from inside a function.
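
    The shape of the cleanup (bodies are placeholders):

        /* before: two regions guarded by the same symbol */
        #ifdef CONFIG_IP_MULTIPLE_TABLES
        static int a;
        #endif
        #ifdef CONFIG_IP_MULTIPLE_TABLES
        static int b;
        #endif

        /* after: one merged region */
        #ifdef CONFIG_IP_MULTIPLE_TABLES
        static int a;
        static int b;
        #endif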

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    Trivial patch to make the "tcp,udp,udplite,raw" protocols use the
    fast "inuse sockets" infrastructure.

    Each protocol then uses a static percpu var instead of a dynamic
    one. This saves some RAM and some CPU cycles.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • "struct proto" currently uses an array stats[NR_CPUS] to track change on
    'inuse' sockets per protocol.

    If NR_CPUS is big, this means we use a big memory area for this.
    Moreover, all this memory area is located on a single node on NUMA
    machines, increasing memory pressure on the boot node.

    In this patch, I tried to:

    - Keep a fast !CONFIG_SMP implementation
    - Keep a fast CONFIG_SMP implementation for often-used protocols
      (tcp, udp, raw, ...)
    - Introduce a NUMA-efficient implementation

    Some helper macros are defined in include/net/sock.h. These
    macros take CONFIG_SMP into account.

    If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
    REF_PROTO_INUSE
    macros, it will automatically use a default implementation, using a
    dynamically allocated percpu zone.
    This default implementation will be NUMA efficient, but might use 32/64
    bytes per possible cpu
    because of current alloc_percpu() implementation.
    However it still should be better than previous implementation based on
    stats[NR_CPUS] field.

    When a "struct proto" is changed to use the new macros, we use a single
    static "int" percpu variable,
    lowering the memory and cpu costs, still preserving NUMA efficiency.
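
    Opting a protocol in looks roughly like this (a sketch; macro
    expansion details elided, see DEFINE_PROTO_INUSE /
    REF_PROTO_INUSE in include/net/sock.h):

        DEFINE_PROTO_INUSE(tcp)         /* one static int per cpu */

        struct proto tcp_prot = {
                .name   = "TCP",
                /* ... */
                REF_PROTO_INUSE(tcp)    /* wire the counter into the proto */
        };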

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    The #ifdef CONFIG_IP_MROUTE is sometimes placed inside the if-s,
    which looks completely bad. Similar ifdefs inside the functions
    look a bit better, but they are also not recommended.

    Provide an ifdef-ed ip_mroute_opt() helper to clean up the code.
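
    The helper, roughly as added to include/linux/mroute.h (quoted
    from memory, so treat the exact bounds as approximate):

        #ifdef CONFIG_IP_MROUTE
        static inline int ip_mroute_opt(int opt)
        {
                return (opt >= MRT_BASE) && (opt <= MRT_BASE + 10);
        }
        #else
        static inline int ip_mroute_opt(int opt)
        {
                return 0;
        }
        #endif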

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    The ip_push_pending_frames and ip_flush_pending_frames functions
    do the same things to flush the sock's cork. Move this into a
    separate function and save ~80 bytes of .text.
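
    A sketch of the shared helper (field names recalled from that era
    of the code, so treat them as approximate):

        static void ip_cork_release(struct inet_sock *inet)
        {
                inet->cork.flags &= ~IPCORK_OPT;
                kfree(inet->cork.opt);
                inet->cork.opt = NULL;
                if (inet->cork.rt) {
                        ip_rt_put(inet->cork.rt);
                        inet->cork.rt = NULL;
                }
        }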

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    As noticed by Paul McKenney, the rcu_dereference calls in the
    init paths of the NAT modules are unneeded; remove them.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Sort matches and targets in the NF makefiles.

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jan Engelhardt
     
  • I plan to kill ->get_info which means killing proc_net_create().

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

02 Nov, 2007

2 commits

  • sg_mark_end() overwrites the page_link information, but all users want
    __sg_mark_end() behaviour where we just set the end bit. That is the most
    natural way to use the sg list, since you'll fill it in and then mark the
    end point.

    So change sg_mark_end() to only set the termination bit. Add a sg_magic
    debug check as well, and clear a chain pointer if it is set.
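
    The resulting helper looked roughly like this (quoted from
    memory):

        static inline void sg_mark_end(struct scatterlist *sg)
        {
        #ifdef CONFIG_DEBUG_SG
                BUG_ON(sg->sg_magic != SG_MAGIC);
        #endif
                sg->page_link |= 0x02;  /* set the termination bit... */
                sg->page_link &= ~0x01; /* ...and clear any chain bit */
        }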

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    Non-architecture-specific code should not #include
    <asm/scatterlist.h>.

    This patch therefore either replaces such includes with
    #include <linux/scatterlist.h> or simply removes them if they
    were unused.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Jens Axboe

    Adrian Bunk
     

01 Nov, 2007

3 commits

    Finally, the zero_it argument can be completely removed from the
    callers and from the function prototype.

    Besides, fix the checkpatch.pl warnings about using assignments
    inside if-s (see the sketch below).

    This patch is rather big, and it is a part of the previous one. I
    split it wishing to make the patches more readable. Hope this
    particular split helped.
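
    The checkpatch.pl warning in question is the classic
    assignment-inside-if pattern; an illustrative before/after
    (alloc_skb() is just a convenient example callee):

        /* before: checkpatch.pl complains about the embedded assignment */
        if ((skb = alloc_skb(size, GFP_KERNEL)) == NULL)
                goto out;

        /* after: the assignment is pulled out of the condition */
        skb = alloc_skb(size, GFP_KERNEL);
        if (skb == NULL)
                goto out;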

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    Similar to commit 3eec0047d9bdd, the point of this is to avoid
    skipping R-bit skbs.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
    A DSACK inside another SACK block was missed if the start_seq of
    the DSACK was larger than the SACK block's, because sorting
    prioritizes full processing of the SACK block before the DSACK.
    After SACK block sorting the situation is like this:

    SSSSSSSSS
    D
    SSSSSS
    SSSSSSS

    Because write_queue is walked in-order, when the first SACK block
    has been processed, TCP is already past the skb for which the
    DSACK arrived and we haven't taught it to backtrack (nor should
    we), so TCP just continues processing by going to the next SACK
    block after the DSACK (if any).

    Whenever such a DSACK is present, do an embedded check during
    the previous SACK block.

    If the DSACK is below snd_una, there won't be an overlapping
    SACK block, and thus no problem in that case. Also, if the
    start_seq of the DSACK is equal to the actual block's, it will
    be processed first.

    Tested this by using netem to duplicate 15% of packets, and by
    printing the SACK block when found_dup_sack is true and the
    selected skb in the dup_sack = 1 branch (if taken):

    SACK block 0: 4344-5792 (relative to snd_una 2019137317)
    SACK block 1: 4344-5792 (relative to snd_una 2019137317)

    equal start seqnos => next_dup = 0, dup_sack = 1 won't occur...

    SACK block 0: 5792-7240 (relative to snd_una 2019214061)
    SACK block 1: 2896-7240 (relative to snd_una 2019214061)
    DSACK skb match 5792-7240 (relative to snd_una)

    ...and next_dup = 1 case (after the not shown start_seq sort),
    went to dup_sack = 1 branch.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

31 Oct, 2007

3 commits

  • This fixes scatterlist corruptions added by

    commit 68e3f5dd4db62619fdbe520d36c9ebf62e672256
    [CRYPTO] users: Fix up scatterlist conversion errors

    The issue is that the code calls sg_mark_end() which clobbers the
    sg_page() pointer of the final scatterlist entry.

    The first part of the fix makes skb_to_sgvec() do
    __sg_mark_end().

    After considering all skb_to_sgvec() call sites, the most correct
    solution is to call __sg_mark_end() in skb_to_sgvec(), since that
    is what all of the callers would end up doing anyway.

    I suspect this might have fixed some problems in virtio_net which is
    the sole non-crypto user of skb_to_sgvec().

    Other similar sg_mark_end() cases were converted over to
    __sg_mark_end() as well.

    Arguably sg_mark_end() is a poorly named function, because it
    doesn't just "mark": it clears out the page pointer as a side
    effect, which is what led to these bugs in the first place.

    The one remaining plain sg_mark_end() call is in
    scsi_alloc_sgtable(), and arguably it could be converted to
    __sg_mark_end() as well, if only so that we can delete this
    confusing interface from linux/scatterlist.h.

    Signed-off-by: David S. Miller

    David S. Miller
     
    It's under the CONFIG_IP_VS_LBLCR_DEBUG option, which never
    existed.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Fix links to files in Documentation/* in various Kconfig files

    Signed-off-by: Dirk Hohndel
    Signed-off-by: Linus Torvalds

    Dirk Hohndel
     

30 Oct, 2007

3 commits

    On systems with a very large amount of memory, the heuristics in
    alloc_large_system_hash() result in a very large TCP established
    hash table: 16 million entries for a 128 GB ia64 system. This
    makes reading from /proc/net/tcp pretty slow (well over a second)
    and as a result netstat is slow on these machines. I know that
    /proc/net/tcp is deprecated in favor of tcp_diag, however at the
    moment netstat only knows of the former.

    I am skeptical that such a large TCP established hash is often
    needed. Just because a system has a lot of memory doesn't imply
    that it will have several million concurrent TCP connections.
    Thus I believe that we should put an arbitrary high limit on the
    size of the TCP established hash by default. Users who really
    need a bigger hash can always use the thash_entries boot
    parameter to get more.
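
    (thash_entries takes an entry count, so e.g. booting with
    thash_entries=2097152 would restore the two-million-entry table
    discussed below.)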

    I propose 2 million entries as the arbitrary high limit. This
    makes /proc/net/tcp reasonably fast on the system in question
    (0.2 s) while still being large enough for me to be confident
    that network performance won't suffer.

    This is just one way to limit the hash size, there are others; I am not
    familiar enough with the TCP code to decide which is best. Thus, I
    would welcome the proposals of alternatives.

    [ 2 million is still too large, thus I've modified the limit in the
    change to be '512 * 1024'. -DaveM ]

    Signed-off-by: Jean Delvare
    Signed-off-by: David S. Miller

    Jean Delvare
     
    While displaying ICMP out-going statistics as the Out counters in
    /proc/net/snmp, the memory location for the ICMP in-coming
    statistics was referred to by mistake.

    Signed-off-by: Mitsuru Chinen
    Acked-by: David L Stevens
    Signed-off-by: David S. Miller

    Mitsuru Chinen
     
    While reviewing the tcp_md5-related code further, I came across
    another two of these casts which you have probably missed. I
    don't actually think that they pose a problem right now, but as
    you said, we should remove them.

    Signed-off-by: Matthias M. Dellweg
    Signed-off-by: David S. Miller

    Matthias M. Dellweg