07 Nov, 2007

2 commits

  • As done two years ago on the IP route cache table (commit
    22c047ccbc68fa8f3fa57f0e8f906479a062c426), we can avoid using one
    lock per hash bucket for the huge TCP/DCCP hash tables.

    On a typical x86_64 platform, this saves about 2MB or 4MB of RAM, for
    little performance difference (we hit a different cache line for the
    rwlock, but then the bucket cache line has a better sharing factor
    among CPUs, since we dirty it less often). For netstat or ss commands
    that want a full scan of the hash table, we perform fewer memory accesses.

    Using a 'small' table of hashed rwlocks should be more than enough to
    provide correct SMP concurrency between different buckets, without
    using too much memory. Sizing of this table depends on
    num_possible_cpus() and various CONFIG settings.

    This patch provides some locking abstraction that may ease future
    work using a different model for the TCP/DCCP table.

    Signed-off-by: Eric Dumazet
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Trivial patch to make the "tcp,udp,udplite,raw" protocols use the fast
    "inuse sockets" infrastructure.

    Each protocol then uses a static percpu var, instead of a dynamic one.
    This saves some RAM and some CPU cycles.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
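
The hashed lock table described in the first commit of this group can be sketched in user space as follows. This is only an illustration under stated assumptions: every `toy_` name and both table sizes are hypothetical, and a C11 atomic spinlock stands in for the kernel rwlock. The point it shows is that buckets carry no lock of their own and a small power-of-two lock array is indexed by the same hash.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sizes: many buckets, far fewer locks, both powers of two. */
#define TOY_NBUCKETS 1024u
#define TOY_NLOCKS   16u

static_assert((TOY_NBUCKETS & (TOY_NBUCKETS - 1)) == 0, "bucket count must be a power of two");
static_assert((TOY_NLOCKS & (TOY_NLOCKS - 1)) == 0, "lock count must be a power of two");

struct toy_node { struct toy_node *next; };

/* A bucket holds only its chain head; no per-bucket lock is embedded. */
struct toy_bucket { struct toy_node *chain; };

static struct toy_bucket toy_buckets[TOY_NBUCKETS];
static atomic_int toy_locks[TOY_NLOCKS];   /* 0 = unlocked, 1 = locked */

static struct toy_bucket *toy_bucket_of(unsigned int hash)
{
    return &toy_buckets[hash & (TOY_NBUCKETS - 1)];
}

/* Several buckets share one lock; a given bucket always maps to the same lock. */
static atomic_int *toy_bucket_lock(unsigned int hash)
{
    return &toy_locks[hash & (TOY_NLOCKS - 1)];
}

static void toy_insert(unsigned int hash, struct toy_node *n)
{
    atomic_int *lock = toy_bucket_lock(hash);
    struct toy_bucket *b = toy_bucket_of(hash);

    while (atomic_exchange(lock, 1))
        ;                       /* spin until the shared lock is free */
    n->next = b->chain;         /* push onto the bucket's chain */
    b->chain = n;
    atomic_store(lock, 0);      /* release */
}
```

Because the lock count divides the bucket count, two hashes that land in the same bucket always pick the same lock, so sharing locks costs some extra contention but never correctness.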
     

02 Nov, 2007

1 commit

  • sg_mark_end() overwrites the page_link information, but all users want
    __sg_mark_end() behaviour where we just set the end bit. That is the most
    natural way to use the sg list, since you'll fill it in and then mark the
    end point.

    So change sg_mark_end() to only set the termination bit. Add a sg_magic
    debug check as well, and clear a chain pointer if it is set.

    Signed-off-by: Jens Axboe

    Jens Axboe
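
The difference between overwriting page_link and only setting the end bit can be illustrated with a simplified model. This is a sketch, not the kernel code: the `toy_` names and flag macros are hypothetical, and the model only assumes that the page pointer and two flag bits share one word, with the flags in the low bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SG_CHAIN 0x1UL   /* hypothetical flag: entry is a chain pointer */
#define TOY_SG_END   0x2UL   /* hypothetical flag: entry terminates the list */
#define TOY_SG_FLAGS (TOY_SG_CHAIN | TOY_SG_END)

struct toy_sg { uintptr_t page_link; };   /* pointer and flags in one word */

static void toy_set_page(struct toy_sg *sg, void *page)
{
    /* The low two bits must be free for the flags. */
    assert(((uintptr_t)page & TOY_SG_FLAGS) == 0);
    sg->page_link = (sg->page_link & TOY_SG_FLAGS) | (uintptr_t)page;
}

static void *toy_page(const struct toy_sg *sg)
{
    return (void *)(sg->page_link & ~(uintptr_t)TOY_SG_FLAGS);
}

static int toy_is_last(const struct toy_sg *sg)
{
    return (sg->page_link & TOY_SG_END) != 0;
}

/* Clobbering variant: marks the end but throws the page pointer away. */
static void toy_mark_end_clobber(struct toy_sg *sg)
{
    sg->page_link = TOY_SG_END;
}

/* Variant matching the description above: set the end bit, clear a chain bit. */
static void toy_mark_end(struct toy_sg *sg)
{
    sg->page_link |= TOY_SG_END;
    sg->page_link &= ~(uintptr_t)TOY_SG_CHAIN;
}
```

With the second variant a caller can fill in the entry first and mark the end afterwards, which is the usage order the commit message describes as most natural.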
     

31 Oct, 2007

1 commit

  • This fixes scatterlist corruptions added by

    commit 68e3f5dd4db62619fdbe520d36c9ebf62e672256
    [CRYPTO] users: Fix up scatterlist conversion errors

    The issue is that the code calls sg_mark_end() which clobbers the
    sg_page() pointer of the final scatterlist entry.

    The first part of the fix makes skb_to_sgvec() do __sg_mark_end().

    After considering all skb_to_sgvec() call sites the most correct
    solution is to call __sg_mark_end() in skb_to_sgvec() since that is
    what all of the callers would end up doing anyways.

    I suspect this might have fixed some problems in virtio_net which is
    the sole non-crypto user of skb_to_sgvec().

    Other similar sg_mark_end() cases were converted over to
    __sg_mark_end() as well.

    Arguably sg_mark_end() is a poorly named function because it doesn't
    just "mark", it clears out the page pointer as a side effect, which is
    what led to these bugs in the first place.

    The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
    and arguably it could be converted to __sg_mark_end() if only so that
    we can delete this confusing interface from linux/scatterlist.h

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Oct, 2007

1 commit


26 Oct, 2007

1 commit


11 Oct, 2007

2 commits

  • Expansion of original idea from Denis V. Lunev

    Add robustness and locking to the local_port_range sysctl.
    1. Enforce that low < high when setting.
    2. Use seqlock to ensure atomic update.

    The locking might seem like overkill, but there are
    cases where sysadmin might want to change value in the
    middle of a DoS attack.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
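
The two steps of the local_port_range commit in this group (reject a bad range, then publish both values atomically) can be sketched with a hand-rolled sequence counter. This is a user-space approximation, not the kernel seqlock API: the `toy_` names and the initial values are illustrative, C11 atomics replace the kernel primitives, and a single writer is assumed (the kernel seqlock also serializes writers with a spinlock).

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_port_range {
    atomic_uint seq;    /* even: stable, odd: update in progress */
    atomic_int low;
    atomic_int high;
};

/* Illustrative starting values only. */
static struct toy_port_range toy_range = { 0, 32768, 61000 };

/* Step 1: enforce low < high. Step 2: update under the sequence counter. */
static int toy_set_range(int low, int high)
{
    if (low >= high)
        return -1;                          /* reject, leave range untouched */

    atomic_fetch_add(&toy_range.seq, 1);    /* seq becomes odd: readers retry */
    atomic_store(&toy_range.low, low);
    atomic_store(&toy_range.high, high);
    atomic_fetch_add(&toy_range.seq, 1);    /* seq even again: update visible */
    return 0;
}

/* Readers never block; they retry if a write overlapped their snapshot. */
static void toy_get_range(int *low, int *high)
{
    unsigned int start;

    do {
        start = atomic_load(&toy_range.seq);
        *low = atomic_load(&toy_range.low);
        *high = atomic_load(&toy_range.high);
    } while ((start & 1u) || start != atomic_load(&toy_range.seq));

    assert(*low < *high);   /* a consistent snapshot keeps the invariant */
}
```

The retry loop is what keeps a reader from pairing an old low with a new high while the range is being changed.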
     

29 Sep, 2007

1 commit

  • Based upon a report and initial patch by Peter Lieven.

    tcp4_md5sig_key and tcp6_md5sig_key need to start with
    the exact same members as tcp_md5sig_key, because they
    are both cast to that type by tcp_v{4,6}_md5_do_lookup().

    Unfortunately tcp{4,6}_md5sig_key use a u16 for the key
    length instead of a u8, which is what tcp_md5sig_key
    uses. This just so happens to work by accident on
    little-endian, but on big-endian it doesn't.

    Instead of casting, just place tcp_md5sig_key as the first member of
    the address-family specific structures, adjust the access sites, and
    kill off the ugly casts.

    Signed-off-by: David S. Miller

    David S. Miller
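
The fix described above, embedding the common structure as the first member instead of re-declaring its fields and casting, looks roughly like this. The `toy_` types and field names are hypothetical stand-ins for the structures named in the commit message, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common part shared by both address families. */
struct toy_md5_key {
    uint8_t *key;
    uint8_t keylen;
};

/*
 * Address-family specific keys embed the common part as their first member.
 * Re-declaring the fields here (for example with a uint16_t keylen) and
 * casting to struct toy_md5_key would read the wrong byte on big-endian.
 */
struct toy_md5_key4 {
    struct toy_md5_key base;
    uint32_t addr;
};

struct toy_md5_key6 {
    struct toy_md5_key base;
    uint8_t addr[16];
};

static_assert(offsetof(struct toy_md5_key4, base) == 0, "base must come first");
static_assert(offsetof(struct toy_md5_key6, base) == 0, "base must come first");

/* Generic code takes the common type; callers pass &key->base, no cast. */
static unsigned int toy_generic_keylen(const struct toy_md5_key *k)
{
    return k->keylen;
}
```

Since the compiler now sees one definition of the shared fields, a width mismatch like u16 versus u8 cannot creep in unnoticed.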
     

03 Aug, 2007

1 commit

  • As discovered by Evgeniy Polyakov, if we try to sendmsg after
    a connection reset, we can do incredibly stupid things.

    The core issue is that inet_sendmsg() tries to autobind the
    socket, but we should never do that for TCP. Instead we should
    just go straight into TCP's sendmsg() code which will do all
    of the necessary state and pending socket error checks.

    TCP's sendpage already directly vectors to tcp_sendpage(), so this
    merely brings sendmsg() in line with that.

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Jul, 2007

1 commit

  • Currently the code for /proc/net/tcp disables BH while iterating
    over the entire established hash table. Even though we call
    cond_resched_softirq for each entry, we still won't process
    softirqs as regularly as we would otherwise do, which results
    in poor performance when the system is loaded near capacity.

    This anomaly comes from the 2.4 code where this was all in a
    single function and the local_bh_disable might have made sense
    as a small optimisation.

    The cost of each local_bh_disable is so small when compared
    against the increased latency in keeping it disabled over a
    large but mostly empty TCP established hash table that we
    should just move it to the individual read_lock/read_unlock
    calls as we do in inet_diag.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

13 Jun, 2007

1 commit


08 Jun, 2007

1 commit

  • A time_wait socket inherits sk_bound_dev_if from the original socket,
    but it is not used when sending ACK packets using ip_send_reply.

    Fix by passing the oif to ip_send_reply in struct ip_reply_arg and
    use it for output routing.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

04 Jun, 2007

1 commit


26 Apr, 2007

10 commits


11 Feb, 2007

1 commit


09 Feb, 2007

3 commits

  • The ehash table layout is currently this one:

    First half of this table is used by sockets not in TIME_WAIT state
    Second half of it is used by sockets in TIME_WAIT state.

    This is not optimal because, for a given hash or socket, the two chain heads
    are located in separate cache lines.
    Moreover, the locks of the second half are never used.

    If instead of this halving, we use two list heads in inet_ehash_bucket instead
    of only one, we can probably avoid one cache miss, and reduce RAM usage,
    particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK,
    CONFIG_DEBUG_LOCK_ALLOC settings). So we still halve the table, but we keep
    related chains together to speed up lookups and socket state changes.

    In this patch I did not try to align struct inet_ehash_bucket, but a future
    patch could try to make this structure have a convenient size (a power of two
    or a multiple of L1_CACHE_SIZE).
    I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so
    maybe we don't need to scratch our heads to align the bucket...

    Note : In case struct inet_ehash_bucket is not a power of two, we could
    probably change alloc_large_system_hash() (in case it uses __get_free_pages())
    to free the unused space. It currently allocates a big zone, but the last
    quarter of it could be freed. Again, this should be a temporary 'problem'.

    Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Do this even for non-blocking sockets. This avoids the silly -EAGAIN
    that applications can see now, even for non-blocking sockets in some
    cases (f.e. connect()).

    With help from Venkat Tekkirala.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The tcphdr struct passed to tcp_v4_check is not used; the following
    patch removes it from the parameter list.

    This adds the netfilter modifications missing in the patch I sent
    for rc3-mm1.

    Signed-off-by: Frederik Deweerdt
    Signed-off-by: David S. Miller

    Frederik Deweerdt
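
The ehash layout change in the first commit of this group can be pictured with a small model: one bucket holds both the chain for live sockets and the chain for TIME_WAIT sockets, so a lookup touches a single bucket. All `toy_` names and the table size are hypothetical, and locking is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EHASH_SIZE 256u   /* hypothetical size, power of two */

static_assert((TOY_EHASH_SIZE & (TOY_EHASH_SIZE - 1)) == 0, "size must be a power of two");

struct toy_sock {
    unsigned int hash;
    int time_wait;            /* nonzero once the socket is in TIME_WAIT */
    struct toy_sock *next;
};

/* Both chain heads sit next to each other in the same bucket. */
struct toy_ehash_bucket {
    struct toy_sock *chain;     /* sockets not in TIME_WAIT */
    struct toy_sock *twchain;   /* sockets in TIME_WAIT */
};

static struct toy_ehash_bucket toy_ehash[TOY_EHASH_SIZE];

static void toy_insert(struct toy_sock *sk)
{
    struct toy_ehash_bucket *b = &toy_ehash[sk->hash & (TOY_EHASH_SIZE - 1)];
    struct toy_sock **head = sk->time_wait ? &b->twchain : &b->chain;

    sk->next = *head;
    *head = sk;
}

/* One bucket lookup covers both states; no jump to a second table half. */
static struct toy_sock *toy_lookup(unsigned int hash)
{
    struct toy_ehash_bucket *b = &toy_ehash[hash & (TOY_EHASH_SIZE - 1)];
    struct toy_sock *sk;

    for (sk = b->chain; sk; sk = sk->next)
        if (sk->hash == hash)
            return sk;
    for (sk = b->twchain; sk; sk = sk->next)
        if (sk->hash == hash)
            return sk;
    return NULL;
}
```

In the older layout described in the message, the second loop would have had to index a bucket half a table away, which is where the extra cache miss came from.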
     

09 Jan, 2007

1 commit


18 Dec, 2006

2 commits


03 Dec, 2006

9 commits