Eric Lee / smarc-fsl-linux-kernel

07 Nov, 2007

2 commits

230140cff [INET]: Remove per bucket rwlock in tcp/dccp ehash table.

As done two years ago on IP route cache table (commit
22c047ccbc68fa8f3fa57f0e8f906479a062c426) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.

On a typical x86_64 platform, this saves about 2MB or 4MB of RAM, for
little performance difference. (We hit a different cache line for the
rwlock, but then the bucket cache line has a better sharing factor
among CPUs, since we dirty it less often.) For netstat or ss commands
that want a full scan of the hash table, we perform fewer memory accesses.

Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.
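A minimal user-space sketch of the hashed-lock idea described above. It is not the kernel code: NR_LOCKS, struct hashed_lock and the helper names are hypothetical, and a simple atomic_flag spinlock stands in for rwlock_t. It only shows how many buckets can share a small, power-of-two table of locks.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical size: far fewer locks than hash buckets, power of two. */
#define NR_LOCKS 256u

static_assert((NR_LOCKS & (NR_LOCKS - 1)) == 0, "NR_LOCKS must be a power of two");

/* Stand-in for rwlock_t: a plain test-and-set spinlock. */
struct hashed_lock {
    atomic_flag busy;
};

static struct hashed_lock lock_table[NR_LOCKS];

/* Put every lock into the unlocked state before first use. */
static void lock_table_init(void)
{
    for (unsigned int i = 0; i < NR_LOCKS; i++)
        atomic_flag_clear(&lock_table[i].busy);
}

/* Map a bucket hash to the lock that guards it. */
static struct hashed_lock *bucket_lock(unsigned int hash)
{
    return &lock_table[hash & (NR_LOCKS - 1)];
}

/* Spin until the lock guarding this bucket is free, then take it. */
static void bucket_lock_acquire(unsigned int hash)
{
    while (atomic_flag_test_and_set_explicit(&bucket_lock(hash)->busy,
                                             memory_order_acquire))
        ; /* busy-wait */
}

/* Release the lock guarding this bucket. */
static void bucket_lock_release(unsigned int hash)
{
    atomic_flag_clear_explicit(&bucket_lock(hash)->busy, memory_order_release);
}
```

Buckets whose hashes differ by a multiple of NR_LOCKS share a lock, so the table size trades memory against contention.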

This patch provides some locking abstraction that may ease future
work on a different model for the TCP/DCCP table.

Signed-off-by: Eric Dumazet
Acked-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Eric Dumazet
2007-11-07 20:15:11 +0800
47a31a6ff [IPV4]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure

Trivial patch to make the "tcp,udp,udplite,raw" protocols use the fast
"inuse sockets" infrastructure.

Each protocol then uses a static percpu var, instead of a dynamic one.
This saves some RAM and some CPU cycles.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2007-11-07 20:08:58 +0800

02 Nov, 2007

1 commit

c46f2334c [SG] Get rid of __sg_mark_end()

sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.

So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.
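The behaviour described above can be sketched with a simplified model. The struct, the bit values and the helper names here are assumptions for illustration, not the definitions in linux/scatterlist.h: the low bits of page_link carry flags, and marking the end sets one flag without touching the page pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag layout in the low bits of page_link (illustrative only). */
#define SG_CHAIN_BIT ((uintptr_t)0x01)
#define SG_END_BIT   ((uintptr_t)0x02)

/* Simplified stand-in for one scatterlist entry. */
struct sg_entry {
    uintptr_t page_link; /* page pointer with flag bits in the low bits */
};

/* Mark the entry as the last one: set the end bit, clear a chain bit,
 * and leave the page pointer bits untouched. */
static void sg_entry_mark_end(struct sg_entry *sg)
{
    assert(sg != NULL); /* stands in for a debug sanity check */
    sg->page_link |= SG_END_BIT;
    sg->page_link &= ~SG_CHAIN_BIT;
}

/* Report whether the end bit is set. */
static int sg_entry_is_last(const struct sg_entry *sg)
{
    return (sg->page_link & SG_END_BIT) != 0;
}

/* Recover the page pointer bits by masking off the flags. */
static uintptr_t sg_entry_page(const struct sg_entry *sg)
{
    return sg->page_link & ~(SG_CHAIN_BIT | SG_END_BIT);
}
```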

Signed-off-by: Jens Axboe

Jens Axboe
2007-11-02 15:47:06 +0800

31 Oct, 2007

1 commit

51c739d1f [NET]: Fix incorrect sg_mark_end() calls.

This fixes scatterlist corruptions added by

commit 68e3f5dd4db62619fdbe520d36c9ebf62e672256
[CRYPTO] users: Fix up scatterlist conversion errors

The issue is that the code calls sg_mark_end() which clobbers the
sg_page() pointer of the final scatterlist entry.

The first part of the fix makes skb_to_sgvec() do __sg_mark_end().

After considering all skb_to_sgvec() call sites the most correct
solution is to call __sg_mark_end() in skb_to_sgvec() since that is
what all of the callers would end up doing anyways.

I suspect this might have fixed some problems in virtio_net which is
the sole non-crypto user of skb_to_sgvec().

Other similar sg_mark_end() cases were converted over to
__sg_mark_end() as well.

Arguably sg_mark_end() is a poorly named function because it doesn't
just "mark", it clears out the page pointer as a side effect, which is
what led to these bugs in the first place.

The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
and arguably it could be converted to __sg_mark_end() if only so that
we can delete this confusing interface from linux/scatterlist.h

Signed-off-by: David S. Miller

David S. Miller
2007-10-31 12:29:29 +0800

30 Oct, 2007

1 commit

b0a713e9e [TCP] MD5: Remove some more unnecessary casting.

While reviewing the tcp_md5-related code further I came across
another two of these casts which you probably missed. I don't
actually think that they pose a problem for now, but as you said we
should remove them.

Signed-off-by: Matthias M. Dellweg
Signed-off-by: David S. Miller

Matthias M. Dellweg
2007-10-30 13:37:27 +0800

26 Oct, 2007

1 commit

c7da57a18 [TCP]: Fix scatterlist handling in MD5 signature support.

Use sg_init_table() and sg_mark_end() as needed.

Signed-off-by: David S. Miller

David S. Miller
2007-10-26 15:41:21 +0800

11 Oct, 2007

2 commits

227b60f51 [INET]: local port range robustness

Expansion of original idea from Denis V. Lunev

Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.

The locking might seem like overkill, but there are
cases where a sysadmin might want to change the value in the
middle of a DoS attack.
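The two points above can be sketched in user space with C11 atomics. This is a hypothetical model, not the kernel's seqlock_t API: the variable names, the example bounds and the single-writer assumption are all illustrative. A reader retries until it sees an even, unchanged sequence number, so it never observes a low bound from one update paired with a high bound from another.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sequence counter: even means stable, odd means an update is in progress. */
static atomic_uint range_seq;
static atomic_int range_low = 32768;  /* example values only */
static atomic_int range_high = 61000;

/* Writer: reject ranges where low is not below high, then publish both
 * bounds between two sequence bumps. Assumes writers are serialised
 * elsewhere (a real seqlock also takes a spinlock on the write side). */
static int set_port_range(int low, int high)
{
    if (low >= high)
        return -1;
    atomic_fetch_add(&range_seq, 1); /* becomes odd: update begins */
    atomic_store(&range_low, low);
    atomic_store(&range_high, high);
    atomic_fetch_add(&range_seq, 1); /* becomes even: update done */
    return 0;
}

/* Reader: retry until both bounds were read under one stable sequence. */
static void get_port_range(int *low, int *high)
{
    unsigned int start;

    assert(low != NULL && high != NULL);
    do {
        start = atomic_load(&range_seq);
        *low = atomic_load(&range_low);
        *high = atomic_load(&range_high);
    } while ((start & 1u) || start != atomic_load(&range_seq));
}
```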

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2007-10-11 08:30:46 +0800
457c4cbc5 [NET]: Make /proc/net per network namespace

This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman
Signed-off-by: David S. Miller

Eric W. Biederman
2007-10-11 07:49:06 +0800

29 Sep, 2007

1 commit

f8ab18d2d [TCP]: Fix MD5 signature handling on big-endian.

Based upon a report and initial patch by Peter Lieven.

tcp4_md5sig_key and tcp6_md5sig_key need to start with
the exact same members as tcp_md5sig_key, because they
are both cast to that type by tcp_v{4,6}_md5_do_lookup().

Unfortunately tcp{4,6}_md5sig_key use a u16 for the key
length instead of a u8, which is what tcp_md5sig_key
uses. This just so happens to work by accident on
little-endian, but on big-endian it doesn't.

Instead of casting, just place tcp_md5sig_key as the first member of
the address-family specific structures, adjust the access sites, and
kill off the ugly casts.
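A small sketch of the embedding approach, with hypothetical struct and field names rather than the kernel's definitions. Placing the common structure as the first member means the shared fields have one definition, so the derived types cannot drift out of sync with it the way a u16 versus u8 key length did.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common part shared by all address families (names are illustrative). */
struct md5_key_base {
    uint8_t *key;
    uint8_t  keylen;
};

/* IPv4-specific key: the common part is embedded, not re-declared. */
struct md5_key_v4 {
    struct md5_key_base base;
    uint32_t addr;
};

/* The base sits at offset 0, so a pointer to the whole struct and a
 * pointer to the base refer to the same address. */
static_assert(offsetof(struct md5_key_v4, base) == 0,
              "base must be the first member");

/* Access the common part by member name instead of by casting. */
static struct md5_key_base *v4_key_base(struct md5_key_v4 *k)
{
    return &k->base;
}
```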

Signed-off-by: David S. Miller

David S. Miller
2007-09-29 06:18:35 +0800

03 Aug, 2007

1 commit

3516ffb0f [TCP]: Invoke tcp_sendmsg() directly, do not use inet_sendmsg().

As discovered by Evegniy Polyakov, if we try to sendmsg after
a connection reset, we can do incredibly stupid things.

The core issue is that inet_sendmsg() tries to autobind the
socket, but we should never do that for TCP. Instead we should
just go straight into TCP's sendmsg() code which will do all
of the necessary state and pending socket error checks.

TCP's sendpage already directly vectors to tcp_sendpage(), so this
merely brings sendmsg() in line with that.

Signed-off-by: David S. Miller

David S. Miller
2007-08-03 10:42:28 +0800

11 Jul, 2007

1 commit

a7ab4b501 [TCPv4]: Improve BH latency in /proc/net/tcp

Currently the code for /proc/net/tcp disables BH while iterating
over the entire established hash table. Even though we call
cond_resched_softirq for each entry, we still won't process
softirqs as regularly as we would otherwise do, which results
in poor performance when the system is loaded near capacity.

This anomaly comes from the 2.4 code where this was all in a
single function and the local_bh_disable might have made sense
as a small optimisation.

The cost of each local_bh_disable is so small when compared
against the increased latency in keeping it disabled over a
large but mostly empty TCP established hash table that we
should just move it to the individual read_lock/read_unlock
calls as we do in inet_diag.

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2007-07-11 13:06:20 +0800

13 Jun, 2007

1 commit

3d7dbeac5 [TCP]: Disable TSO if MD5SIG is enabled.

Signed-off-by: David S. Miller

David S. Miller
2007-06-13 05:36:42 +0800

08 Jun, 2007

1 commit

f0e48dbfc [TCP]: Honour sk_bound_dev_if in tcp_v4_send_ack

A time_wait socket inherits sk_bound_dev_if from the original socket,
but it is not used when sending ACK packets using ip_send_reply.

Fix by passing the oif to ip_send_reply in struct ip_reply_arg and
use it for output routing.

Signed-off-by: Patrick McHardy
Signed-off-by: David S. Miller

Patrick McHardy
2007-06-08 04:38:51 +0800

04 Jun, 2007

1 commit

584bdf8cb [IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP

Signed-off-by: Wei Dong
Signed-off-by: David S. Miller

Wei Dong
2007-06-04 09:08:50 +0800

26 Apr, 2007

10 commits

604763722 [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY

When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should
treat it as such in the stack.
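One way such a check could be expressed is sketched below. The enum names, the numeric values and the helper are assumptions made for illustration; the sketch only shows the idea that a receive path can skip verification for both states with a single test.

```c
#include <assert.h>

/* Hypothetical encoding: PARTIAL includes the UNNECESSARY bit, so one
 * bit test covers both states. */
enum csum_state {
    CSUM_NONE        = 0,
    CSUM_UNNECESSARY = 1,
    CSUM_COMPLETE    = 2,
    CSUM_PARTIAL     = 3,
};

static_assert((CSUM_PARTIAL & CSUM_UNNECESSARY) != 0,
              "PARTIAL must imply UNNECESSARY on receive");
static_assert((CSUM_COMPLETE & CSUM_UNNECESSARY) == 0,
              "COMPLETE still needs verification");

/* Return nonzero when the receive path may skip checksum verification. */
static int csum_can_skip_verify(enum csum_state s)
{
    return (s & CSUM_UNNECESSARY) != 0;
}
```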

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2007-04-26 13:28:43 +0800
663ead3bb [NET]: Use csum_start offset instead of skb_transport_header

The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload. Unfortunately,
the same pointer is also used during receive side processing.

This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.

This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums. This offset is calculated from skb->head so that
it does not have to change when skb->data changes.

No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.

For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.
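The layout described above can be sketched as follows. struct pkt_buf and its helpers are hypothetical stand-ins for sk_buff, kept only to show two points: the two 16-bit fields share the storage of the 32-bit checksum, and an offset measured from head stays valid when data moves.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a packet buffer (not the real sk_buff). */
struct pkt_buf {
    unsigned char *head; /* start of the allocated buffer, never moves */
    unsigned char *data; /* current start of packet data, moves */
    union {
        uint32_t csum;               /* full checksum value */
        struct {
            uint16_t csum_start;     /* offset from head where summing starts */
            uint16_t csum_offset;    /* offset from csum_start of the field */
        };
    };
};

/* Record a partial checksum request relative to head. */
static void set_partial_csum(struct pkt_buf *b, unsigned char *l4hdr,
                             uint16_t field_off)
{
    assert(l4hdr >= b->head);
    b->csum_start = (uint16_t)(l4hdr - b->head);
    b->csum_offset = field_off;
}

/* Recover the start of the checksummed region, independent of data. */
static unsigned char *csum_region(const struct pkt_buf *b)
{
    return b->head + b->csum_start;
}
```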

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2007-04-26 13:28:40 +0800
4103f8cd5 [TCP]: tcp_memory_pressure and tcp_socket are __read_mostly candidates

tcp_memory_pressure and tcp_socket currently share a cache line with tcp_memory_allocated, tcp_sockets_allocated.
(Very hot cache line)
It makes sense to declare these variables as __read_mostly, to avoid false sharing on SMP.

ffffffff8081d9c0 B tcp_orphan_count
ffffffff8081d9c4 B tcp_memory_allocated
ffffffff8081d9c8 B tcp_sockets_allocated
ffffffff8081d9cc B tcp_memory_pressure
ffffffff8081d9d0 b tcp_md5sig_users
ffffffff8081d9d8 b tcp_md5sig_pool
ffffffff8081d9e0 b warntime.31570
ffffffff8081d9e8 b tcp_socket

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2007-04-26 13:28:19 +0800
aa8223c7b [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th

Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Arnaldo Carvalho de Melo
2007-04-26 13:25:26 +0800
ab6a5bb6b [TCP]: Introduce tcp_hdrlen() and tcp_optlen()

The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses and to
avoid the longer, open coded equivalent.

Ditched a no-op in bnx2 in the process.

I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...
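For illustration, the arithmetic behind such helpers can be modelled as below. The struct and function names are hypothetical and operate on a bare data-offset field rather than an skb; the assert plays the role of the BUG_ON wondered about above. The TCP data offset counts 32-bit words, and a header without options is 5 words (20 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Minimum TCP header size in 32-bit words (20 bytes, no options). */
#define TCP_MIN_HDR_WORDS 5u

/* Illustrative model holding only the data offset field. */
struct tcp_header_model {
    uint8_t doff; /* header length in 32-bit words */
};

/* Total TCP header length in bytes. */
static unsigned int model_tcp_hdrlen(const struct tcp_header_model *th)
{
    return th->doff * 4u;
}

/* Length of the TCP options in bytes. */
static unsigned int model_tcp_optlen(const struct tcp_header_model *th)
{
    assert(th->doff >= TCP_MIN_HDR_WORDS); /* malformed header otherwise */
    return (th->doff - TCP_MIN_HDR_WORDS) * 4u;
}
```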

Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Arnaldo Carvalho de Melo
2007-04-26 13:25:24 +0800
88c7664f1 [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph

Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Arnaldo Carvalho de Melo
2007-04-26 13:25:23 +0800
eddc9ec53 [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Arnaldo Carvalho de Melo
2007-04-26 13:25:10 +0800
fe067e8ab [TCP]: Abstract out all write queue operations.

This allows the write queue implementation to be changed,
for example, to one which allows fast interval searching.

Signed-off-by: David S. Miller

David S. Miller
2007-04-26 13:24:02 +0800
9d729f72d [NET]: Convert xtime.tv_sec to get_seconds()

Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris
Signed-off-by: David S. Miller

James Morris
2007-04-26 13:23:32 +0800
cf4c6bf83 [TCP]: struct *sock argument renamed: sp -> sk

In general, TCP code uses "sk" for struct sock pointer.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller

Ilpo Järvinen
2007-04-26 13:23:20 +0800

11 Feb, 2007

1 commit

e905a9eda [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

YOSHIFUJI Hideaki
2007-02-11 15:19:39 +0800

09 Feb, 2007

3 commits

dbca9b275 [NET]: change layout of ehash table

The ehash table layout is currently as follows:

The first half of the table is used by sockets not in TIME_WAIT state.
The second half is used by sockets in TIME_WAIT state.

This is not optimal because, for a given hash or socket, the two chain heads
are located in separate cache lines.
Moreover the locks of the second half are never used.

If instead of this halving, we use two list heads in inet_ehash_bucket instead
of only one, we probably can avoid one cache miss, and reduce RAM usage,
particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK,
CONFIG_DEBUG_LOCK_ALLOC settings). So we still halve the table but we keep
related chains together to speed up lookups and socket state changes.
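A rough model of the proposed layout, using hypothetical types with plain singly linked lists and no lock. Both chain heads for one hash value live in the same bucket, so looking up established and TIME_WAIT sockets touches adjacent memory.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative socket node: only what the sketch needs. */
struct sock_node {
    struct sock_node *next;
    int time_wait; /* nonzero when the socket is in TIME_WAIT */
};

/* One bucket holds both chains for a given hash value. */
struct ehash_bucket {
    struct sock_node *chain;   /* sockets not in TIME_WAIT */
    struct sock_node *twchain; /* sockets in TIME_WAIT */
};

/* Push a node onto the chain that matches its state. */
static void bucket_insert(struct ehash_bucket *b, struct sock_node *n)
{
    struct sock_node **head;

    assert(b != NULL && n != NULL);
    head = n->time_wait ? &b->twchain : &b->chain;
    n->next = *head;
    *head = n;
}
```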

In this patch I did not try to align struct inet_ehash_bucket, but a future
patch could try to make this structure have a convenient size (a power of two
or a multiple of L1_CACHE_SIZE).
I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so
maybe we don't need to scratch our heads to align the bucket...

Note: In case the size of struct inet_ehash_bucket is not a power of two, we
could probably change alloc_large_system_hash() (in case it uses
__get_free_pages()) to free the unused space. It currently allocates a big
zone, but the last quarter of it could be freed. Again, this should be a
temporary 'problem'.

Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2007-02-09 06:16:46 +0800
8eb9086f2 [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller

David S. Miller
2007-02-09 04:38:45 +0800
ba7808eac [TCP]: remove tcp header from tcp_v4_check (take #2)

The tcphdr struct passed to tcp_v4_check is not used; the following
patch removes it from the parameter list.

This adds the netfilter modifications missing in the patch I sent
for rc3-mm1.

Signed-off-by: Frederik Deweerdt
Signed-off-by: David S. Miller

Frederik Deweerdt
2007-02-09 04:38:44 +0800

09 Jan, 2007

1 commit

cb48cfe80 [TCP]: Fix iov_len calculation in tcp_v4_send_ack().

This fixes the ftp stalls present in the current kernels.

All credit goes to Komuro for tracking
this down. The patch is untested but it looks *cough* obviously
correct.

Signed-off-by: Craig Schlenter
Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

Craig Schlenter
2007-01-09 16:30:08 +0800

18 Dec, 2006

2 commits

a9fc00cca [TCP]: Trivial fix to message in tcp_v4_inbound_md5_hash

The message logged in tcp_v4_inbound_md5_hash when the hash was expected
but not found was reversed.

Signed-off-by: Leigh Brown
Signed-off-by: David S. Miller

Leigh Brown
2006-12-18 13:59:26 +0800
8228a18dd [TCP]: Fix oops caused by tcp_v4_md5_do_del

md5sig_info.alloced4 must be set to zero when freeing keys4, otherwise
it will not be alloc'd again when another key is added to the same
socket by tcp_v4_md5_do_add.
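The failure mode can be modelled in user space as below. The struct and functions are hypothetical, not the kernel's md5sig_info code. The add path only allocates when entries equals alloced, so if the free path left a stale alloced count behind, the next add would write through a NULL array.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative growable key array with a separate capacity counter. */
struct key_array {
    int *keys;
    unsigned int entries; /* keys in use */
    unsigned int alloced; /* capacity of keys[] */
};

/* Add a key, growing the array by one slot when it is full. */
static int key_array_add(struct key_array *a, int key)
{
    assert(a != NULL);
    if (a->entries == a->alloced) {
        int *grown = realloc(a->keys, (a->alloced + 1) * sizeof(*grown));

        if (!grown)
            return -1;
        a->keys = grown;
        a->alloced++;
    }
    a->keys[a->entries++] = key;
    return 0;
}

/* Free all keys. Resetting alloced along with keys is the point of the
 * fix: otherwise the next add would skip allocation. */
static void key_array_del_all(struct key_array *a)
{
    assert(a != NULL);
    free(a->keys);
    a->keys = NULL;
    a->entries = 0;
    a->alloced = 0;
}
```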

Signed-off-by: Leigh Brown
Signed-off-by: David S. Miller

Leigh Brown
2006-12-18 13:59:25 +0800

03 Dec, 2006

9 commits

b6332e6cf [TCP]: Fix warnings with TCP_MD5SIG disabled.

Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller

Andrew Morton
2006-12-03 13:31:52 +0800
f5b99bcdd [NET]: Possible cleanups.

This patch contains the following possible cleanups:
- make the following needlessly global functions static:
  - ipv4/tcp.c: __tcp_alloc_md5sig_pool()
  - ipv4/tcp_ipv4.c: tcp_v4_reqsk_md5_lookup()
  - ipv4/udplite.c: udplite_rcv()
  - ipv4/udplite.c: udplite_err()
- make the following needlessly global structs static:
  - ipv4/tcp_ipv4.c: tcp_request_sock_ipv4_ops
  - ipv4/tcp_ipv4.c: tcp_sock_ipv4_specific
  - ipv6/tcp_ipv6.c: tcp_request_sock_ipv6_ops
- net/ipv{4,6}/udplite.c: remove inlines from static functions
  (gcc should know best when to inline them)

Signed-off-by: Adrian Bunk
Signed-off-by: David S. Miller

Adrian Bunk
2006-12-03 13:31:51 +0800
08dd1a506 [TCP] MD5SIG: Kill CONFIG_TCP_MD5SIG_DEBUG.

It just obfuscates the code and adds limited value. And as Adrian
Bunk noticed, it lacked Kconfig help text too, so just kill it.

Signed-off-by: David S. Miller

David S. Miller
2006-12-03 13:31:47 +0800
ff1dcadb1 [NET]: Split skb->csum

... into anonymous union of __wsum and __u32 (csum and csum_offset resp.)

Signed-off-by: Al Viro
Signed-off-by: David S. Miller

Al Viro
2006-12-03 13:27:18 +0800
8e5200f54 [NET]: Fix assorted misannotations (from md5 and udplite merges).

Signed-off-by: Al Viro
Signed-off-by: David S. Miller

Al Viro
2006-12-03 13:27:16 +0800
f6685938f [TCP_IPV4]: Use kmemdup where appropriate

Also use a variable to avoid the longish tp->md5sig_info-> use
in tcp_v4_md5_do_add.

Code diff stats:

[acme@newtoy net-2.6.20]$ codiff /tmp/tcp_ipv4.o.before /tmp/tcp_ipv4.o.after
/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp_ipv4.c:
tcp_v4_md5_do_add | -62
tcp_v4_syn_recv_sock | -32
tcp_v4_parse_md5_keys | -86
3 functions changed, 180 bytes removed
[acme@newtoy net-2.6.20]$

Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2006-12-03 13:23:54 +0800
7174259e6 [TCP_IPV4]: CodingStyle cleanups, no code change

Mostly related to the recent CONFIG_TCP_MD5SIG merge.

Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2006-12-03 13:23:53 +0800
b51655b95 [NET]: Annotate __skb_checksum_complete() and friends.

Signed-off-by: Al Viro
Signed-off-by: David S. Miller

Al Viro
2006-12-03 13:23:38 +0800
714e85be3 [IPV6]: Assorted trivial endianness annotations.

Signed-off-by: Al Viro
Signed-off-by: David S. Miller

Al Viro
2006-12-03 13:22:50 +0800