04 Dec, 2009

2 commits

  • Both netlink and /proc/net/tcp interfaces can report transient
    negative values for rx queue.

    ss ->
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB -6 6 127.0.0.1:45956 127.0.0.1:3333

    netstat ->
    tcp 4294967290 6 127.0.0.1:37784 127.0.0.1:3333 ESTABLISHED

    This is because we don't lock the socket while computing
    tp->rcv_nxt - tp->copied_seq,
    and another CPU can update copied_seq before rcv_nxt in the RX path.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
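
    A minimal sketch of the clamp this implies when reporting the queue
    (helper name and placement are illustrative, not the actual patch):

        /* rcv_nxt and copied_seq are read without the socket lock, so
         * the difference can transiently be negative; clamp it to 0
         * before reporting it to netlink or /proc/net/tcp.
         */
        static inline int tcp_rx_queue_len(const struct tcp_sock *tp)
        {
                int answ = tp->rcv_nxt - tp->copied_seq;

                return answ < 0 ? 0 : answ;
        }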
     
  • This function walks the whole hashtable, so there is no point in
    passing it a network namespace. Instead, I purge all timewait
    sockets from dead network namespaces that I find. If the namespace
    is one of the ones I am trying to purge, I am guaranteed no new
    timewait sockets can be formed, so this will get them all. If the
    namespace is one I am not acting for, it might form a few more, but
    I will call inet_twsk_purge again shortly to get rid of them. In
    any event, if the network namespace is dead, timewait sockets are
    useless.

    Move the calls of inet_twsk_purge into batch_exit routines so
    that if I am killing a bunch of namespaces at once I will just
    call inet_twsk_purge once and save a lot of redundant unnecessary
    work.

    In my simple 4k network namespace exit test, the cleanup time dropped
    from roughly 8.2s to 1.6s, while the time spent running inet_twsk_purge
    fell to about 2ms: 1ms for ipv4 and 1ms for ipv6.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
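
    A hedged sketch of the batch-exit shape described above, using the
    ipv4 TCP names; the exact signatures are assumptions, not a quote of
    the patch:

        /* Called once for a whole list of dying namespaces, so the
         * hashtable is walked once per batch instead of once per netns.
         */
        static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
        {
                inet_twsk_purge(&tcp_hashinfo, &tcp_death_row, AF_INET);
        }

        static struct pernet_operations tcp_sk_ops = {
                .init       = tcp_sk_init,
                .exit       = tcp_sk_exit,
                .exit_batch = tcp_sk_exit_batch,
        };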
     

03 Dec, 2009

3 commits

  • Parse incoming TCP_COOKIE option(s).

    Calculate TCP_COOKIE option.

    Send optional data.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1d: define TCP cookie option, extend existing struct's
    TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1f: Initiator Cookie => Responder

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
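
    For reference, parsing a new option fits into the usual TCP option
    walk; the sketch below uses a placeholder option kind and a
    hypothetical helper, not the actual TCPCT definitions:

        const unsigned char *ptr = (const unsigned char *)(th + 1);
        int length = (th->doff * 4) - sizeof(struct tcphdr);

        while (length > 0) {
                int opcode = *ptr++;
                int opsize;

                if (opcode == TCPOPT_EOL)
                        break;
                if (opcode == TCPOPT_NOP) {
                        length--;
                        continue;
                }
                opsize = *ptr++;
                if (opsize < 2 || opsize > length)
                        break;                            /* malformed */
                if (opcode == TCPOPT_COOKIE_PAIR)         /* placeholder kind */
                        tcp_record_cookie(ptr, opsize - 2); /* hypothetical */
                ptr += opsize - 2;
                length -= opsize;
        }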
     
  • Data structures are carefully composed to require minimal additions.
    For example, the struct tcp_options_received cookie_plus variable fits
    between existing 16-bit and 8-bit variables, requiring no additional
    space (taking alignment into consideration). There are no additions to
    tcp_request_sock, and only 1 pointer in tcp_sock.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    The principal difference is using a TCP option to carry the cookie nonce,
    instead of a user-configured offset in the data. This is more flexible and
    less subject to user configuration error. Such a cookie option has been
    suggested for many years, and is also useful without SYN data, allowing
    several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

    These functions will also be used in subsequent patches that implement
    additional features.

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
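
    The alignment point can be illustrated with a toy layout (field names
    invented for the example): an 8-bit member slotted in after existing
    16-bit and 8-bit members lands in padding that is already there, so
    sizeof() does not grow.

        struct toy_options {
                u16 mss_clamp;     /* 2 bytes                           */
                u8  num_sacks;     /* 1 byte                            */
                u8  cookie_plus;   /* 1 byte - occupies what would      */
                                   /* otherwise be padding before the   */
                                   /* next 4-byte-aligned member        */
                u32 rcv_tsval;     /* starts on a 4-byte boundary       */
        };                         /* sizeof() == 8 either way          */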
     
  • Add optional function parameters associated with sending SYNACK.
    These parameters are not needed after sending SYNACK, and are not
    used for retransmission. Avoids extending struct tcp_request_sock,
    and avoids allocating kernel memory.

    Also affects DCCP as it uses common struct request_sock_ops,
    but this parameter is currently reserved for future use.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
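
    A hedged sketch of the "extra parameter instead of extra struct
    members" idea; the type and signature below are assumptions about the
    shape of the change, not a quote of it:

        /* Opaque data needed only while the SYNACK is being built, so it
         * is passed down the call chain rather than being stored in
         * tcp_request_sock.
         */
        struct request_values;

        static int tcp_v4_send_synack(struct sock *sk,
                                      struct request_sock *req,
                                      struct request_values *rvp);

        /* DCCP shares struct request_sock_ops, so its callback gains the
         * same parameter but simply ignores it for now.
         */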
     

14 Nov, 2009

2 commits

  • While investigating for network latencies, I found inet_getid() was a
    contention point for some workloads, as inet_peer_idlock is shared
    by all inet_getid() users regardless of peers.

    One way to fix this is to make ip_id_count an atomic_t instead
    of __u16, and use atomic_add_return().

    In order to keep sizeof(struct inet_peer) = 64 on 64-bit arches,
    tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
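
    A minimal sketch of the lock-free ID generation this describes (close
    to, though not guaranteed identical to, the final code):

        /* ip_id_count is now an atomic_t, so the update needs no shared
         * spinlock and is private to this peer.
         */
        static inline __u16 inet_getid(struct inet_peer *p, int more)
        {
                more++;
                return atomic_add_return(more, &p->ip_id_count) - more;
        }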
     
  • Define two symbols needed in both kernel and user space.

    Remove old (somewhat incorrect) kernel variant that wasn't used in
    most cases. Default should apply to both RMSS and SMSS (RFC2581).

    Replace numeric constants with defined symbols.

    Stand-alone patch, originally developed for TCPCT.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
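
    A hedged sketch of the kind of symbols meant here; treat the exact
    names and comments as assumptions:

        /* Shared by kernel and user space instead of scattered magic
         * numbers; the default applies to both RMSS and SMSS (RFC 2581).
         */
        #define TCP_MSS_DEFAULT  536U   /* RFC 1122 / RFC 2581 default */
        #define TCP_MSS_DESIRED 1220U   /* fits the IPv6 minimum MTU   */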
     

29 Oct, 2009

1 commit


19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer the fields used at lookup time into the first,
    read-mostly cache line (inside struct sock_common) and to move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
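
    The rename itself is mechanical (inet->daddr becomes inet->inet_daddr,
    and so on); what it buys is the sk_refcnt-style aliasing a later patch
    can then apply without name clashes. The line below is purely an
    illustration of that style, not the follow-up patch itself:

        /* Illustrative only: with the inet_ prefix, the field can later
         * be redirected into the shared, read-mostly sock_common area.
         */
        #define inet_daddr  sk.__sk_common.skc_daddr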
     

13 Oct, 2009

1 commit


15 Sep, 2009

1 commit

  • Once upon a time snd_ssthresh was a 16-bit quantity. ...That has not
    been true for a long period of time. I ran across some ancient
    compares which still seem to trust that legacy. Put all that magic
    into a single place; I have hopefully found all of them.

    Compile tested, though linking an allyesconfig is ridiculous
    nowadays, it seems.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
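
    A hedged sketch of "putting the magic into a single place": one
    sentinel plus one helper replace the scattered 16-bit-era compares
    (the symbol and helper names are assumptions):

        /* snd_ssthresh is a full 32-bit value these days. */
        #define TCP_INFINITE_SSTHRESH   0x7fffffff

        static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
        {
                return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
        }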
     

03 Sep, 2009

1 commit

  • This fixed a lockdep warning which appeared when doing stress
    memory tests over NFS:

    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

    page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

    mount_root => nfs_root_data => tcp_close => lock sk_lock =>
    tcp_send_fin => alloc_skb_fclone => page reclaim

    David raised a concern that if the allocation fails in tcp_send_fin(), and it's
    GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
    for the allocation to succeed.

    But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield()
    looks weird, but it is no worse than the implicit sleep inside GFP_KERNEL.
    Both could loop endlessly under memory pressure.

    CC: Arnaldo Carvalho de Melo
    CC: David S. Miller
    CC: Herbert Xu
    Signed-off-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Wu Fengguang
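
    The allocation pattern under discussion looks roughly like this sketch
    of tcp_send_fin()'s slow path (not a verbatim quote):

        /* GFP_ATOMIC keeps page reclaim - which may itself need this
         * socket's lock via NFS writeback - out of the picture; retry
         * and yield() instead of sleeping inside the allocator.
         */
        for (;;) {
                skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
                if (skb)
                        break;
                yield();
        }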
     

02 Sep, 2009

2 commits


01 Sep, 2009

2 commits

  • Here, an ICMP host/network unreachable message whose payload fits
    TCP's SND.UNA is taken as an indication that the RTO retransmission
    has not been lost due to congestion, but because of a route failure
    somewhere along the path.
    With true congestion, a router won't trigger such a message and the
    patched TCP will operate as standard TCP.

    This patch reverts one RTO backoff if an ICMP host/network unreachable
    message whose payload fits TCP's SND.UNA arrives.
    Based on the new RTO, the retransmission timer is reset to reflect the
    remaining time, or - if the revert clocked out the timer - a retransmission
    is sent out immediately.
    Backoffs are only reverted if TCP is in RTO loss recovery, i.e. if there
    have already been retransmissions and reversible backoffs.

    Changes from v2:
    1) Renaming of skb in tcp_v4_err() moved to another patch.
    2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
    3) Fixed code comments.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
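
    Schematically, reverting one backoff and re-arming the timer looks
    like the sketch below (simplified; how the elapsed time is obtained is
    elided, and the RTO-recovery precondition is checked by the caller):

        static void tcp_revert_one_backoff(struct sock *sk, u32 elapsed)
        {
                struct inet_connection_sock *icsk = inet_csk(sk);
                struct tcp_sock *tp = tcp_sk(sk);
                u32 remaining;

                icsk->icsk_backoff--;           /* undo one doubling */
                icsk->icsk_rto = __tcp_set_rto(tp) << icsk->icsk_backoff;

                remaining = icsk->icsk_rto - min(icsk->icsk_rto, elapsed);
                if (remaining)
                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  remaining, TCP_RTO_MAX);
                else
                        tcp_retransmit_timer(sk); /* clocked out: resend now */
        }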
     
  • This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to
    disambiguate from another sk_buff variable, which will be introduced
    in a separate patch.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
     

20 Jul, 2009

2 commits

  • When the TCP connection handshake completes on the passive
    side, a variety of state must be set up in the "child" sock,
    including the key if MD5 authentication is being used. Fix TCP
    for both address families to label the key with the peer's
    destination address, rather than the address from the listening
    sock, which is usually the wildcard.

    Reported-by: Stephen Hemminger
    Signed-off-by: John Dykstra
    Signed-off-by: David S. Miller

    John Dykstra
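
    For the IPv4 side, the fix amounts to copying the listener's key under
    the child's peer address rather than the listener's own address
    (sketch only; helper names approximate the code of that era):

        /* In tcp_v4_syn_recv_sock(), after newsk/newinet are set up: */
        key = tcp_v4_md5_do_lookup(sk, newinet->daddr);
        if (key) {
                char *newkey = kmemdup(key->key, key->keylen, GFP_ATOMIC);

                if (newkey)
                        /* label the copy with the peer's address, not the
                         * usually-wildcard address of the listening sock */
                        tcp_v4_md5_do_add(newsk, newinet->daddr,
                                          newkey, key->keylen);
        }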
     
  • Fix MD5 signature checking so that an IPv4 active open
    to an IPv6 socket can succeed. In particular, use the
    correct address family's signature generation function
    for the SYN/ACK.

    Reported-by: Stephen Hemminger
    Signed-off-by: John Dykstra
    Signed-off-by: David S. Miller

    John Dykstra
     

03 Jun, 2009

2 commits

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)

    This one should replace occurrences of:
    dst_release(skb->dst);
    skb->dst = NULL;

    Delete the skb->dst field.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
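
    The three accessors can be sketched as follows (the private field name
    is an assumption; the behaviour is as described above):

        static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
        {
                return skb->_skb_dst;
        }

        static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
        {
                skb->_skb_dst = dst;
        }

        static inline void skb_dst_drop(struct sk_buff *skb)
        {
                dst_release(skb_dst(skb));
                skb_dst_set(skb, NULL);
        }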
     
  • Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

    Delete skb->rtable field

    Setting rtable directly is not allowed; just set dst instead, as rtable
    is an alias.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
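
    The accessor is essentially a typed view of the dst pointer (sketch):

        static inline struct rtable *skb_rtable(const struct sk_buff *skb)
        {
                /* struct rtable embeds a dst_entry as its first member,
                 * so the route is just a cast of skb_dst().
                 */
                return (struct rtable *)skb_dst(skb);
        }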
     

06 May, 2009

1 commit


27 Apr, 2009

1 commit

  • On a brand new GRO skb, we cannot call ip_hdr since the header
    may lie in the non-linear area. This patch adds the helper
    skb_gro_network_header to handle this.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
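
    A sketch of the helper: when GRO kept the headers in the first
    fragment (frag0), read them from there, otherwise fall back to the
    linear area (close to, but not necessarily identical to, the real
    definition):

        static inline void *skb_gro_network_header(struct sk_buff *skb)
        {
                return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
                       skb_network_offset(skb);
        }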
     

28 Mar, 2009

1 commit

  • The current placement of the security_inet_conn_request() hooks do not allow
    individual LSMs to override the IP options of the connection's request_sock.
    This is a problem as both SELinux and Smack have the ability to use labeled
    networking protocols which make use of IP options to carry security attributes
    and the inability to set the IP options at the start of the TCP handshake is
    problematic.

    This patch moves the IPv4 security_inet_conn_request() hooks past the code
    where the request_sock's IP options are set/reset so that the LSM can safely
    manipulate the IP options as needed. This patch intentionally does not change
    the related IPv6 hooks, as IPv6-based labeling protocols which use IPv6
    options are not currently implemented; once they are, we will have a
    better idea of the correct placement for the IPv6 hooks.

    Signed-off-by: Paul Moore
    Acked-by: David S. Miller
    Signed-off-by: James Morris

    Paul Moore
     

12 Mar, 2009

1 commit

  • Some systems send SYN packets with apparently wrong RFC1323 timestamp
    option values [timestamp tsval=0 tsecr=0].
    It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml ).

    The Linux TCP stack ignores this option and sends back a SYN+ACK packet
    without the timestamp option, thus many TCP flows cannot use timestamps
    and lose some of the benefit of RFC1323.

    Other operating systems seem not to care about the initial tsval value,
    and let TCP flows negotiate the timestamp option.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
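
    The change boils down to dropping a special case of roughly this shape
    from the SYN path, so a zero tsval no longer disables timestamps for
    the whole flow (reconstructed from the description, not quoted):

        /* Old behaviour being removed (sketch): */
        if (tmp_opt.saw_tstamp && !tmp_opt.rcv_tsval) {
                /* SYN carried tsval=0: stop advertising timestamps. */
                tmp_opt.saw_tstamp = 0;
                tmp_opt.tstamp_ok  = 0;
        }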
     

23 Feb, 2009

1 commit


30 Jan, 2009

1 commit

  • Unfortunately, simplicity isn't always the best. The fraginfo
    interface turned out to be suboptimal. The problem was quite
    obvious: for every packet, we have to copy the headers from
    the frags structure into skb->head, even though for 99% of
    packets this part is immediately thrown away after the merge.

    LRO didn't have this problem because it read the headers
    directly from the frags structure.

    This patch attempts to address this by creating an interface
    that allows GRO to access the headers in the first frag without
    having to copy them. Because all drivers that use frags place the
    headers in the first frag, this optimisation should be enough.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

07 Jan, 2009

1 commit


30 Dec, 2008

1 commit

  • When we converted the protocol atomic counters, such as the orphan
    count and the total socket count, deadlocks were introduced due to
    the mismatch in BH status of the spots that used the percpu counter
    operations.

    Based on the diagnosis and patch by Peter Zijlstra, this patch
    fixes these issues by disabling BH where we may be in process
    context.

    Reported-by: Jeff Kirsher
    Tested-by: Ingo Molnar
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
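
    The fix pattern is the usual one for a percpu counter shared between
    process context and BH context (sketch; the counter shown is just one
    of the affected spots):

        /* In process context, where the same counter is also updated
         * from softirq, BH must be disabled around the update.
         */
        local_bh_disable();
        percpu_counter_inc(sk->sk_prot->orphan_count);
        local_bh_enable();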
     

16 Dec, 2008

1 commit

  • This patch adds the TCP-specific portion of GRO. The criterion for
    merging is extremely strict (the TCP header must match exactly apart
    from the checksum) so as to allow refragmentation. Otherwise this
    is pretty much identical to LRO, except that we support the merging
    of ECN packets.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
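
    The "must match exactly apart from the checksum" rule can be sketched
    as a word-wise comparison of the two TCP headers, masking only the
    flags that may legitimately differ (simplified; th/th2 are the new and
    the held header, thlen the header length including options):

        u32 flush;
        int i;

        flush  = (tcp_flag_word(th) ^ tcp_flag_word(th2)) &
                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
        flush |= th->ack_seq ^ th2->ack_seq;
        for (i = sizeof(*th); i < thlen; i += 4)    /* option words */
                flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);

        if (flush)
                goto out_check_final;       /* this pair cannot be merged */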
     

26 Nov, 2008

1 commit

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "sockets_allocated", to reduce cache line contention on
    heavy duty network servers.

    Note: we revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9
    ("net: af_unix can make unix_nr_socks visbile in /proc"),
    since it is no longer used after the addition of sock_prot_inuse_add().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Nov, 2008

1 commit

  • This is the last step to be able to perform full RCU lookups
    in __inet_lookup() : After established/timewait tables, we
    add RCU lookups to listening hash table.

    The only trick here is that a socket of a given type (TCP ipv4,
    TCP ipv6, ...) can now be in flight between two different tables
    (established and listening) during an RCU grace period, so we
    must use different 'nulls' end-of-chain values for the two tables.

    We define a large value:

    #define LISTENING_NULLS_BASE (1U << 29)

    so that slots in the listening table are guaranteed to have different
    end-of-chain values from slots in the established table. A reader can
    still detect whether it finished its lookup in the right chain.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
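
    In a lockless listener lookup the trick looks roughly like the sketch
    below: every chain ends in a tagged 'nulls' value, so a reader that got
    moved to a chain of the wrong table mid-walk notices and restarts.
    match() is a placeholder for the real scoring logic; sk, node, ilb and
    hash come from the surrounding lookup function:

        #define LISTENING_NULLS_BASE (1U << 29)

    begin:
        sk_nulls_for_each_rcu(sk, node, &ilb->head) {
                if (match(sk))                  /* placeholder */
                        goto found;
        }
        /* Established slots end in (slot); listening slots end in
         * (LISTENING_NULLS_BASE + slot). Anything else means we were
         * moved to another chain during the walk, so start over.
         */
        if (get_nulls_value(node) != (hash + LISTENING_NULLS_BASE))
                goto begin;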
     

21 Nov, 2008

1 commit

  • Now that TCP & DCCP use RCU lookups, we can convert the ehash rwlocks
    to spinlocks.

    /proc/net/tcp and other seq_file 'readers' can safely be converted to
    'writers'.

    This should speed up writers, since spin_lock()/spin_unlock()
    use only one atomic operation instead of the two used by
    write_lock()/write_unlock().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Nov, 2008

2 commits


17 Nov, 2008

1 commit

  • RCU was added to UDP lookups, using a fast infrastructure:
    - the socket kmem_cache uses SLAB_DESTROY_BY_RCU and doesn't pay the
      price of call_rcu() at freeing time.
    - hlist_nulls permits the use of few memory barriers.

    This patch uses the same infrastructure for TCP/DCCP established
    and timewait sockets.

    Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
    using short-lived TCP connections. A follow-up patch, converting
    rwlocks to spinlocks, will speed up this case even more.

    __inet_lookup_established() is pretty fast now that we don't have to
    dirty a contended cache line (read_lock/read_unlock).

    Only the established and timewait hashtables are converted to RCU
    (the bind table and listen table still use traditional locking).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
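
    With SLAB_DESTROY_BY_RCU, a socket's memory can be reused for another
    socket within the same RCU grace period, so the lockless lookup takes
    a reference and then re-checks the keys, roughly as sketched below
    (the INET_MATCH arguments follow that era's macro and are an
    assumption here):

    begin:
        sk_nulls_for_each_rcu(sk, node, &head->chain) {
                if (INET_MATCH(sk, net, hash, acookie,
                               saddr, daddr, ports, dif)) {
                        if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
                                goto begin;     /* socket is being freed */
                        if (unlikely(!INET_MATCH(sk, net, hash, acookie,
                                                 saddr, daddr, ports, dif))) {
                                sock_put(sk);   /* slab reused the memory */
                                goto begin;
                        }
                        goto found;
                }
        }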
     

03 Nov, 2008

1 commit


31 Oct, 2008

1 commit


10 Oct, 2008

1 commit

  • Maybe it's just me, but I guess those MD5 people made a mess
    out of it by having *_md5_hash_* use daddr, saddr order
    instead of the one that is natural (and equal to what the csum
    functions use). For the segment we're sending, the original
    addresses are reversed, so buff's saddr == skb's daddr and
    vice versa.

    Maybe I can finally proceed with unification of some code
    after fixing it first... :-)

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

09 Oct, 2008

1 commit

  • While looking for some common code, I came across a difference
    in checksum calculation between tcp_v6_send_(reset|ack) that I
    couldn't explain. I checked both v4 and v6 and found that
    both seem to have the same "feature". I couldn't find anything
    in the RFC, nor anywhere else, which would state that the MD5
    option should be ignored the way it was in the reset case, so I
    came to the conclusion that this is probably a genuine bug. I
    suspect that the addition of MD5 was simply fooled by the excessive
    copy-paste code in those functions and that the reset path was
    never tested well enough to find the problem.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

08 Oct, 2008

1 commit