Eric Lee / smarc-fsl-linux-kernel

13 Dec, 2011

3 commits

3dc43e3e4 per-netns ipv4 sysctl_tcp_mem

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make these values
per group of processes somehow. This is achieved in the
patches that follow in this patchset.

Signed-off-by: Glauber Costa
Reviewed-by: KAMEZAWA Hiroyuki
CC: David S. Miller
CC: Eric W. Biederman
Signed-off-by: David S. Miller

Glauber Costa
2011-12-13 08:04:11 +0800
d1a4c0b37 tcp memory pressure controls

This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.

Signed-off-by: Glauber Costa
Reviewed-by: KAMEZAWA Hiroyuki
CC: Eric W. Biederman
Signed-off-by: David S. Miller

Glauber Costa
2011-12-13 08:04:10 +0800
180d8cd94 foundations of per-cgroup memory pressure controlling.

This patch replaces all uses of the struct sock fields memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem with accessor
macros. Those macros can receive either a socket argument or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.

Signed-off-by: Glauber Costa
Reviewed-by: Hiroyuki Kamezawa
CC: David S. Miller
CC: Eric W. Biederman
CC: Eric Dumazet
Signed-off-by: David S. Miller

Glauber Costa
2011-12-13 08:04:10 +0800
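
The accessor idea described in 180d8cd94 can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the struct names, the `cgrp` pointer, and the use of inline functions instead of macros are hypothetical simplifications, not the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup counters (simplified stand-in, not kernel code). */
struct cg_proto_model {
    long memory_allocated;
    bool memory_pressure;
};

/* Hypothetical protocol-wide (global) counters. */
struct proto_model {
    long memory_allocated;
    bool memory_pressure;
};

/* Hypothetical socket: cgrp is NULL when no cgroup accounting applies. */
struct sock_model {
    struct proto_model *prot;
    struct cg_proto_model *cgrp;
};

/* Accessor: callers stop touching sk->prot->memory_allocated directly,
 * so the backing counter can be chosen in one place. */
static inline long *sk_memory_allocated_ptr(struct sock_model *sk)
{
    if (sk->cgrp != NULL)
        return &sk->cgrp->memory_allocated;
    return &sk->prot->memory_allocated;
}

/* Add amt to whichever counter backs this socket; return the new value. */
static inline long sk_memory_allocated_add(struct sock_model *sk, long amt)
{
    long *counter = sk_memory_allocated_ptr(sk);

    *counter += amt;
    return *counter;
}
```

With every call site going through the accessor, switching a socket from the global counter to a per-group one needs no change in the callers.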

01 Dec, 2011

1 commit

d8a6e65f8 tcp: inherit listener congestion control for passive cnx

Rick Jones reported that a TCP_CONGESTION sockopt performed on a listener
was ignored for its child sockets: right after accept(), the
congestion control of the new socket is the system default one.

This seems to be an oversight of the initial design (quoted from Stephen).

Based on prior investigation and patch from Rick.

Reported-by: Rick Jones
Signed-off-by: Eric Dumazet
CC: Stephen Hemminger
CC: Yuchung Cheng
Tested-by: Rick Jones
Signed-off-by: David S. Miller

Eric Dumazet
2011-12-01 05:55:26 +0800

17 Nov, 2011

1 commit

709e8697a tcp: clear xmit timers in tcp_v4_syn_recv_sock()

Simon Kirby reported divide-by-zero errors in __tcp_select_window().

This happens when inet_csk_route_child_sock() returns a NULL pointer:

We free the new socket even though the keepalive timer may already have
been armed in tcp_create_openreq_child().

Fix this with a call to tcp_clear_xmit_timers().

[ This is a followup to commit 918eb39962dff (net: add missing
bh_unlock_sock() calls) ]

Reported-by: Simon Kirby
Signed-off-by: Eric Dumazet
Tested-by: Simon Kirby
Signed-off-by: David S. Miller

Eric Dumazet
2011-11-17 05:57:45 +0800

04 Nov, 2011

1 commit

918eb3996 net: add missing bh_unlock_sock() calls

Simon Kirby reported lockdep warnings and following messages :

[104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

[104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.

Another lock imbalance exists if __inet_inherit_port() returns an error,
since commit 093d282321da (tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()). A backport is also needed for
>= 2.6.37 kernels.

Reported-by: Simon Kirby
Signed-off-by: Eric Dumazet
Tested-by: Eric Dumazet
CC: Balazs Scheidler
CC: KOVACS Krisztian
Reviewed-by: Thomas Gleixner
Tested-by: Simon Kirby
Signed-off-by: David S. Miller

Eric Dumazet
2011-11-04 06:06:18 +0800

02 Nov, 2011

1 commit

73cb88ecb net: make the tcp and udp file_operations for the /proc stuff const

The tcp and udp code creates a set of struct file_operations at runtime
while it can also be done at compile time, with the added benefit of then
having these file operations be const.

The trickiest part was to get the "THIS_MODULE" reference right; the naive
method of declaring a struct in the place of registration would not work
for this reason.

Signed-off-by: Arjan van de Ven
Signed-off-by: David S. Miller

Arjan van de Ven
2011-11-02 05:56:14 +0800

24 Oct, 2011

2 commits

66b13d99d ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT

There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
- Routing using TOS as a selector
- Firewalls
- Traffic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
- We still reflect incoming TOS in RST messages.
- We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
- A patch is needed for IPv6

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-10-24 15:06:21 +0800
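
The behaviour described in 66b13d99d can be modelled in a few lines of user-space C. This is a sketch under assumptions: the structs, the `tw_tos` field name, and all function names here are hypothetical stand-ins and not the kernel's timewait code.

```c
#include <stdint.h>

/* Hypothetical TIME_WAIT state: remembers the TOS the connection used. */
struct timewait_model {
    uint8_t tw_tos;
};

/* Hypothetical reply descriptor: only the IP TOS byte matters here. */
struct ip_reply_model {
    uint8_t tos;
};

/* Save the socket's TOS when the connection enters TIME_WAIT. */
static void timewait_enter(struct timewait_model *tw, uint8_t socket_tos)
{
    tw->tw_tos = socket_tos;
}

/* ACK sent on behalf of a TIME_WAIT socket: use the remembered TOS
 * instead of reflecting the TOS of the incoming packet. */
static struct ip_reply_model timewait_build_ack(const struct timewait_model *tw,
                                                uint8_t incoming_tos)
{
    struct ip_reply_model reply;

    (void)incoming_tos; /* deliberately not reflected for ACKs */
    reply.tos = tw->tw_tos;
    return reply;
}

/* RST replies still reflect the incoming TOS, as the notes state. */
static struct ip_reply_model build_rst(uint8_t incoming_tos)
{
    struct ip_reply_model reply;

    reply.tos = incoming_tos;
    return reply;
}
```

For example, a connection that used TOS 0x10 keeps answering with 0x10 from TIME_WAIT even when a peer packet arrives marked 0xb8.
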
318cf7aaa tcp: md5: add more const attributes

Now tcp_md5_hash_header() has a const tcphdr argument, we can add more
const attributes to callers.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-10-24 14:46:04 +0800

21 Oct, 2011

1 commit

cf533ea53 tcp: add const qualifiers where possible

Adding const qualifiers to pointers can ease code review, and spot some
bugs. It might allow compiler to optimize code further.

For example, is it legal to temporarily write a null cksum into tcphdr
in tcp_md5_hash_header()? I am afraid a sniffer could catch the
temporary null value...

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-10-21 17:22:42 +0800
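
The concern raised in cf533ea53 can be illustrated with a small sketch: once the header pointer is const, hashing "with a null checksum" has to work on a private copy rather than on the live header. The header layout, the toy byte-sum hash, and the function names are hypothetical; this is not the kernel's tcp_md5_hash_header().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified header; not the kernel's struct tcphdr. */
struct hdr_model {
    uint16_t source;
    uint16_t dest;
    uint16_t check;
};

/* Toy byte sum standing in for a real hash; for illustration only. */
static uint32_t toy_hash(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* The const qualifier forbids zeroing th->check in place, so the
 * checksum is cleared in a local copy and the copy is hashed. */
static uint32_t hash_header_without_checksum(const struct hdr_model *th)
{
    struct hdr_model copy;

    memcpy(&copy, th, sizeof(copy));
    copy.check = 0;
    return toy_hash(&copy, sizeof(copy));
}
```

Because the original header is never written, nothing that observes the packet concurrently can see a temporarily zeroed checksum.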

08 Oct, 2011

1 commit

88c5100c2 Merge branch 'master' of github.com:davem330/net

Conflicts:
net/batman-adv/soft-interface.c

David S. Miller
2011-10-08 01:38:43 +0800

05 Oct, 2011

1 commit

260fcbeb1 tcp: properly handle md5sig_pool references

tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
only hold one reference to md5sig_pool, but tcp_v4_md5_do_add()
increases the use count of md5sig_pool for each peer. This patch
makes tcp_v4_md5_do_add() increase the use count only for the first
tcp md5sig peer.

Signed-off-by: Zheng Yan
Signed-off-by: David S. Miller

Yan, Zheng
2011-10-05 11:31:24 +0800
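
The reference-counting rule from 260fcbeb1 can be sketched as a tiny user-space model. All names below (`pool_users`, `peers_add`, `peers_clear`, and the struct) are hypothetical; the sketch only shows the balance between "take one reference for the first peer" and "drop one reference on clear".

```c
#include <stddef.h>

/* Models the use count of the shared md5sig pool (hypothetical). */
static int pool_users;

static void pool_get(void)
{
    pool_users++;
}

static void pool_put(void)
{
    if (pool_users > 0)
        pool_users--;
}

/* Hypothetical per-socket list of MD5 peers; only the count is modelled. */
struct md5_peers_model {
    size_t entries;
};

/* Adding a peer takes a pool reference only for the first entry. */
static void peers_add(struct md5_peers_model *list)
{
    if (list->entries == 0)
        pool_get();
    list->entries++;
}

/* Clearing the list drops exactly one reference, which matches the
 * single reference taken by the first peers_add(). */
static void peers_clear(struct md5_peers_model *list)
{
    if (list->entries > 0) {
        list->entries = 0;
        pool_put();
    }
}
```

If every added peer took its own reference while clearing dropped only one, the use count would never return to zero; the sketch keeps the two sides symmetric.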

27 Sep, 2011

1 commit

b82d1bb4f tcp: unalias tcp_skb_cb flags and ip_dsfield

struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)

Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfield

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-09-27 14:20:08 +0800

22 Sep, 2011

1 commit

8decf8687 Merge branch 'master' of github.com:davem330/net

Conflicts:
MAINTAINERS
drivers/net/Kconfig
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/wireless/iwlwifi/iwl-pci.c
drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
drivers/net/wireless/rt2x00/rt2800usb.c
drivers/net/wireless/wl12xx/main.c

David S. Miller
2011-09-22 15:23:13 +0800

16 Sep, 2011

1 commit

946cedccb tcp: Change possible SYN flooding messages

"Possible SYN flooding on port xxxx " messages can fill logs on servers.

Change logic to log the message only once per listener, and add two new
SNMP counters to track :

TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.

Based on a prior patch from Tom Herbert, and suggestions from David.

Signed-off-by: Eric Dumazet
CC: Tom Herbert
Signed-off-by: David S. Miller

Eric Dumazet
2011-09-16 02:49:43 +0800
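
The logging change in 946cedccb amounts to a per-listener "warned once" flag plus two counters. A minimal user-space sketch, assuming hypothetical struct and field names and an invented message wording (only the counter names TCPReqQFullDoCookies and TCPReqQFullDrop come from the commit message):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical listener state: remembers whether it already warned. */
struct listener_model {
    unsigned short port;
    bool synflood_warned;
};

/* Hypothetical counters mirroring the two SNMP counters named above. */
struct snmp_model {
    unsigned long tcp_req_q_full_do_cookies; /* TCPReqQFullDoCookies */
    unsigned long tcp_req_q_full_drop;       /* TCPReqQFullDrop */
};

/* Called when the request queue is full. Returns true if a syncookie
 * should be sent, false if the SYN is dropped. Counters are updated on
 * every call; the log line is printed once per listener. */
static bool syn_queue_full(struct listener_model *l, struct snmp_model *mib,
                           bool syncookies_enabled)
{
    if (syncookies_enabled)
        mib->tcp_req_q_full_do_cookies++;
    else
        mib->tcp_req_q_full_drop++;

    if (!l->synflood_warned) {
        l->synflood_warned = true;
        fprintf(stderr, "Possible SYN flooding on port %hu. %s.\n",
                l->port,
                syncookies_enabled ? "Sending cookies" : "Dropping request");
    }
    return syncookies_enabled;
}
```

The counters keep the full picture for monitoring while the log receives a single line per listener.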

18 Aug, 2011

1 commit

bdeab9919 rps: Add flag to skb to indicate rxhash is based on L4 tuple

The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet. This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2011-08-18 11:06:03 +0800

07 Aug, 2011

1 commit

6e5714eaf net: Compute protocol sequence numbers and fragment IDs using MD5.

Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is in line with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation. So the periodic
regeneration and 8-bit counter have been removed. We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 48 bits) and that is fixed here as well.

Reported-by: Dan Kaminsky
Tested-by: Willy Tarreau
Signed-off-by: David S. Miller

David S. Miller
2011-08-07 09:33:19 +0800

21 Jun, 2011

1 commit

9f6ec8d69 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

Conflicts:
drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
drivers/net/wireless/rtlwifi/pci.c
net/netfilter/ipvs/ip_vs_core.c

David S. Miller
2011-06-21 13:29:08 +0800

18 Jun, 2011

1 commit

1eddceadb net: rfs: enable RFS before first data packet is received

On Thursday 16 June 2011 at 23:38 -0400, David Miller wrote:
> From: Ben Hutchings
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >> goto discard;
> >>
> >> if (nsk != sk) {
> >> + sock_rps_save_rxhash(nsk, skb->rxhash);
> >> if (tcp_child_process(sk, nsk, skb)) {
> >> rsk = nsk;
> >> goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit. And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.

Signed-off-by: Eric Dumazet
CC: Tom Herbert
CC: Ben Hutchings
CC: Jamal Hadi Salim
Signed-off-by: David S. Miller

Eric Dumazet
2011-06-18 03:27:31 +0800

09 Jun, 2011

1 commit

9ad7c049f tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side

This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.

It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.

The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

Signed-off-by: H.K. Jerry Chu
Signed-off-by: David S. Miller

Jerry Chu
2011-06-09 08:05:30 +0800
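
The two decision rules stated in 9ad7c049f (the initRTO fallback and the cwnd reduction) can be written down as small pure functions. This is a sketch of the rules as worded in the commit message only; the function names, macros, and millisecond units are hypothetical, and the RTT-sample seeding is not modelled.

```c
#include <stdbool.h>

#define INIT_RTO_MS     1000u /* default initRTO: 1 sec */
#define FALLBACK_RTO_MS 3000u /* previous default: 3 secs */

/* initRTO is 1 sec unless the SYN or SYN-ACK was retransmitted AND
 * the TCP timestamp option is off, in which case it is 3 secs. */
static unsigned int initial_rto_ms(bool syn_retransmitted, bool timestamps_on)
{
    if (syn_retransmitted && !timestamps_on)
        return FALLBACK_RTO_MS;
    return INIT_RTO_MS;
}

/* cwnd drops to 1 only after MORE THAN ONE retransmission during the
 * three-way handshake; otherwise the initial value is kept. */
static unsigned int cwnd_after_handshake(unsigned int initial_cwnd,
                                         unsigned int handshake_retransmits)
{
    return handshake_retransmits > 1 ? 1u : initial_cwnd;
}
```

Note that a single handshake retransmission leaves cwnd untouched, and a retransmitted SYN with timestamps enabled still uses the 1 sec value.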

24 May, 2011

1 commit

71338aa7d net: convert %p usage to %pK

The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree. This patch converts users of %p in net/ to %pK. Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg
Cc: James Morris
Cc: Eric Dumazet
Cc: Thomas Graf
Cc: Eugene Teo
Cc: Kees Cook
Cc: Ingo Molnar
Cc: David S. Miller
Cc: Peter Zijlstra
Cc: Eric Paris
Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller

Dan Rosenberg
2011-05-24 13:13:12 +0800
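
The kptr_restrict rules described in 71338aa7d reduce to a small decision function. The sketch below follows the commit message's wording only; the function name is hypothetical, and treating any value other than 0 or 1 like 2 is an assumption of this sketch.

```c
#include <stdbool.h>

/* Returns true if a %pK pointer would be printed as-is, false if it
 * would be replaced with 0's, following the rules quoted above. */
static bool pk_shows_pointer(int kptr_restrict, bool has_cap_syslog)
{
    switch (kptr_restrict) {
    case 0:
        return true;           /* same behaviour as plain %p */
    case 1:
        return has_cap_syslog; /* visible only with CAP_SYSLOG */
    default:
        return false;          /* 2: hidden regardless of privileges */
    }
}
```

For instance, an unprivileged reader of a /proc file sees zeros with kptr_restrict set to 1, while a reader holding CAP_SYSLOG sees the real value.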

19 May, 2011

3 commits

a48eff128 ipv4: Pass explicit destination address to rt_bind_peer().

Signed-off-by: David S. Miller

David S. Miller
2011-05-19 06:42:43 +0800
ed2361e66 ipv4: Pass explicit destination address to rt_get_peer().

This will next trickle down to rt_bind_peer().

Signed-off-by: David S. Miller

David S. Miller
2011-05-19 06:38:54 +0800
6bd023f3d ipv4: Make caller provide flowi4 key to inet_csk_route_req().

This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.

Signed-off-by: David S. Miller

David S. Miller
2011-05-19 06:32:03 +0800

11 May, 2011

1 commit

0a5ebb800 ipv4: Pass explicit daddr arg to ip_send_reply().

This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller

David S. Miller
2011-05-11 04:32:46 +0800

09 May, 2011

3 commits

c5216cc70 tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer()

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 06:28:29 +0800
0e7344199 ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.

Operation order is now transposed, we first create the child
socket then we try to hook up the route.

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 06:28:03 +0800
da905bd1d tcp: Use cork flow in tcp_v4_connect()

Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 04:18:54 +0800

29 Apr, 2011

3 commits

d4fb3d74d ipv4: Get route daddr from flow key in tcp_v4_connect().

Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller

David S. Miller
2011-04-29 14:50:32 +0800
4071cfff8 ipv4: Fetch route saddr from flow key in tcp_v4_connect().

Now that output route lookups update the flow with
source address selection, we can fetch it from
fl4->saddr instead of rt->rt_src

Signed-off-by: David S. Miller

David S. Miller
2011-04-29 14:17:31 +0800
f6d8bd051 inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We can't insert an rcu_head in struct ip_options since it's included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet
Cc: Herbert Xu
Signed-off-by: David S. Miller

Eric Dumazet
2011-04-29 04:16:35 +0800

28 Apr, 2011

1 commit

2d7192d6c ipv4: Sanitize and simplify ip_route_{connect,newports}()

These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities is needed and how it works
in a big comment.

Signed-off-by: David S. Miller
Reviewed-by: Eric Dumazet

David S. Miller
2011-04-28 04:59:04 +0800

23 Apr, 2011

1 commit

b71d1d426 inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-04-23 02:04:14 +0800

03 Mar, 2011

1 commit

b23dd4fe4 ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller

David S. Miller
2011-03-03 06:31:35 +0800

02 Mar, 2011

1 commit

abdf7e723 ipv4: Can final ip_route_connect() arg to boolean "can_sleep".

Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller

David S. Miller
2011-03-02 06:15:24 +0800

25 Feb, 2011

1 commit

dca8b089c ipv4: Rearrange how ip_route_newports() gets port keys.

ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
from the routing cache entry's flow.

Signed-off-by: David S. Miller

David S. Miller
2011-02-25 05:38:12 +0800

21 Feb, 2011

1 commit

089c34827 tcp: Remove debug macro of TCP_CHECK_TIMER

Now, TCP_CHECK_TIMER is not used for debugging; it does nothing.
It has been there for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei
Signed-off-by: David S. Miller

Shan Wei
2011-02-21 03:10:14 +0800

11 Feb, 2011

1 commit

7a71ed899 inetpeer: Abstract address representation further.

Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in every address we store is entirely
redundant.

Signed-off-by: David S. Miller

David S. Miller
2011-02-11 05:22:28 +0800

25 Jan, 2011

1 commit

fd0273c50 tcp: fix bug in listening_get_next()

commit a8b690f98baf9fb19 (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.

st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].

We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset

Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.

Reported-by: PK
CC: Tom Herbert
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-01-25 06:41:20 +0800

27 Dec, 2010

1 commit

17f7f4d9f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

Conflicts:
net/ipv4/fib_frontend.c

David S. Miller
2010-12-27 14:37:05 +0800