Eric Lee / smarc-fsl-linux-kernel

14 Sep, 2013

1 commit

a22eb149b net: ipv6: tcp: fix potential use after free in tcp_v6_do_rcv ... Browse Code »

[ Upstream commit 3a1c756590633c0e86df606e5c618c190926a0df ]

In tcp_v6_do_rcv() code, when processing pkt options, we soley work
on our skb clone opt_skb that we've created earlier before entering
tcp_rcv_established() on our way. However, only in condition ...

if (np->rxopt.bits.rxtclass)
np->rcv_tclass = ipv6_get_dsfield(ipv6_hdr(skb));

... we work on skb itself. As we extract every other information out
of opt_skb in ipv6_pktoptions path, this seems wrong, since skb can
already be released by tcp_rcv_established() earlier on. When we try
to access it in ipv6_hdr(), we will dereference freed skb.

[ Bug added by commit 4c507d2897bd9b ("net: implement IP_RECVTOS for
IP_PKTOPTIONS") ]

Signed-off-by: Daniel Borkmann
Cc: Eric Dumazet
Acked-by: Eric Dumazet
Acked-by: Jiri Benc
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Daniel Borkmann
2013-09-14 21:54:56 +0800

12 May, 2013

1 commit

f77d60212 ipv6: do not clear pinet6 field ... Browse Code »

We have seen multiple NULL dereferences in __inet6_lookup_established()

After analysis, I found that inet6_sk() could be NULL while the
check for sk_family == AF_INET6 was true.

Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
and TCP stacks.

Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
table, we no longer can clear pinet6 field.

This patch extends logic used in commit fcbdf09d9652c891
("net: fix nulls list corruptions in sk_prot_alloc")

TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
to make sure we do not clear pinet6 field.

At socket clone phase, we do not really care, as cloning the parent (non
NULL) pinet6 is not adding a fatal race.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-05-12 07:26:38 +0800

30 Apr, 2013

1 commit

6a5dc9e59 net: Add MIB counters for checksum errors ... Browse Code »

Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-04-30 03:14:03 +0800

08 Apr, 2013

2 commits

d978a6361 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/nfc/microread/mei.c
net/netfilter/nfnetlink_queue_core.c

Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
some cleanups are going to go on-top.

Signed-off-by: David S. Miller

David S. Miller
2013-04-08 06:37:01 +0800
50a75a891 ipv6/tcp: Stop processing ICMPv6 redirect messages ... Browse Code »

Tetja Rediske found that if the host receives an ICMPv6 redirect message
after sending a SYN+ACK, the connection will be reset.

He bisected it down to 093d04d (ipv6: Change skb->data before using
icmpv6_notify() to propagate redirect), but the origin of the bug comes
from ec18d9a26 (ipv6: Add redirect support to all protocol icmp error
handlers.). The bug simply did not trigger prior to 093d04d, because
skb->data did not point to the inner IP header and thus icmpv6_notify
did not call the correct err_handler.

This patch adds the missing "goto out;" in tcp_v6_err. After receiving
an ICMPv6 Redirect, we should not continue processing the ICMP in
tcp_v6_err, as this may trigger the removal of request-socks or setting
sk_err(_soft).

Reported-by: Tetja Rediske
Signed-off-by: Christoph Paasch
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Christoph Paasch
2013-04-08 00:36:08 +0800

21 Mar, 2013

1 commit

61816596d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull in the 'net' tree to get Daniel Borkmann's flow dissector
infrastructure change.

Signed-off-by: David S. Miller

David S. Miller
2013-03-21 00:46:26 +0800

19 Mar, 2013

1 commit

0d4f06086 tcp: dont handle MTU reduction on LISTEN socket ... Browse Code »

When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
LISTEN socket, and this socket is currently owned by the user, we
set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.

This is bad because if we clone the parent before it had a chance to
clear the flag, the child inherits the tsq_flags value, and next
tcp_release_cb() on the child will decrement sk_refcnt.

Result is that we might free a live TCP socket, as reported by
Dormando.

IPv4: Attempt to release TCP socket in state 1

Fix this issue by testing sk_state against TCP_LISTEN early, so that we
set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)

This bug was introduced in commit 563d34d05786
(tcp: dont drop MTU reduction indications)

Reported-by: dormando
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-03-19 01:31:28 +0800

18 Mar, 2013

1 commit

1a2c6181c tcp: Remove TCPCT ... Browse Code »

TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Christoph Paasch
2013-03-18 02:35:13 +0800

14 Feb, 2013

1 commit

ee684b6f2 tcp: send packets with a socket timestamp ... Browse Code »

A socket timestamp is a sum of the global tcp_time_stamp and
a per-socket offset.

A socket offset is added in places where externally visible
tcp timestamp option is parsed/initialized.

Connections in the SYN_RECV state are not supported, global
tcp_time_stamp is used for them, because repair mode doesn't support
this state. In a future it can be implemented by the similar way
as for TIME_WAIT sockets.

Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: Eric Dumazet
Cc: Pavel Emelyanov
Signed-off-by: Andrey Vagin
Signed-off-by: David S. Miller

Andrey Vagin
2013-02-14 02:22:16 +0800

06 Feb, 2013

1 commit

188d1f76d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c

The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.

The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.

The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.

The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.

Signed-off-by: David S. Miller

David S. Miller
2013-02-06 03:12:20 +0800

05 Feb, 2013

1 commit

5f1e942cb tcp: ipv6: Update MIB counters for drops ... Browse Code »

This patch updates LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS in
tcp_v6_conn_request() and tcp_v6_err(). tcp_v6_conn_request() in particular can
drop SYNs for various reasons which are not currently tracked.

Signed-off-by: Vijay Subramanian
Signed-off-by: David S. Miller

Vijay Subramanian
2013-02-05 02:06:27 +0800

24 Jan, 2013

1 commit

5ba24953e soreuseport: TCP/IPv6 implementation ... Browse Code »

Motivation for soreuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket. This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads. In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets. We have seen the disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest. With so_reusport the distribution is
uniform.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2013-01-24 02:44:01 +0800

14 Jan, 2013

1 commit

e7219858a ipv6: Use ipv6_get_dsfield() instead of ipv6_tclass(). ... Browse Code »

Commit 7a3198a8 ("ipv6: helper function to get tclass") introduced
ipv6_tclass(), but similar function is already available as
ipv6_get_dsfield().

We might be able to call ipv6_tclass() from ipv6_get_dsfield(),
but it is confusing to have two versions.

Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

YOSHIFUJI Hideaki / 吉藤英明
2013-01-14 09:17:14 +0800

07 Jan, 2013

1 commit

5d134f1c1 tcp: make sysctl_tcp_ecn namespace aware ... Browse Code »

As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl
namespace aware. The reason behind this patch is to ease the testing
of ecn problems on the internet and allows applications to tune their
own use of ecn.

Cc: Eric Dumazet
Cc: David Miller
Cc: Stephen Hemminger
Signed-off-by: Hannes Frederic Sowa
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2013-01-07 13:09:56 +0800

15 Dec, 2012

1 commit

e337e24d6 inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock ... Browse Code »

If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:

unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] kmemleak_alloc+0x21/0x3e
[] kmem_cache_alloc+0xb5/0xc5
[] sk_prot_alloc.isra.53+0x2b/0xcd
[] sk_clone_lock+0x16/0x21e
[] inet_csk_clone_lock+0x10/0x7b
[] tcp_create_openreq_child+0x21/0x481
[] tcp_v4_syn_recv_sock+0x3a/0x23b
[] tcp_check_req+0x29f/0x416
[] tcp_v4_do_rcv+0x161/0x2bc
[] tcp_v4_rcv+0x6c9/0x701
[] ip_local_deliver_finish+0x70/0xc4
[] ip_local_deliver+0x4e/0x7f
[] ip_rcv_finish+0x1fc/0x233
[] ip_rcv+0x217/0x267
[] __netif_receive_skb+0x49e/0x553
[] netif_receive_skb+0x50/0x82

This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.

This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
xfrm,...

Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.

Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

A similar approach is taken for dccp by calling dccp_done().

This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.

Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Christoph Paasch
2012-12-15 02:14:07 +0800

23 Nov, 2012

1 commit

2b9164771 ipv6: adapt connect for repair move ... Browse Code »

This is work the same as for ipv4.

All other hacks about tcp repair are in common code for ipv4 and ipv6,
so this patch is enough for repairing ipv6 connections.

Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: Pavel Emelyanov
Signed-off-by: Andrey Vagin
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Andrey Vagin
2012-11-23 04:30:14 +0800

16 Nov, 2012

4 commits

c6b641a4c ipv6: Pull IPv6 GSO registration out of the module ... Browse Code »

Sing GSO support is now separate, pull it out of the module
and make it its own init call.
Remove the cleanup functions as they are no longer called.

Signed-off-by: Vlad Yasevich
Signed-off-by: David S. Miller

Vlad Yasevich
2012-11-16 06:39:24 +0800
8663e02ab ipv6: Separate tcp offload functionality ... Browse Code »

Pull TCPv6 offload functionality into its won file in preparation
for moving it out of the module.

Signed-off-by: Vlad Yasevich
Signed-off-by: David S. Miller

Vlad Yasevich
2012-11-16 06:36:18 +0800
3336288a9 ipv6: Switch to using new offload infrastructure. ... Browse Code »

Switch IPv6 protocol to using the new GRO/GSO calls and data.

Signed-off-by: Vlad Yasevich
Signed-off-by: David S. Miller

Vlad Yasevich
2012-11-16 06:36:17 +0800
8ca896cfd ipv6: Add new offload registration infrastructure. ... Browse Code »

Create a new data structure for IPv6 protocols that holds GRO/GSO
callbacks and a new array to track the protocols that register GRO/GSO.

Signed-off-by: Vlad Yasevich
Signed-off-by: David S. Miller

Vlad Yasevich
2012-11-16 06:36:17 +0800

04 Nov, 2012

1 commit

e6c022a4f tcp: better retrans tracking for defer-accept ... Browse Code »

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Cc: Julian Anastasov
Cc: Vijay Subramanian
Cc: Elliott Hughes
Cc: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2012-11-04 02:45:00 +0800

24 Oct, 2012

1 commit

f3f121359 ipv6: tcp: clean up tcp_v6_early_demux() icsk variable ... Browse Code »

Remove an icsk variable, which by convention should refer to an
inet_connection_sock rather than an inet_sock. In the process, make
the tcp_v6_early_demux() code and formatting a bit more like
tcp_v4_early_demux(), to ease comparisons and maintenance.

Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Neal Cardwell
2012-10-24 01:03:44 +0800

13 Oct, 2012

1 commit

4c6752584 tcp: resets are misrouted ... Browse Code »

After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
sock case").. tcp resets are always lost, when routing is asymmetric.
Yes, backing out that patch will result in misrouting of resets for
dead connections which used interface binding when were alive, but we
actually cannot do anything here. What's died that's died and correct
handling normal unbound connections is obviously a priority.

Comment to comment:
> This has few benefits:
> 1. tcp_v6_send_reset already did that.

It was done to route resets for IPv6 link local addresses. It was a
mistake to do so for global addresses. The patch fixes this as well.

Actually, the problem appears to be even more serious than guaranteed
loss of resets. As reported by Sergey Soloviev , those
misrouted resets create a lot of arp traffic and huge amount of
unresolved arp entires putting down to knees NAT firewalls which use
asymmetric routing.

Signed-off-by: Alexey Kuznetsov

Alexey Kuznetsov
2012-10-13 01:52:40 +0800

02 Oct, 2012

1 commit

861b65010 tcp: gro: add checksuming helpers ... Browse Code »

skb with CHECKSUM_NONE cant currently be handled by GRO, and
we notice this deep in GRO stack in tcp[46]_gro_receive()

But there are cases where GRO can be a benefit, even with a lack
of checksums.

This preliminary work is needed to add GRO support
to tunnels.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-10-02 05:00:27 +0800

23 Sep, 2012

2 commits

016818d07 tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS ... Browse Code »

When taking SYNACK RTT samples for servers using TCP Fast Open, fix
the code to ensure that we only call tcp_valid_rtt_meas() after we
receive the ACK that completes the 3-way handshake.

Previously we were always taking an RTT sample in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
to take the RTT sample.

To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
already ensures that we only take a SYNACK RTT sample if snt_synack is
non-zero. To be careful, we only take a snt_synack timestamp when
a SYNACK transmit or retransmit succeeds.

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-09-23 03:47:10 +0800
623df484a tcp: extract code to compute SYNACK RTT ... Browse Code »

In preparation for adding another spot where we compute the SYNACK
RTT, extract this code so that it can be shared.

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-09-23 03:47:10 +0800

15 Sep, 2012

1 commit

b48b63a1f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller

David S. Miller
2012-09-15 23:43:53 +0800

06 Sep, 2012

1 commit

d013ef2ab tcp: fix possible socket refcount problem for ipv6 ... Browse Code »

commit 144d56e91044181ec0ef67aeca91e9a8b5718348
("tcp: fix possible socket refcount problem") is missing
the IPv6 part. As tcp_release_cb is shared by both protocols
we should hold sock reference for the TCP_MTU_REDUCED_DEFERRED
bit.

Signed-off-by: Julian Anastasov
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Julian Anastasov
2012-09-06 05:16:25 +0800

01 Sep, 2012

1 commit

8336886f7 tcp: TCP Fast Open Server - support TFO listeners ... Browse Code »
43

This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu
Cc: Yuchung Cheng
Cc: Neal Cardwell
Cc: Eric Dumazet
Cc: Tom Herbert
Signed-off-by: David S. Miller

Jerry Chu
2012-09-01 08:02:19 +0800

25 Aug, 2012

1 commit

e6acb3848 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller

David S. Miller
2012-08-25 06:54:37 +0800

23 Aug, 2012

1 commit

1304a7343 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2012-08-23 05:21:38 +0800

20 Aug, 2012

1 commit

fae6ef87f net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child() ... Browse Code »

This commit removes the sk_rx_dst_set calls from
tcp_create_openreq_child(), because at that point the icsk_af_ops
field of ipv6_mapped TCP sockets has not been set to its proper final
value.

Instead, to make sure we get the right sk_rx_dst_set variant
appropriate for the address family of the new connection, we have
tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
shortly after the call to tcp_create_openreq_child() returns.

This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
with the new approach.

Signed-off-by: Neal Cardwell
Reported-by: Artem Savkov
Cc: Eric Dumazet
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Neal Cardwell
2012-08-20 18:03:33 +0800

15 Aug, 2012

1 commit

a7cb5a49b userns: Print out socket uids in a user namespace aware fashion. ... Browse Code »

Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: Arnaldo Carvalho de Melo
Cc: Sridhar Samudrala
Acked-by: Vlad Yasevich
Acked-by: David S. Miller
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-08-15 12:48:06 +0800

10 Aug, 2012

2 commits

63d02d157 net: tcp: ipv6_mapped needs sk_rx_dst_set method ... Browse Code »

commit 5d299f3d3c8a2fb (net: ipv6: fix TCP early demux) added a
regression for ipv6_mapped case.

[ 67.422369] SELinux: initialized (dev autofs, type autofs), uses
genfs_contexts
[ 67.449678] SELinux: initialized (dev autofs, type autofs), uses
genfs_contexts
[ 92.631060] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 92.631435] IP: [< (null)>] (null)
[ 92.631645] PGD 0
[ 92.631846] Oops: 0010 [#1] SMP
[ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
uhci_hcd
[ 92.634294] CPU 0
[ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
[ 92.634294] RIP: 0010:[] [< (null)>]
(null)
[ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282
[ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
0000000000000000
[ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
ffff88024827ad00
[ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
0000000000000000
[ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
ffff880254735380
[ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
7fffffffffff0218
[ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
knlGS:0000000000000000
[ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
00000000000007f0
[ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
task ffff880254b8cac0)
[ 92.634294] Stack:
[ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
ffff880245fc7d68
[ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8
ffff880245fc7d18
[ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002
00000000000000ff
[ 92.634294] Call Trace:
[ 92.634294] [] ? tcp_finish_connect+0x2c/0xfa
[ 92.634294] [] tcp_rcv_state_process+0x2b6/0x9c6
[ 92.634294] [] ? sched_clock_cpu+0xc3/0xd1
[ 92.634294] [] ? local_clock+0x2b/0x3c
[ 92.634294] [] tcp_v4_do_rcv+0x63a/0x670
[ 92.634294] [] release_sock+0x128/0x1bd
[ 92.634294] [] __inet_stream_connect+0x1b1/0x352
[ 92.634294] [] ? lock_sock_nested+0x74/0x7f
[ 92.634294] [] ? wake_up_bit+0x25/0x25
[ 92.634294] [] ? lock_sock_nested+0x74/0x7f
[ 92.634294] [] ? inet_stream_connect+0x22/0x4b
[ 92.634294] [] inet_stream_connect+0x33/0x4b
[ 92.634294] [] sys_connect+0x78/0x9e
[ 92.634294] [] ? sysret_check+0x1b/0x56
[ 92.634294] [] ? __audit_syscall_entry+0x195/0x1c8
[ 92.634294] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 92.634294] [] system_call_fastpath+0x16/0x1b
[ 92.634294] Code: Bad RIP value.
[ 92.634294] RIP [< (null)>] (null)
[ 92.634294] RSP
[ 92.634294] CR2: 0000000000000000
[ 92.648982] ---[ end trace 24e2bed94314c8d9 ]---
[ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt

Fix this using inet_sk_rx_dst_set(), and export this function in case
IPv6 is modular.

Reported-by: Andrew Morton
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-10 11:56:09 +0800
a399a8053 time: jiffies_delta_to_clock_t() helper to the rescue ... Browse Code »

Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.

This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().

This function has an overflow in :

return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);

commit cbbc719fccdb8cb (time: Change jiffies_to_clock_t() argument type
to unsigned long) only got around the problem.

As we cant output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that caps the negative delta to a 0 value.

Signed-off-by: Eric Dumazet
Reported-by: Maciej Żenczykowski
Cc: Thomas Gleixner
Cc: Paul Gortmaker
Cc: Andrew Morton
Cc: hank
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-10 07:17:03 +0800

07 Aug, 2012

1 commit

5d299f3d3 net: ipv6: fix TCP early demux ... Browse Code »
43

IPv6 needs a cookie in dst_check() call.

We need to add rx_dst_cookie and provide a family independent
sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-07 04:33:21 +0800

01 Aug, 2012

2 commits

99a1dec70 net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket ... Browse Code »

Introduce sk_gfp_atomic(), this function allows to inject sock specific
flags to each sock related allocation. It is only used on allocation
paths that may be required for writing pages back to network storage.

[davem@davemloft.net: Use sk_gfp_atomic only when necessary]
Signed-off-by: Peter Zijlstra
Signed-off-by: Mel Gorman
Acked-by: David S. Miller
Cc: Neil Brown
Cc: Mike Christie
Cc: Eric B Munson
Cc: Eric Dumazet
Cc: Sebastian Andrzej Siewior
Cc: Mel Gorman
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-08-01 09:42:46 +0800
c255a4580 memcg: rename config variables ... Browse Code »

Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Hugh Dickins
Cc: Tejun Heo
Cc: Aneesh Kumar K.V
Cc: David Rientjes
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-08-01 09:42:43 +0800

27 Jul, 2012

1 commit

c7109986d ipv6: Early TCP socket demux ... Browse Code »

This is the IPv6 missing bits for infrastructure added in commit
41063e9dd1195 (ipv4: Early TCP socket demux.)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-27 06:50:39 +0800

23 Jul, 2012

1 commit

563d34d05 tcp: dont drop MTU reduction indications ... Browse Code »

ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)

One example is the ipgre_tunnel_xmit() calling
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on
sockets hitting dst_allfrag).

Problem of such fix is that it relied on retransmit timers, so short tcp
sessions paid a too big latency increase price.

This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.

Reported-by: Maciej Żenczykowski
Diagnosed-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Cc: Nandita Dukkipati
Cc: Tom Herbert
Cc: Tore Anderson
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-23 15:58:46 +0800