Doug / smarc-fsl-linux-kernel | Embedian Git Server

07 Feb, 2013

1 commit

6731d2095 tcp: fix for zero packets_in_flight was too broad

There are transients during normal FRTO procedure during which
the packets_in_flight can go to zero between write_queue state
updates and firing the resulting segments out. As FRTO processing
occurs during that window the check must be more precise to
not match "spuriously" :-). More specifically, e.g., when
packets_in_flight is zero but FLAG_DATA_ACKED is true the problematic
branch that set cwnd into zero would not be taken and new segments
might be sent out later.

Signed-off-by: Ilpo Järvinen
Tested-by: Eric Dumazet
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Ilpo Järvinen
2013-02-07 04:53:03 +0800

05 Feb, 2013

1 commit

848bf15f3 tcp: Update MIB counters for drops

This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and
tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various
reasons which are not currently tracked.

Signed-off-by: Vijay Subramanian
Signed-off-by: David S. Miller

Vijay Subramanian
2013-02-05 02:06:27 +0800

04 Feb, 2013

2 commits

2e5f42121 tcp: frto should not set snd_cwnd to 0

Commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start())
uncovered a bug in FRTO code :
tcp_process_frto() is setting snd_cwnd to 0 if the number
of in flight packets is 0.

As Neal pointed out, if no packet is in flight we lost our
chance to disambiguate whether a loss timeout was spurious.

We should assume it was a proper loss.

Reported-by: Pasi Kärkkäinen
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Cc: Ilpo Järvinen
Cc: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2013-02-04 05:00:25 +0800

973ec449b tcp: fix an infinite loop in tcp_slow_start()

Since commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()),
a zero snd_cwnd triggers an infinite loop in tcp_slow_start().

Avoid this infinite loop and log a one time error for further
analysis. FRTO code is suspected to cause this bug.

Reported-by: Pasi Kärkkäinen
Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2013-02-04 05:00:25 +0800
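
Why a zero congestion window can hang a slow-start loop is easiest to see in a small userspace model. The sketch below is not the kernel's tcp_slow_start(); the struct, the function name and the return convention are hypothetical, and it only assumes that the window grows by one each time the accumulated credit reaches the current window size.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few congestion-control fields involved. */
struct cwnd_state {
    uint32_t snd_cwnd;       /* congestion window, in packets */
    uint32_t snd_cwnd_cnt;   /* credit accumulated toward the next increase */
    uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/*
 * One simplified slow-start step.
 * Returns 0 on success, -1 if snd_cwnd is zero (the loop below would
 * never terminate in that case, because subtracting 0 makes no progress).
 */
int slow_start_step(struct cwnd_state *s, uint32_t credit)
{
    uint32_t delta = 0;

    if (s->snd_cwnd == 0)
        return -1; /* guard: refuse to loop on a zero window */

    s->snd_cwnd_cnt += credit;
    while (s->snd_cwnd_cnt >= s->snd_cwnd) {
        s->snd_cwnd_cnt -= s->snd_cwnd;
        delta++;
    }

    /* Grow the window, but never past the clamp. */
    if (s->snd_cwnd + delta > s->snd_cwnd_clamp)
        s->snd_cwnd = s->snd_cwnd_clamp;
    else
        s->snd_cwnd += delta;
    return 0;
}
```

Without the guard, a state with snd_cwnd == 0 satisfies the loop condition forever. The commit above logs a one-time error at that point; the sketch simply returns an error code.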

01 Feb, 2013

1 commit

66555e92f tcp: detect SYN/data drop when F-RTO is disabled

On receiving the SYN-ACK, Fast Open checks icsk_retransmit for SYN
retransmission to detect SYN/data drops. But if F-RTO is disabled,
icsk_retransmit is reset at step D of tcp_fastretrans_alert() (
under tcp_ack()) before tcp_rcv_fastopen_synack(). The fix is to use
total_retrans instead, which accounts for SYN retransmission regardless
of the use of F-RTO.

Signed-off-by: Yuchung Cheng
Signed-off-by: David S. Miller

Yuchung Cheng
2013-02-01 03:20:07 +0800

30 Jan, 2013

1 commit

2aeef18d3 tcp: Increment LISTENOVERFLOW and LISTENDROPS in tcp_v4_conn_request()

We drop a connection request if the accept backlog is full and there are
sufficient packets in the syn queue to warrant starting drops. Increment the
appropriate counters so this isn't silent, for accurate stats and help in
debugging.

This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the
counter LINUX_MIB_LISTENOVERFLOWS.

Signed-off-by: Nivedita Singhvi
Acked-by: Vijay Subramanian
Signed-off-by: David S. Miller

Nivedita Singhvi
2013-01-30 04:43:04 +0800
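
The "superset" assumption stated in this commit can be illustrated with a small userspace model. The struct and function below are hypothetical stand-ins, not kernel code, and the single `backlog_full` flag collapses the two conditions the message describes (full accept backlog and enough entries in the SYN queue).

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the two SNMP counters. */
struct listen_stats {
    unsigned long listen_overflows; /* models LINUX_MIB_LISTENOVERFLOWS */
    unsigned long listen_drops;     /* models LINUX_MIB_LISTENDROPS */
};

/*
 * Account for one incoming connection request.
 * Returns true if the request is accepted, false if it is dropped.
 */
bool account_conn_request(struct listen_stats *st, bool backlog_full)
{
    if (backlog_full) {
        /* An overflow is also a drop, so both counters move together. */
        st->listen_overflows++;
        st->listen_drops++;
        return false;
    }
    return true;
}
```

Under this model listen_drops is always at least as large as listen_overflows, which is the relationship the commit message assumes.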

28 Jan, 2013

1 commit

5465740ac IP_GRE: Fix kernel panic in IP_GRE with GRE csum.

Due to IP_GRE GSO support, GRE can receive non-linear skbs, which
results in a panic in the GRE_CSUM case. The following patch fixes it by
using the correct csum API.

Bug introduced in commit 6b78f16e4bdde3936b (gre: add GSO support)

Signed-off-by: Pravin B Shelar
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Pravin B Shelar
2013-01-28 13:07:34 +0800

23 Jan, 2013

2 commits

b44108dbd ipv4: Fix route refcount on pmtu discovery

git commit 9cb3a50c (ipv4: Invalidate the socket cached route on
pmtu events if possible) introduced a refcount problem. We don't
get a refcount on the route if we get it from __sk_dst_get(), but
we need one if we want to reuse this route because __sk_dst_set()
releases the refcount of the old route. This patch adds proper
refcount handling for that case. We introduce a 'new' flag to
indicate that we are going to use a new route and we release the
old route only if we replace it by a new one.

Reported-by: Julian Anastasov
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2013-01-23 03:23:17 +0800

0c8729c9b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
1) The transport header did not point to the right place after
esp/ah processing on tunnel mode in the receive path. As a
result, the ECN field of the inner header was not set correctly,
fixes from Li RongQing.

2) We did a null check too late in one of the xfrm_replay advance
functions. This can lead to a division by zero, fix from
Nickolai Zeldovich.

3) The size calculation of the hash table missed the multiplication
with the actual struct size when the hash table is freed.
We might call the wrong free function, fix from Michal Kubecek.

4) On IPsec pmtu events we can't access the transport headers of
the original packet, so force a relookup for all routes
to notify about the pmtu event.
====================

Signed-off-by: David S. Miller

David S. Miller
2013-01-23 03:20:28 +0800

22 Jan, 2013

2 commits

8141ed9fc ipv4: Add a socket release callback for datagram sockets

This implements a socket release callback function to check
if the socket cached route got invalid during the time
we owned the socket. The function is used from udp, raw
and ping sockets.

Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2013-01-22 03:17:05 +0800

9cb3a50c5 ipv4: Invalidate the socket cached route on pmtu events if possible

The route lookup in ipv4_sk_update_pmtu() might return a route
different from the route we cached at the socket. This is because
standard routes are per cpu, so each cpu has its own struct rtable.
This means that we do not invalidate the socket cached route if the
NET_RX_SOFTIRQ is not served by the same cpu that the sending socket
uses. As a result, the cached route is reused until we disconnect.

With this patch we invalidate the socket cached route if possible.
If the socket is owned by the user, we can't update the cached
route directly. A followup patch will implement socket release
callback functions for datagram sockets to handle this case.

Reported-by: Yurij M. Plotnikov
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2013-01-22 03:17:05 +0800

21 Jan, 2013

2 commits

05ab86c55 xfrm4: Invalidate all ipv4 routes on IPsec pmtu events

On IPsec pmtu events we can't access the transport headers of
the original packet, so we can't find the socket that sent
the packet. The only chance to notify the socket about the
pmtu change is to force a relookup for all routes. This
patch implements this for the IPsec protocols.

Signed-off-by: Steffen Klassert

Steffen Klassert
2013-01-21 19:43:54 +0800

b74aa930e tcp: fix incorrect LOCKDROPPEDICMPS counter

commit 563d34d057 (tcp: dont drop MTU reduction indications)
added an error leading to incorrect accounting of
LINUX_MIB_LOCKDROPPEDICMPS

If socket is owned by the user, we want to increment
this SNMP counter, unless the message is a
(ICMP_DEST_UNREACH,ICMP_FRAG_NEEDED) one.

Reported-by: Maciej Żenczykowski
Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Signed-off-by: Maciej Żenczykowski
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2013-01-21 06:22:05 +0800

17 Jan, 2013

2 commits

fa1e492aa ipv4: Don't update the pmtu on mtu locked routes

Routes with a locked mtu should not use learned pmtu information,
so do not update the pmtu on these routes.

Reported-by: Julian Anastasov
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2013-01-17 16:39:36 +0800

38d523e29 ipv4: Remove output route check in ipv4_mtu

The output route check was introduced with git commit 261663b0
(ipv4: Don't use the cached pmtu informations for input routes)
during times when we cached the pmtu information on the
inetpeer. Now the pmtu information is back in the routes,
so this check is obsolete. It also had some unwanted side effects,
as reported by Timo Teras and Lukas Tribus.

Signed-off-by: Steffen Klassert
Acked-by: Timo Teräs
Signed-off-by: David S. Miller

Steffen Klassert
2013-01-17 16:39:36 +0800

11 Jan, 2013

3 commits

7b514a886 tcp: accept RST without ACK flag

commit c3ae62af8e755 (tcp: should drop incoming frames without ACK flag
set) added a regression on the handling of RST messages.

RST should be allowed to come even without ACK bit set. We validate
the RST by checking the exact sequence, as requested by RFC 793 and
5961 3.2, in tcp_validate_incoming()

Reported-by: Eric Wong
Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Tested-by: Eric Wong
Signed-off-by: David S. Miller

Eric Dumazet
2013-01-11 14:49:30 +0800
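
The rule described in this commit (drop segments without ACK, except RST) can be written as a single predicate. This is a minimal sketch with a hypothetical flag struct and function name; it deliberately leaves out the sequence-number validation that the message attributes to tcp_validate_incoming().

```c
#include <stdbool.h>

/* Hypothetical subset of the TCP header flags relevant to this check. */
struct seg_flags {
    bool ack;
    bool rst;
};

/*
 * Decide whether an incoming segment should be discarded because it
 * lacks the ACK bit. RST segments are exempt: they are validated
 * separately by their sequence number.
 */
bool discard_for_missing_ack(struct seg_flags f)
{
    return !f.ack && !f.rst;
}
```

A bare RST therefore passes this check, while a data segment without ACK does not.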

f26845b43 tcp: fix splice() and tcp collapsing interaction

Under unusual circumstances, TCP collapse can split a big GRO TCP packet
while it's being used in a splice(socket->pipe) operation.

skb_splice_bits() releases the socket lock before calling
splice_to_pipe().

[ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
[ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
[ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13

To fix this problem, we must eat skbs in tcp_recv_skb().

Remove the inline keyword from tcp_recv_skb() definition since
it has three call sites.

Reported-by: Christian Becker
Cc: Willy Tarreau
Signed-off-by: Eric Dumazet
Tested-by: Willy Tarreau
Signed-off-by: David S. Miller

Eric Dumazet
2013-01-11 06:09:57 +0800

ff905b1e4 tcp: splice: fix an infinite loop in tcp_read_sock()

commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
added a regression.

[ 83.843570] INFO: rcu_sched self-detected stall on CPU
[ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
[ 83.844582] Task dump for CPU 6:
[ 83.844584] netperf R running task 0 8966 8952 0x0000000c
[ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
[ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
[ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
[ 83.844594] Call Trace:
[ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0
[ 83.844601] [] ? schedule+0x29/0x70
[ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50
[ 83.844610] [] ? tcp_read_sock+0xda/0x260
[ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0
[ 83.844615] [] ? tcp_splice_read+0xc0/0x250
[ 83.844618] [] ? sock_splice_read+0x22/0x30
[ 83.844622] [] ? do_splice_to+0x7b/0xa0
[ 83.844627] [] ? sys_splice+0x59c/0x5d0
[ 83.844630] [] ? putname+0x2b/0x40
[ 83.844633] [] ? do_sys_open+0x174/0x1e0
[ 83.844636] [] ? system_call_fastpath+0x16/0x1b

If recv_actor() returns 0, we should stop immediately,
because looping won't give a chance to drain the pipe.

Signed-off-by: Eric Dumazet
Cc: Willy Tarreau
Signed-off-by: David S. Miller

Eric Dumazet
2013-01-11 06:07:19 +0800

09 Jan, 2013

1 commit

c9be4a5c4 net: prevent setting ttl=0 via IP_TTL

A regression is introduced by the following commit:

commit 4d52cfbef6266092d535237ba5a4b981458ab171
Author: Eric Dumazet
Date: Tue Jun 2 00:42:16 2009 -0700

net: ipv4/ip_sockglue.c cleanups

Pure cleanups

but it is not a pure cleanup...

- if (val != -1 && (val < 1 || val>255))
+ if (val != -1 && (val < 0 || val > 255))

Since there is no reason provided to allow ttl=0, change it back.

Reported-by: nitin padalia
Cc: nitin padalia
Cc: Eric Dumazet
Cc: David S. Miller
Signed-off-by: Cong Wang
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Cong Wang
2013-01-09 09:57:10 +0800
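
The restored range check can be expressed as a standalone predicate. This sketch uses a hypothetical function name and assumes that -1 is the sentinel for "use the system default TTL", which is how the original condition treats it.

```c
#include <stdbool.h>

/*
 * Mirror of the restored check: a TTL set via IP_TTL is acceptable if it
 * is the -1 sentinel (assumed to mean "use the default") or lies in
 * 1..255. Zero is rejected again.
 */
bool ip_ttl_value_ok(int val)
{
    return val == -1 || (val >= 1 && val <= 255);
}
```

This is the logical negation of the error condition `val != -1 && (val < 1 || val > 255)` shown in the diff above.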

08 Jan, 2013

1 commit

7143dfac6 ah4/esp4: set transport header correctly for IPsec tunnel mode.

IPsec tunnel does not set ECN field to CE in inner header when
the ECN field in the outer header is CE, and the ECN field in
the inner header is ECT(0) or ECT(1).

The cause is ipip_hdr() does not return the correct address of
inner header since skb->transport_header is not the inner header
after esp_input_done2(), or ah_input().

Signed-off-by: Li RongQing
Signed-off-by: Steffen Klassert

Li RongQing
2013-01-08 19:41:30 +0800

07 Jan, 2013

1 commit

c7e2e1d72 ipv4: fix NULL checking in devinet_ioctl()

The NULL pointer check `!ifa' should come before its first use.

[ Bug origin : commit fd23c3b31107e2fc483301ee923d8a1db14e53f4
(ipv4: Add hash table of interface addresses) in linux-2.6.39 ]

Signed-off-by: Xi Wang
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Xi Wang
2013-01-07 13:11:18 +0800

05 Jan, 2013

1 commit

9dd4a13a8 net/ipv4/ipconfig: really display the BOOTP/DHCP server's address.

Up to now, the debug and info messages from the ipconfig subsystem
claim to display the IP address of the DHCP/BOOTP server but
display instead the IP address of the bootserver. Fix that.

Signed-off-by: Philippe De Muyter
Signed-off-by: David S. Miller

Philippe De Muyter
2013-01-05 07:14:14 +0800

29 Dec, 2012

1 commit

ac196f8c9 Merge branch 'master' of git://1984.lsi.us.es/nf

Pablo Neira Ayuso says:

====================
The following batch contains Netfilter fixes for 3.8-rc1. They are
a mixture of old bugs that have passed unnoticed (I'll pass these to
stable) and more fresh ones from the previous merge window, they are:

* Fix for MAC address in 6in4 tunnels via NFLOG that results in ulogd
showing up wrong address, from Bob Hockney.

* Fix a comment in nf_conntrack_ipv6, from Florent Fourcot.

* Fix a leak in an error path in ctnetlink while creating an expectation,
from Jesper Juhl.

* Fix missing ICMP time exceeded in the IPv6 defragmentation code, from
Haibo Xi.

* Fix inconsistent handling of routing changes in MASQUERADE for the
new connections case, from Andrew Collins.

* Fix a missing skb_reset_transport in ip[6]t_REJECT that leads to
crashes in the ixgbe driver (since it seems to access the transport
header with TSO enabled), from Mukund Jampala.

* Recover the obsoleted NOTRACK target by including it into the CT and emit
a warning via printk about it being obsoleted. Many people don't check the
scheduled-for-removal file under Documentation, so we follow a
less aggressive approach to kill this in a year or so. Spotted by Florian
Westphal, patch from myself.

* Fix race condition in xt_hashlimit that allows to create two or more
entries, from myself.

* Fix crash if the CT is used due to the recently added facilities to
consult the dying and unconfirmed conntrack lists, from myself.
====================

Signed-off-by: David S. Miller

David S. Miller
2012-12-29 06:28:17 +0800

27 Dec, 2012

2 commits

861aa6d56 ipv4/ip_gre: set transport header correctly to gre header

ipgre_tunnel_xmit() incorrectly sets transport header to inner payload
instead of GRE header. It seems copy-and-pasted from ipip.c.
So set transport header to gre header.
(In ipip case the transport header is the inner ip header, so that's
correct.)

Found by inspection. In practice the incorrect transport header
doesn't matter because the skb usually is sent to another net_device
or socket, so the transport header isn't referenced.

Signed-off-by: Isaku Yamahata
Signed-off-by: David S. Miller

Isaku Yamahata
2012-12-27 07:19:56 +0800

c3ae62af8 tcp: should drop incoming frames without ACK flag set

In commit 96e0bf4b5193d (tcp: Discard segments that ack data not yet
sent) John Dykstra enforced a check against ack sequences.

In commit 354e4aa391ed5 (tcp: RFC 5961 5.2 Blind Data Injection Attack
Mitigation) I added more safety tests.

But we missed the fact that these tests are not performed if the ACK bit
is not set.

RFC 793 3.9 mandates TCP should drop a frame without ACK flag set.

" fifth check the ACK field,
if the ACK bit is off drop the segment and return"

Not doing so permits an attacker to only guess an acceptable sequence
number, evading stronger checks.

Many thanks to Zhiyun Qian for bringing this issue to our attention.

See :
http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf

Reported-by: Zhiyun Qian
Signed-off-by: Eric Dumazet
Cc: Nandita Dukkipati
Cc: Neal Cardwell
Cc: John Dykstra
Signed-off-by: David S. Miller

Eric Dumazet
2012-12-27 07:08:55 +0800

25 Dec, 2012

1 commit

cf0be8805 arp: fix a regression in arp_solicit()

Sedat reported the following commit caused a regression:

commit 9650388b5c56578fdccc79c57a8c82fb92b8e7f1
Author: Eric Dumazet
Date: Fri Dec 21 07:32:10 2012 +0000

ipv4: arp: fix a lockdep splat in arp_solicit

This is because the 6th parameter of arp_send() needs to be NULL
for the broadcast case; the above commit changed it to an all-zero
array by mistake.

Reported-by: Sedat Dilek
Tested-by: Sedat Dilek
Cc: Sedat Dilek
Cc: Eric Dumazet
Cc: David S. Miller
Cc: Julian Anastasov
Signed-off-by: Cong Wang
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Cong Wang
2012-12-25 10:42:58 +0800
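
The difference between passing NULL and passing an all-zero array is easy to miss. The sketch below uses a hypothetical helper, not the kernel's arp_send(), and assumes the callee falls back to the device broadcast address only when the destination hardware address pointer is NULL, as the commit message implies.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the destination hardware address for an ARP request.
 * NULL means "no known address, use broadcast". A non-NULL pointer is
 * used as-is, even if the bytes it points to are all zero.
 */
const uint8_t *pick_dest_hw(const uint8_t *dest_hw, const uint8_t *broadcast)
{
    return dest_hw != NULL ? dest_hw : broadcast;
}
```

With an all-zero array the fallback never triggers, so in this model the request would be addressed to 00:00:00:00:00:00 instead of the broadcast address.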

22 Dec, 2012

3 commits

9650388b5 ipv4: arp: fix a lockdep splat in arp_solicit()

Yan Burman reported following lockdep warning :

=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
(&n->lock){++--..}, at: [] __neigh_event_send+0x2e/0x2f0

but task is already holding lock:
(&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(&n->lock);
lock(&n->lock);

*** DEADLOCK ***

May be due to missing lock nesting notation

4 locks held by swapper/1/0:
#0: (((&n->timer))){+.-...}, at: [] call_timer_fn+0x0/0x1c0
#1: (&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280
#2: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0x5d0
#3: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x13e/0x640

stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
[] validate_chain+0xdcc/0x11f0
[] ? __lock_acquire+0x440/0xc30
[] ? kmem_cache_free+0xe5/0x1c0
[] __lock_acquire+0x440/0xc30
[] ? inet_getpeer+0x40/0x600
[] ? __lock_acquire+0x440/0xc30
[] ? __neigh_event_send+0x2e/0x2f0
[] lock_acquire+0x95/0x140
[] ? __neigh_event_send+0x2e/0x2f0
[] ? __lock_acquire+0x440/0xc30
[] _raw_write_lock_bh+0x3b/0x50
[] ? __neigh_event_send+0x2e/0x2f0
[] __neigh_event_send+0x2e/0x2f0
[] neigh_resolve_output+0x16b/0x270
[] ip_finish_output+0x34d/0x640
[] ? ip_finish_output+0x13e/0x640
[] ? vxlan_xmit+0x556/0xbec [vxlan]
[] ip_output+0x80/0xf0
[] ip_local_out+0x28/0x80
[] vxlan_xmit+0x66a/0xbec [vxlan]
[] ? vxlan_xmit+0x556/0xbec [vxlan]
[] ? skb_gso_segment+0x2b0/0x2b0
[] ? _raw_spin_unlock_irqrestore+0x65/0x80
[] ? dev_queue_xmit_nit+0x207/0x270
[] dev_hard_start_xmit+0x298/0x5d0
[] dev_queue_xmit+0x2f3/0x5d0
[] ? dev_hard_start_xmit+0x5d0/0x5d0
[] arp_xmit+0x58/0x60
[] arp_send+0x3b/0x40
[] arp_solicit+0x204/0x280
[] ? neigh_add+0x310/0x310
[] neigh_probe+0x45/0x70
[] neigh_timer_handler+0x1a0/0x2a0
[] call_timer_fn+0x7f/0x1c0
[] ? detach_if_pending+0x120/0x120
[] run_timer_softirq+0x238/0x2b0
[] ? neigh_add+0x310/0x310
[] __do_softirq+0x101/0x280
[] call_softirq+0x1c/0x30
[] do_softirq+0x85/0xc0
[] irq_exit+0x9e/0xc0
[] smp_apic_timer_interrupt+0x68/0xa0
[] apic_timer_interrupt+0x6f/0x80
[] ? mwait_idle+0xa4/0x1c0
[] ? mwait_idle+0x9b/0x1c0
[] cpu_idle+0x89/0xe0
[] start_secondary+0x1b2/0x1b6

The bug comes from arp_solicit() releasing the neigh lock only after arp_send().
In the case of vxlan, we eventually need to write lock a neigh lock later.

It's a false positive, but we can get rid of it without lockdep
annotations.

We can instead use neigh_ha_snapshot() helper.

Reported-by: Yan Burman
Signed-off-by: Eric Dumazet
Acked-by: Stephen Hemminger
Signed-off-by: David S. Miller

Eric Dumazet
2012-12-22 05:14:07 +0800

f7e75ba17 ip_gre: fix possible use after free

Once skb_realloc_headroom() is called, tiph might point to freed memory.

Cache tiph->ttl value before the reallocation, to avoid unexpected
behavior.

Signed-off-by: Eric Dumazet
Cc: Isaku Yamahata
Signed-off-by: David S. Miller

Eric Dumazet
2012-12-22 05:14:01 +0800
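
The "cache before reallocating" pattern from this commit can be sketched in userspace. Everything here is hypothetical (the struct, the helper names, the use of malloc/free in place of skb_realloc_headroom()); the only fact borrowed from IPv4 is that the TTL is the byte at offset 8 of the header.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TTL_OFFSET 8 /* byte offset of the TTL field in an IPv4 header */

/* Hypothetical flat packet buffer; data must hold at least 9 bytes. */
struct buf {
    uint8_t *data;
    size_t len;
};

/*
 * Move the contents into a new allocation with `extra` zeroed bytes in
 * front and free the old storage. Pointers into the old data dangle
 * afterwards. Returns 0 on success, -1 on allocation failure.
 */
int buf_grow_headroom(struct buf *b, size_t extra)
{
    uint8_t *n = malloc(b->len + extra);

    if (n == NULL)
        return -1;
    memset(n, 0, extra);
    memcpy(n + extra, b->data, b->len);
    free(b->data);
    b->data = n;
    b->len += extra;
    return 0;
}

/* Read the TTL first, then grow; the old header pointer is never reused. */
int grow_and_get_ttl(struct buf *b, size_t extra, uint8_t *ttl_out)
{
    const uint8_t *hdr = b->data;
    uint8_t ttl = hdr[TTL_OFFSET]; /* cached while hdr is still valid */

    if (buf_grow_headroom(b, extra) != 0)
        return -1;
    /* hdr now points to freed memory and must not be dereferenced. */
    *ttl_out = ttl;
    return 0;
}
```

Reading `hdr[TTL_OFFSET]` after the call to buf_grow_headroom() would be the userspace analogue of the bug described above.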

412ed9474 ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally

ipgre_tunnel_xmit() parses network header as IP unconditionally.
But transmitted packets are not always IP packets. For example, such a packet
can be sent by a packet socket with sockaddr_ll.sll_protocol set.
So make the function check if skb->protocol is IP.

Signed-off-by: Isaku Yamahata
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Isaku Yamahata
2012-12-22 05:14:00 +0800

17 Dec, 2012

2 commits

c65ef8dc7 netfilter: nf_nat: Also handle non-ESTABLISHED routing changes in MASQUERADE

Since (a0ecb85 netfilter: nf_nat: Handle routing changes in MASQUERADE
target), the MASQUERADE target handles routing changes which affect
the output interface of a connection, but only for ESTABLISHED
connections. It is also possible for NEW connections which
already have a conntrack entry to be affected by routing changes.

This adds a check to drop entries in the NEW+conntrack state
when the oif has changed.

Signed-off-by: Andrew Collins
Acked-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso

Andrew Collins
2012-12-17 06:28:30 +0800

c6f408996 netfilter: ip[6]t_REJECT: fix wrong transport header pointer in TCP reset

The problem occurs when iptables constructs the tcp reset packet.
It doesn't initialize the pointer to the tcp header within the skb.
When the skb is passed to the ixgbe driver for transmit, the ixgbe
driver attempts to access the tcp header and crashes.
Currently, other drivers (such as our 1G e1000e or igb drivers) don't
access the tcp header on transmit unless the TSO option is turned on.

BUG: unable to handle kernel NULL pointer dereference at 0000000d
IP: [] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
*pdpt = 0000000085e5d001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
[...]
Pid: 0, comm: swapper Tainted: P 2.6.35.12 #1 Greencity/Thurley
EIP: 0060:[] EFLAGS: 00010246 CPU: 16
EIP is at ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
EAX: c7628820 EBX: 00000007 ECX: 00000000 EDX: 00000000
ESI: 00000008 EDI: c6882180 EBP: dfc6b000 ESP: ced95c48
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=ced94000 task=ced73bd0 task.ti=ced94000)
Stack:
cbec7418 c779e0d8 c77cc888 c77cc8a8 0903010a 00000000 c77c0008 00000002
cd4997c0 00000010 dfc6b000 00000000 d0d176c9 c77cc8d8 c6882180 cbec7318
00000004 00000004 cbec7230 cbec7110 00000000 cbec70c0 c779e000 00000002
Call Trace:
[] ? 0xd0d176c9
[] ? 0xd0d18a4d
[] ? dev_hard_start_xmit+0x218/0x2d7
[] ? sch_direct_xmit+0x4b/0x114
[] ? __qdisc_run+0xca/0xe0
[] ? dev_queue_xmit+0x2d1/0x3d0
[] ? neigh_resolve_output+0x1c5/0x20f
[] ? neigh_update+0x29c/0x330
[] ? arp_process+0x49c/0x4cd
[] ? nf_hook_slow+0x3f/0xac
[] ? arp_process+0x0/0x4cd
[] ? arp_process+0x0/0x4cd
[] ? T.901+0x38/0x3b
[] ? arp_rcv+0xa3/0xb4
[] ? arp_process+0x0/0x4cd
[] ? __netif_receive_skb+0x32b/0x346
[] ? netif_receive_skb+0x5a/0x5f
[] ? napi_skb_finish+0x1b/0x30
[] ? ixgbe_xmit_frame_ring+0x1564/0x2260 [ixgbe]
[] ? lapic_next_event+0x13/0x16
[] ? clockevents_program_event+0xd2/0xe4
[] ? net_rx_action+0x55/0x127
[] ? __do_softirq+0x77/0xeb
[] ? do_softirq+0x23/0x27
[] ? do_IRQ+0x7d/0x8e
[] ? common_interrupt+0x29/0x30
[] ? mwait_idle+0x48/0x4d
[] ? cpu_idle+0x37/0x4c
Code: df 09 d7 0f 94 c2 0f b6 d2 e9 e7 fb ff ff 31 db 31 c0 e9 38
ff ff ff 80 78 06 06 0f 85 3e fb ff ff 8b 7c 24 38 8b 8f b8 00 00 00
b6 51 0d f6 c2 01 0f 85 27 fb ff ff 80 e2 02 75 0d 8b 6c 24
EIP: [] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] SS:ESP

Signed-off-by: Mukund Jampala
Signed-off-by: Pablo Neira Ayuso

Mukund Jampala
2012-12-17 06:27:35 +0800

15 Dec, 2012

1 commit

e337e24d6 inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock

If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:

unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] kmemleak_alloc+0x21/0x3e
[] kmem_cache_alloc+0xb5/0xc5
[] sk_prot_alloc.isra.53+0x2b/0xcd
[] sk_clone_lock+0x16/0x21e
[] inet_csk_clone_lock+0x10/0x7b
[] tcp_create_openreq_child+0x21/0x481
[] tcp_v4_syn_recv_sock+0x3a/0x23b
[] tcp_check_req+0x29f/0x416
[] tcp_v4_do_rcv+0x161/0x2bc
[] tcp_v4_rcv+0x6c9/0x701
[] ip_local_deliver_finish+0x70/0xc4
[] ip_local_deliver+0x4e/0x7f
[] ip_rcv_finish+0x1fc/0x233
[] ip_rcv+0x217/0x267
[] __netif_receive_skb+0x49e/0x553
[] netif_receive_skb+0x50/0x82

This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.

This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...

Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.

Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

A similar approach is taken for dccp by calling dccp_done().

This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.

Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Christoph Paasch
2012-12-15 02:14:07 +0800
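
The reference-count reasoning in this commit (a cloned socket starts with a count of 2, so one put cannot free it) can be modelled in a few lines. The struct and function names are hypothetical, and the `freed` flag stands in for the actual release of memory.

```c
#include <stdbool.h>

/* Hypothetical refcounted object; `freed` records that release happened. */
struct obj {
    int refcnt;
    bool freed;
};

/* Model of a clone that, like sk_clone_lock() per the text, starts at 2. */
void obj_init_cloned(struct obj *o)
{
    o->refcnt = 2;
    o->freed = false;
}

/* Drop one reference; returns true if this put released the object. */
bool obj_put(struct obj *o)
{
    if (--o->refcnt == 0) {
        o->freed = true;
        return true;
    }
    return false;
}
```

In this model an error path that calls obj_put() once leaves refcnt at 1 and the object allocated, which matches the leak kmemleak reported.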

14 Dec, 2012

1 commit

a2013a13e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...

Linus Torvalds
2012-12-14 04:00:02 +0800

13 Dec, 2012

1 commit

6be35c700 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking changes from David Miller:

1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.

2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.

3) Networking user namespace support from Eric W. Biederman.

4) tuntap/virtio-net multiqueue support by Jason Wang.

5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.

6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.

7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.

8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.

9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.

10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heuristics were chosen a decade ago.
From Eric Dumazet.

11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.

12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.

13) Add "oops_only" mode to netconsole, from Amerigo Wang.

14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.

15) Support PTP in the Tigon3 driver, from Matt Carlson.

16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.

17) Support per-association statistics in SCTP, from Michele
Baldessari.

And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...

Linus Torvalds
2012-12-13 10:07:07 +0800

11 Dec, 2012

2 commits

4b5511ebc net: remove obsolete simple_strto<foo>

This patch replaces the obsolete simple_strto<foo> with kstrto<foo>.

Signed-off-by: Abhijit Pawar
Acked-by: Neil Horman
Signed-off-by: David S. Miller

Abhijit Pawar
2012-12-11 03:09:00 +0800

1bf3751ec ipv4: ip_check_defrag must not modify skb before unsharing

ip_check_defrag() might be called from af_packet within the
RX path where shared SKBs are used, so it must not modify
the input SKB before it has unshared it for defragmentation.
Use skb_copy_bits() to get the IP header and only pull in
everything later.

The same is true for the other caller in macvlan as it is
called from dev->rx_handler which can also get a shared SKB.

Reported-by: Eric Leblond
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2012-12-11 02:51:44 +0800

10 Dec, 2012

4 commits

5e1f54201 inet_diag: validate port comparison byte code to prevent unsafe reads

Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.

Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-12-10 08:00:48 +0800
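
The length check described here amounts to verifying that two op structs fit in what is left of the bytecode before op[1] is touched. The sketch below is an illustration only: `struct bc_op` is a hypothetical stand-in whose layout is an assumption, and `port_cmp_fits` is not a kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bytecode op; the layout is assumed. */
struct bc_op {
    uint8_t code;
    uint8_t yes;
    uint16_t no;
};

/*
 * A port comparison occupies two consecutive ops: the comparison itself
 * and a follow-on op that carries the port. Reading op[1] is only safe
 * if both fit in the remaining bytecode.
 */
bool port_cmp_fits(size_t remaining_len)
{
    return remaining_len >= 2 * sizeof(struct bc_op);
}
```

A caller would run this check before dereferencing op[1] and reject the whole bytecode program when it fails.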

f67caec90 inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()

Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).

This change is needed for two reasons:

(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.

(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-12-10 07:59:37 +0800
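
The two reasons given above translate into two guards in front of a bitwise prefix comparison. This is a simplified sketch with hypothetical function names; it omits the IPv4-mapped IPv6 special case that the commit keeps, and it assumes AF_INET and AF_INET6 from <sys/socket.h>.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Compare the first `bits` bits of two byte strings. */
static bool bits_equal(const uint8_t *a, const uint8_t *b, unsigned int bits)
{
    unsigned int full = bits / 8; /* whole bytes covered by the prefix */
    unsigned int rem = bits % 8;  /* leftover high-order bits */

    if (memcmp(a, b, full) != 0)
        return false;
    if (rem != 0) {
        uint8_t mask = (uint8_t)(0xFFu << (8 - rem));
        return ((a[full] ^ b[full]) & mask) == 0;
    }
    return true;
}

/*
 * Prefix match that refuses to compare across address families and
 * refuses prefixes longer than the address itself.
 */
bool prefix_match(int cond_family, const uint8_t *cond_addr,
                  unsigned int prefix_len,
                  int entry_family, const uint8_t *entry_addr)
{
    unsigned int max_bits;

    if (cond_family != entry_family)
        return false; /* reason (2): the comparison is not meaningful */

    max_bits = cond_family == AF_INET ? 32 :
               cond_family == AF_INET6 ? 128 : 0;
    if (prefix_len > max_bits)
        return false; /* reason (1): would read past the address */

    return bits_equal(cond_addr, entry_addr, prefix_len);
}
```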

405c00594 inet_diag: validate byte code to prevent oops in inet_diag_bc_run()

Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.

Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-12-10 07:59:37 +0800
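
The prefix-length part of this validation can be sketched as a bound check against the address size of the given family. The function names are hypothetical, and as a simplification this sketch rejects every family other than AF_INET and AF_INET6, which may be stricter than what the kernel accepts.

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Address length in bytes for a family, or 0 if not handled here. */
static unsigned int addr_len_for_family(int family)
{
    switch (family) {
    case AF_INET:
        return 4;
    case AF_INET6:
        return 16;
    default:
        return 0;
    }
}

/* A prefix length is valid only if it does not exceed the address width. */
bool hostcond_prefix_ok(int family, unsigned int prefix_len)
{
    unsigned int addr_len = addr_len_for_family(family);

    if (addr_len == 0)
        return false;
    return prefix_len <= addr_len * 8;
}
```

With this bound in place, a prefix length such as 255 on an IPv4 address is rejected up front instead of driving a comparison past the end of the 4-byte address.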

1c95df85c inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state

Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.

With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.

Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Neal Cardwell
2012-12-10 07:59:37 +0800