Doug / smarc-fsl-linux-kernel | Embedian Git Server

20 Sep, 2013

3 commits

b75ff5e84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) If the local_df boolean is set on an SKB we have to allocate a
unique ID even if IP_DF is set in the ipv4 headers, from Ansis
Atteka.

2) Some fixups for the new chipset support that went into the sfc
driver, from Ben Hutchings.

3) Because SCTP bypasses a good chunk of, and actually duplicates, the
logic of the ipv6 output path, some IPSEC things don't get done
properly. Integrate SCTP better into the ipv6 output path so that
these problems are fixed and such issues don't get missed in the
future either. From Daniel Borkmann.

4) Fix skge regressions added by the DMA mapping error return checking
added in v3.10, from Mikulas Patocka.

5) Kill some more IRQF_DISABLED references, from Michael Opdenacker.

6) Fix races and deadlocks in the bridging code, from Hong Zhiguo.

7) Fix error handling in tun_set_iff(), in particular don't leak
resources. From Jason Wang.

8) Prevent format-string injection into xen-netback driver, from Kees
Cook.

9) Fix regression added to netpoll ARP packet handling, in particular
check for the right ETH_P_ARP protocol code. From Sonic Zhang.

10) Try to deal with AMD IOMMU errors when using r8169 chips, from
Francois Romieu.

11) Cure freezes due to recent changes in the rt2x00 wireless driver,
from Stanislaw Gruszka.

12) Don't do SPI transfers (which can sleep) in interrupt context in
cw1200 driver, from Solomon Peachy.

13) Fix LEDs handling bug in 5720 tg3 chips already handled for 5719.
From Nithin Sujir.

14) Make xen_netbk_count_skb_slots() count the actual number of slots
that will be used, taking into consideration packing and other
issues that the transmit path will run into. From David Vrabel.

15) Use the correct maximum age when calculating the bridge
message_age_timer, from Chris Healy.

16) Get rid of memory leaks in mcs7780 IRDA driver, from Alexey
Khoroshilov.

17) Netfilter conntrack extensions were converted to RCU but are not
always freed properly using kfree_rcu(). Fix from Michal Kubecek.

18) VF reset recovery not being done correctly in qlcnic driver, from
Manish Chopra.

19) Fix inverted test in ATM nicstar driver, from Andy Shevchenko.

20) Missing workqueue destroy in cxgb4 error handling, from Wei Yang.

21) Internal switch not initialized properly in bgmac driver, from Rafał
Miłecki.

22) Netlink messages report wrong local and remote addresses in IPv6
tunneling, from Ding Zhi.

23) ICMP redirects should not generate socket errors in DCCP and SCTP.
We're still working out how this should be handled for RAW and UDP
sockets. From Daniel Borkmann and Duan Jiong.

24) We've had several bugs wherein the network namespace's loopback
device gets accessed after it is free'd, NULL it out so that we can
catch these problems more readily. From Eric W Biederman.

25) Fix regression in TCP RTO calculations, from Neal Cardwell.

26) Fix too early free of xen-netback network device when VIFs still
exist. From Paul Durrant.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
netconsole: fix a deadlock with rtnl and netconsole's mutex
netpoll: fix NULL pointer dereference in netpoll_cleanup
skge: fix broken driver
ip: generate unique IP identificator if local fragmentation is allowed
ip: use ip_hdr() in __ip_make_skb() to retrieve IP header
xen-netback: Don't destroy the netdev until the vif is shut down
net:dccp: do not report ICMP redirects to user space
cnic: Fix crash in cnic_bnx2x_service_kcq()
bnx2x, cnic, bnx2i, bnx2fc: Fix bnx2i and bnx2fc regressions.
vxlan: Avoid creating fdb entry with NULL destination
tcp: fix RTO calculated from cached RTT
drivers: net: phy: cicada.c: clears warning Use #include instead of
net loopback: Set loopback_dev to NULL when freed
batman-adv: set the TAG flag for the vid passed to BLA
netfilter: nfnetlink_queue: use network skb for sequence adjustment
net: sctp: rfc4443: do not report ICMP redirects to user space
net: usb: cdc_ether: use usb.h macros whenever possible
net: usb: cdc_ether: fix checkpatch errors and warnings
net: usb: cdc_ether: Use wwan interface for Telit modules
ip6_tunnels: raddr and laddr are inverted in nl msg
...

Linus Torvalds
2013-09-20 02:57:28 +0800
703133de3 ip: generate unique IP identificator if local fragmentation is allowed ... Browse Code »

If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.

For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.

Signed-off-by: Ansis Atteka
Signed-off-by: David S. Miller

Ansis Atteka
2013-09-20 02:11:15 +0800
749154aa5 ip: use ip_hdr() in __ip_make_skb() to retrieve IP header ... Browse Code »

skb->data already points to IP header, but for the sake of
consistency we can also use ip_hdr() to retrieve it.

Signed-off-by: Ansis Atteka
Signed-off-by: David S. Miller

Ansis Atteka
2013-09-20 02:11:15 +0800

18 Sep, 2013

1 commit

269aa759b tcp: fix RTO calculated from cached RTT ... Browse Code »

Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation")
did not correctly account for the fact that crtt is the RTT shifted
left 3 bits. Fix the calculation to consistently reflect this fact.

Signed-off-by: Neal Cardwell
Cc: Eric Dumazet
Cc: Yuchung Cheng
Acked-by: Eric Dumazet
Acked-By: Yuchung Cheng
Signed-off-by: David S. Miller

Neal Cardwell
2013-09-18 07:08:08 +0800

13 Sep, 2013

1 commit

6de5a8bfc memcg: rename RESOURCE_MAX to RES_COUNTER_MAX ... Browse Code »

RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju
Signed-off-by: Qiang Huang
Acked-by: Michal Hocko
Cc: Daisuke Nishimura
Cc: Jeff Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sha Zhengju
2013-09-13 06:38:02 +0800

07 Sep, 2013

2 commits

4e4f1fc22 tcp: properly increase rcv_ssthresh for ofo packets ... Browse Code »

TCP receive window handling is multi staged.

A socket has a memory budget, static or dynamic, in sk_rcvbuf.

Because we do not really know how this memory budget translates to
a TCP window (payload), TCP announces a small initial window
(about 20 MSS).

When a packet is received, we increase TCP rcv_win depending
on the payload/truesize ratio of this packet. Good citizen
packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2

This heuristic takes place in tcp_grow_window()

Problem is : We currently call tcp_grow_window() only for in-order
packets.

This means that reorders or packet losses stop proper grow of
rcv_win, and senders are unable to benefit from fast recovery,
or proper reordering level detection.

Really, a packet being stored in OFO queue is not a bad citizen.
It should be part of the game as in-order packets.

In our traces, we very often see sender is limited by linux small
receive windows, even if linux hosts use autotuning (DRS) and should
allow rcv_win to grow to ~3MB.

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2013-09-07 02:43:49 +0800
16edfe7ee tcp: fix no cwnd growth after timeout ... Browse Code »

In commit 0f7cc9a3 "tcp: increase throughput when reordering is high",
it only allows cwnd to increase in Open state. This mistakenly disables
slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
state moves from Disorder to Open later in tcp_fastretrans_alert().

Therefore the correct logic should be to allow cwnd to grow as long
as the data is received in order in Open, Loss, or even Disorder state.

Signed-off-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Yuchung Cheng
2013-09-07 02:43:49 +0800

06 Sep, 2013

3 commits

cc998ff88 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking changes from David Miller:
"Noteworthy changes this time around:

1) Multicast rejoin support for team driver, from Jiri Pirko.

2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the later because there are broken middleware devices which
scramble the timestamp.

From Yuchung Cheng.

3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsend user data. From Eric Dumazet.

4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.

5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.

6) Add a networking development FAQ, from Paul Gortmaker.

7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.

8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.

9) Add SCTP support to openvswitch, from Joe Stringer.

10) Add EF10 chip support to SFC driver, from Ben Hutchings.

11) Add new SYNPROXY netfilter target, from Patrick McHardy.

12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.

13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.

14) Add ipv6 support to vxlan driver, from Cong Wang"

Resolved conflicts as per discussion.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...

Linus Torvalds
2013-09-06 05:54:29 +0800
06c54055b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c

The conflicts were minor:

1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.

3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.

Signed-off-by: David S. Miller

David S. Miller
2013-09-06 02:58:52 +0800
e2e5c4c07 tcp: Add missing braces to do_tcp_setsockopt ... Browse Code »

Signed-off-by: Dave Jones
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Dave Jones
2013-09-06 02:31:02 +0800

05 Sep, 2013

3 commits

27703bb4a Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux ... Browse Code »

Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.

This has been sitting in linux-next for a whole cycle"

[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that

#define PTR_RET(p) PTR_ERR_OR_ZERO(p)

thing in , so they continue to work for now - Linus ]

* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO

Linus Torvalds
2013-09-05 08:31:11 +0800
52f20e655 tcp: better comments for RTO initiallization ... Browse Code »

Commit 1b7fdd2ab585("tcp: do not use cached RTT for RTT estimation")
removes important comments on how RTO is initialized and updated.
Hopefully this patch puts those information back.

Signed-off-by: Yuchung Cheng
Acked-by: Neal Cardwell
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2013-09-05 02:41:55 +0800
48f8e0af8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next ... Browse Code »

Pablo Neira Ayuso says:

====================
The following batch contains:

* Three fixes for the new synproxy target available in your
net-next tree, from Jesper D. Brouer and Patrick McHardy.

* One fix for TCPMSS to correctly handling the fragmentation
case, from Phil Oester. I'll pass this one to -stable.
====================

Signed-off-by: David S. Miller

David S. Miller
2013-09-05 00:28:02 +0800

04 Sep, 2013

10 commits

7cc9eb6ef netfilter: SYNPROXY: let unrelated packets continue ... Browse Code »

Packets reaching SYNPROXY were default dropped, as they were most
likely invalid (given the recommended state matching). This
patch, changes SYNPROXY target to let packets, not consumed,
continue being processed by the stack.

This will be more in line other target modules. As it will allow
more flexible configurations of handling, logging or matching on
packets in INVALID states.

Signed-off-by: Jesper Dangaard Brouer
Acked-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso

Jesper Dangaard Brouer
2013-09-04 17:44:23 +0800
775ada6d9 netfilter: more strict TCP flag matching in SYNPROXY ... Browse Code »

Its seems Patrick missed to incoorporate some of my requested changes
during review v2 of SYNPROXY netfilter module.

Which were, to avoid SYN+ACK packets to enter the path, meant for the
ACK packet from the client (from the 3WHS).

Further there were a bug in ip6t_SYNPROXY.c, for matching SYN packets
that didn't exclude the ACK flag.

Go a step further with SYN packet/flag matching by excluding flags
ACK+FIN+RST, in both IPv4 and IPv6 modules.

The intented usage of SYNPROXY is as follows:
(gracefully describing usage in commit)

iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
iptables -A INPUT -i eth0 -p tcp --dport 80 -m state UNTRACKED,INVALID \
-j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

This does filter SYN flags early, for packets in the UNTRACKED state,
but packets in the INVALID state with other TCP flags could still
reach the module, thus this stricter flag matching is still needed.

Signed-off-by: Jesper Dangaard Brouer
Acked-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso

Jesper Dangaard Brouer
2013-09-04 17:43:11 +0800
c995ae225 tcp: Change return value of tcp_rcv_established() ... Browse Code »

tcp_rcv_established() returns only one value namely 0. We change the return
value to void (as suggested by David Miller).

After commit 0c24604b (tcp: implement RFC 5961 4.2), we no longer send RSTs in
response to SYNs. We can remove the check and processing on the return value of
tcp_rcv_established().

We also fix jtcp_rcv_established() in tcp_probe.c to match that of
tcp_rcv_established().

Signed-off-by: Vijay Subramanian
Signed-off-by: David S. Miller

Vijay Subramanian
2013-09-04 12:27:28 +0800
cc8c6c1b2 net: tcp_probe: adapt tbuf size for recent changes ... Browse Code »

With recent changes in tcp_probe module (e.g. f925d0a62d ("net: tcp_probe:
add IPv6 support")) we also need to take into account that tbuf needs to
be updated as format string will be further expanded. tbuf sits on the stack
in tcpprobe_read() function that is invoked when user space reads procfs
file /proc/net/tcpprobe, hence not fast path as in jtcp_rcv_established().
Having a size similarly as in sctp_probe module of 256 bytes is fully
sufficient for that, we need theoretical maximum of 252 bytes otherwise we
could get truncated.

Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2013-09-04 12:27:28 +0800
ea23192e8 tunnels: harmonize cleanup done on skb on rx path ... Browse Code »

The goal of this patch is to harmonize cleanup done on a skbuff on rx path.
Before this patch, behaviors were different depending of the tunnel type.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-09-04 12:27:26 +0800
963a88b31 tunnels: harmonize cleanup done on skb on xmit path ... Browse Code »

The goal of this patch is to harmonize cleanup done on a skbuff on xmit path.
Before this patch, behaviors were different depending of the tunnel type.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-09-04 12:27:25 +0800
8b27f2779 skb: allow skb_scrub_packet() to be used by tunnels ... Browse Code »

This function was only used when a packet was sent to another netns. Now, it can
also be used after tunnel encapsulation or decapsulation.

Only skb_orphan() should not be done when a packet is not crossing netns.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-09-04 12:27:25 +0800
8b7ed2d91 iptunnels: remove net arg from iptunnel_xmit() ... Browse Code »

This argument is not used, let's remove it.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-09-04 12:27:25 +0800
3e25c65ed net: neighbour: Remove CONFIG_ARPD ... Browse Code »

This config option is superfluous in that it only guards a call
to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
change in behavior. There will now be call to __neigh_notify()
for each ARP resolution, which has no impact unless there is a
user space daemon waiting to receive the notification, i.e.,
the case for which CONFIG_ARPD was designed anyways.

Suggested-by: Eric W. Biederman
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: "Eric W. Biederman"
Cc: Gao feng
Cc: Joe Perches
Cc: Veaceslav Falico
Signed-off-by: Tim Gardner
Reviewed-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Tim Gardner
2013-09-04 09:41:43 +0800
32dad03d1 Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"A lot of activities on the cgroup front. Most changes aren't visible
to userland at all at this point and are laying foundation for the
planned unified hierarchy.

- The biggest change is decoupling the lifetime management of css
(cgroup_subsys_state) from that of cgroup's. Because controllers
(cpu, memory, block and so on) will need to be dynamically enabled
and disabled, css which is the association point between a cgroup
and a controller may come and go dynamically across the lifetime of
a cgroup. Till now, css's were created when the associated cgroup
was created and stayed till the cgroup got destroyed.

Assumptions around this tight coupling permeated through cgroup
core and controllers. These assumptions are gradually removed,
which consists bulk of patches, and css destruction path is
completely decoupled from cgroup destruction path. Note that
decoupling of creation path is relatively easy on top of these
changes and the patchset is pending for the next window.

- cgroup has its own event mechanism cgroup.event_control, which is
only used by memcg. It is overly complex trying to achieve high
flexibility whose benefits seem dubious at best. Going forward,
new events will simply generate file modified event and the
existing mechanism is being made specific to memcg. This pull
request contains prepatory patches for such change.

- Various fixes and cleanups"

Fixed up conflict in kernel/cgroup.c as per Tejun.

* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
cgroup: fix cgroup_css() invocation in css_from_id()
cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
cgroup: implement CFTYPE_NO_PREFIX
cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
cgroup: fix cgroup_write_event_control()
cgroup: fix subsystem file accesses on the root cgroup
cgroup: change cgroup_from_id() to css_from_id()
cgroup: use css_get() in cgroup_create() to check CSS_ROOT
cpuset: remove an unncessary forward declaration
cgroup: RCU protect each cgroup_subsys_state release
cgroup: move subsys file removal to kill_css()
cgroup: factor out kill_css()
cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
cgroup: replace cgroup->css_kill_cnt with ->nr_css
cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
cgroup: move cgroup->subsys[] assignment to online_css()
cgroup: reorganize css init / exit paths
cgroup: add __rcu modifier to cgroup->subsys[]
...

Linus Torvalds
2013-09-04 09:25:03 +0800

03 Sep, 2013

1 commit

5a17a390d net: make snmp_mib_free static inline ... Browse Code »

Fengguang reported:

net/built-in.o: In function `in6_dev_finish_destroy':
(.text+0x4ca7d): undefined reference to `snmp_mib_free'

this is due to snmp_mib_free() is defined when CONFIG_INET is enabled,
but in6_dev_finish_destroy() is now moved to core kernel.

I think snmp_mib_free() is small enough to be inlined, so just make it
static inline.

Reported-by: kbuild test robot
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2013-09-03 12:00:50 +0800

01 Sep, 2013

1 commit

eb3c0d83c net: unify skb_udp_tunnel_segment() and skb_udp6_tunnel_segment() ... Browse Code »

As suggested by Pravin, we can unify the code in case of duplicated
code.

Cc: Pravin Shelar
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2013-09-01 10:30:01 +0800

31 Aug, 2013

3 commits

737e828bd ipv4 tunnels: fix an oops when using ipip/sit with IPsec ... Browse Code »

Since commit 3d7b46cd20e3 (ip_tunnel: push generic protocol handling to
ip_tunnel module.), an Oops is triggered when an xfrm policy is configured on
an IPv4 over IPv4 tunnel.

xfrm4_policy_check() calls __xfrm_policy_check2(), which uses skb_dst(skb). But
this field is NULL because iptunnel_pull_header() calls skb_dst_drop(skb).

Signed-off-by: Li Hongjun
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Li Hongjun
2013-08-31 05:13:28 +0800
eb8895deb tcp: tcp_make_synack() should use sock_wmalloc ... Browse Code »

In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed
the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so,
the netfilter owner match lost its ability to block the SYNACK packet on
outbound listening sockets. Revert the change, restoring the owner match
functionality.

This closes netfilter bugzilla #847.

Signed-off-by: Phil Oester
Signed-off-by: David S. Miller

Phil Oester
2013-08-31 04:02:04 +0800
1b7fdd2ab tcp: do not use cached RTT for RTT estimation ... Browse Code »

RTT cached in the TCP metrics are valuable for the initial timeout
because SYN RTT usually does not account for serialization delays
on low BW path.

However using it to seed the RTT estimator maybe disruptive because
other components (e.g., pacing) require the smooth RTT to be obtained
from actual connection.

The solution is to use the higher cached RTT to set the first RTO
conservatively like tcp_rtt_estimator(), but avoid seeding the other
RTT estimator variables such as srtt. It is also a good idea to
keep RTO conservative to obtain the first RTT sample, and the
performance is insured by TCP loss probe if SYN RTT is available.

To keep the seeding formula consistent across SYN RTT and cached RTT,
the rttvar is twice the cached RTT instead of cached RTTVAR value. The
reason is because cached variation may be too small (near min RTO)
which defeats the purpose of being conservative on first RTO. However
the metrics still keep the RTT variations as they might be useful for
user applications (through ip).

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Tested-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2013-08-31 03:14:38 +0800

30 Aug, 2013

5 commits

79f9ab7e0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec ... Browse Code »

Steffen Klassert says:

====================
This pull request fixes some issues that arise when 6in4 or 4in6 tunnels
are used in combination with IPsec, all from Hannes Frederic Sowa and a
null pointer dereference when queueing packets to the policy hold queue.

1) We might access the local error handler of the wrong address family if
6in4 or 4in6 tunnel is protected by ipsec. Fix this by addind a pointer
to the correct local_error to xfrm_state_afinet.

2) Add a helper function to always refer to the correct interpretation
of skb->sk.

3) Call skb_reset_inner_headers to record the position of the inner headers
when adding a new one in various ipv6 tunnels. This is needed to identify
the addresses where to send back errors in the xfrm layer.

4) Dereference inner ipv6 header if encapsulated to always call the
right error handler.

5) Choose protocol family by skb protocol to not call the wrong
xfrm{4,6}_local_error handler in case an ipv6 sockets is used
in ipv4 mode.

6) Partly revert "xfrm: introduce helper for safe determination of mtu"
because this introduced pmtu discovery problems.

7) Set skb->protocol on tcp, raw and ip6_append_data genereated skbs.
We need this to get the correct mtu informations in xfrm.

8) Fix null pointer dereference in xdst_queue_output.
====================

Signed-off-by: David S. Miller

David S. Miller
2013-08-30 04:05:30 +0800
c27c9322d ipv4: sendto/hdrincl: don't use destination address found in header ... Browse Code »

ipv4: raw_sendmsg: don't use header's destination address

A sendto() regression was bisected and found to start with commit
f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.)

The problem is that it tries to ARP-lookup the constructed packet's
destination address rather than the explicitly provided address.

Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.

cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b

Reported-by: Chris Clark
Bisected-by: Chris Clark
Tested-by: Chris Clark
Suggested-by: Julian Anastasov
Signed-off-by: Chris Clark
Signed-off-by: David S. Miller

Chris Clark
2013-08-30 03:57:52 +0800
95bd09eb2 tcp: TSO packets automatic sizing ... Browse Code »

After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt

v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs

Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Yuchung Cheng
Cc: Van Jacobson
Cc: Tom Herbert
Acked-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2013-08-30 03:50:06 +0800
e3e120283 tcp: don't apply tsoffset if rcv_tsecr is zero ... Browse Code »

The zero value means that tsecr is not valid, so it's a special case.

tsoffset is used to customize tcp_time_stamp for one socket.
tsoffset is usually zero, it's used when a socket was moved from one
host to another host.

Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
TCP_RTO_MAX.

Cc: Pavel Emelyanov
Cc: Eric Dumazet
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Reported-by: Cyrill Gorcunov
Signed-off-by: Andrey Vagin
Signed-off-by: David S. Miller

Andrew Vagin
2013-08-30 03:11:12 +0800
c7781a6e3 tcp: initialize rcv_tstamp for restored sockets ... Browse Code »

u32 rcv_tstamp; /* timestamp of last received ACK */

Its value used in tcp_retransmit_timer, which closes socket
if the last ack was received more then TCP_RTO_MAX ago.

Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer
is called before receiving a first ack, the connection is closed.

This patch initializes rcv_tstamp to a timestamp, when a socket was
restored.

Cc: Pavel Emelyanov
Cc: Eric Dumazet
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Reported-by: Cyrill Gorcunov
Signed-off-by: Andrey Vagin
Signed-off-by: David S. Miller

Andrew Vagin
2013-08-30 03:11:11 +0800

28 Aug, 2013

5 commits

48b1de4c1 netfilter: add SYNPROXY core/target ... Browse Code »

Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.

Signed-off-by: Patrick McHardy
Tested-by: Martin Topholm
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Pablo Neira Ayuso

Patrick McHardy
2013-08-28 06:27:54 +0800
0198230b7 net: syncookies: export cookie_v4_init_sequence/cookie_v4_check ... Browse Code »

Extract the local TCP stack independant parts of tcp_v4_init_sequence()
and cookie_v4_check() and export them for use by the upcoming SYNPROXY
target.

Signed-off-by: Patrick McHardy
Acked-by: David S. Miller
Tested-by: Martin Topholm
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Pablo Neira Ayuso

Patrick McHardy
2013-08-28 06:27:44 +0800
41d73ec05 netfilter: nf_conntrack: make sequence number adjustments usuable without NAT ... Browse Code »

Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.

Signed-off-by: Patrick McHardy
Tested-by: Martin Topholm
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Pablo Neira Ayuso

Patrick McHardy
2013-08-28 06:26:48 +0800
affe759db netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged ... Browse Code »

As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
with the tcp-reset option sends out reset packets with the src MAC address
of the local bridge interface, instead of the MAC address of the intended
destination. This causes some routers/firewalls to drop the reset packet
as it appears to be spoofed. Fix this by bypassing ip[6]_local_out and
setting the MAC of the sender in the tcp reset packet.

This closes netfilter bugzilla #531.

Signed-off-by: Phil Oester
Signed-off-by: Pablo Neira Ayuso

Phil Oester
2013-08-28 06:13:12 +0800
b1dcdc68b net: tcp_probe: allow more advanced ingress filtering by mark ... Browse Code »

Currently, the tcp_probe snooper can either filter packets by a given
port (handed to the module via module parameter e.g. port=80) or lets
all TCP traffic pass (port=0, default). When a port is specified, the
port number is tested against the sk's source/destination port. Thus,
if one of them matches, the information will be further processed for
the log.

As this is quite limited, allow for more advanced filtering possibilities
which can facilitate debugging/analysis with the help of the tcp_probe
snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e
("pkt_sched: ingress socket filter by mark"), add the possibility to
use skb->mark as a filter.

If the mark is not being used otherwise, this allows ingress filtering
by flow (e.g. in order to track updates from only a single flow, or a
subset of all flows for a given port) and other things such as dynamic
logging and reconfiguration without removing/re-inserting the tcp_probe
module, etc. Simple example:

insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
...
iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
--sport 60952 -j MARK --set-mark 8888
[... sampling interval ...]
iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
--sport 60952 -j MARK --set-mark 8888

The current option to filter by a given port is still being preserved. A
similar approach could be done for the sctp_probe module as a follow-up.

Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2013-08-28 03:53:34 +0800

27 Aug, 2013

1 commit

b05930f5d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/wireless/iwlwifi/pcie/trans.c
include/linux/inetdevice.h

The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.

The iwlwifi conflict is a context overlap.

Signed-off-by: David S. Miller

David S. Miller
2013-08-27 04:37:08 +0800

26 Aug, 2013

1 commit

5a25cf1e3 xfrm: revert ipv4 mtu determination to dst_mtu ... Browse Code »

In commit 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce
helper for safe determination of mtu") I switched the determination of
ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in
case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is
never correct for ipv4 ipsec.

This patch partly reverts 0ea9d5e3e0e03a63b11392f5613378977dae7eca
("xfrm: introduce helper for safe determination of mtu").

Cc: Steffen Klassert
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: Steffen Klassert

Hannes Frederic Sowa
2013-08-26 18:40:53 +0800