14 Sep, 2013

1 commit

  • [ Upstream commit 7ed5c5ae96d23da22de95e1c7a239537acd378b1 ]

    When repair mode is turned off, the write queue seqs are
    updated so that the whole queue is considered 'already sent'.

    The "when" field must be set for such skbs. It is used in tcp_rearm_rto,
    for example. If the "when" field isn't set, the retransmit timeout can
    be calculated incorrectly and a tcp connection can stall for up to two
    minutes (TCP_RTO_MAX).
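
    For context, a minimal user-space sketch of toggling repair mode, the
    operation whose exit path this fixes (hedged: error handling trimmed,
    CAP_NET_ADMIN required, fallback constant from <linux/tcp.h>):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_REPAIR
    #define TCP_REPAIR 19   /* from <linux/tcp.h> */
    #endif

    /* Leaving repair mode makes the kernel mark the restored write
     * queue as "already sent"; the bug was that the skb "when"
     * timestamps were left unset on that path. */
    static int tcp_repair_off(int fd)
    {
        int off = 0;

        return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
    }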

    Acked-by: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Vagin
     

17 May, 2013

1 commit

  • The GSO TCP handler has the following issues:

    1) ooo_okay from the original GSO packet is duplicated to all segments
    2) all segments but the last are orphaned, so the transmit path cannot
    get the transmit queue number from the socket. This happens if GSO
    segmentation is done before a stacked device, for example.

    The result is that packets from a given TCP flow can be sent to
    different TX queues (if using multiqueue NICs). This generates
    out-of-order problems and spurious SACKs & retransmits.

    Fix this by keeping the socket pointer set for all segments.

    This means that every segment must also have a destructor, and the
    original gso skb truesize must be split across all segments, to keep
    precise sk->sk_wmem_alloc accounting.
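
    A standalone model of that accounting (illustrative only, not the
    kernel code; names are invented):

    struct seg {
        struct seg *next;
        unsigned int truesize;
        void *sk;                          /* owning socket */
        void (*destructor)(struct seg *);  /* e.g. a tcp_wfree analogue */
    };

    /* Give every segment the original socket and destructor, and split
     * the original truesize so sum(seg->truesize) == gso->truesize. */
    static void split_ownership(struct seg *gso, struct seg *segs,
                                unsigned int mss)
    {
        for (struct seg *s = segs; s; s = s->next) {
            s->sk = gso->sk;
            s->destructor = gso->destructor;
            if (s->next) {
                s->truesize = mss;           /* mss of truesize each... */
                gso->truesize -= mss;
            } else {
                s->truesize = gso->truesize; /* ...remainder on the last */
            }
        }
    }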

    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Tom Herbert
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 May, 2013

1 commit

  • TCP md5 communications fail [1] on some devices, because the sg/crypto
    code assumes page offsets are below PAGE_SIZE.

    This was discovered using the mlx4 driver [2], but I suspect loopback
    might trigger the same bug now that we use order-3 pages in
    tcp_sendmsg().

    [1] The failure produces messages like:

    huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100,
    exited with 00000101?

    [2] the mlx4 driver uses order-2 pages to allocate RX frags
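
    The underlying arithmetic, as a standalone sketch (the idea, not the
    actual patch): fold whole pages contained in a large fragment offset
    into the page pointer, so downstream code only sees in-page offsets:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    struct frag_ref {
        uintptr_t page;        /* address of the first order-0 page */
        unsigned long offset;  /* may exceed PAGE_SIZE on order-N pages */
    };

    static void normalize_frag(struct frag_ref *f)
    {
        f->page   += (f->offset >> PAGE_SHIFT) << PAGE_SHIFT;
        f->offset &= PAGE_SIZE - 1;    /* now always below PAGE_SIZE */
    }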

    Reported-by: Matt Schnall
    Signed-off-by: Eric Dumazet
    Cc: Bernhard Beck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2013

1 commit

  • I noticed that TSQ (TCP Small Queues) was less effective when TSO is
    turned off and GSO is on. If BQL is not enabled, TSQ then has no
    effect.

    It turns out the GSO engine frees the original gso_skb at the time the
    fragments are generated and queued to the NIC.

    We should instead call the tcp_wfree() destructor for the last fragment,
    to keep the flow control as intended in TSQ. This effectively limits
    the number of queued packets on qdisc + NIC layers.
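
    A standalone sketch of the idea (illustrative only, not the kernel
    code):

    struct pkt {
        void *sk;                         /* owning socket */
        void (*destructor)(struct pkt *);
    };

    /* Hand the flow-control destructor (the tcp_wfree analogue) from
     * the original gso skb to the last fragment, so it fires when the
     * NIC frees that fragment rather than at segmentation time. */
    static void handoff_destructor(struct pkt *gso, struct pkt *last)
    {
        last->sk = gso->sk;
        last->destructor = gso->destructor;
        gso->sk = NULL;
        gso->destructor = NULL; /* original is now freed silently */
    }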

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2013

1 commit

  • TCPCT uses option number 253, which is reserved for experimental use
    and should not be used in production environments.
    Further, TCPCT does not fully implement RFC 6013.

    As a nice side effect, removing TCPCT increases TCP's performance for
    very short flows:

    An ApacheBench run with -c 100 -n 100000, sending HTTP requests for
    files of 1 KB size, gives:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second (about 2.7% more)

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

14 Mar, 2013

1 commit

  • The Chrome OS team reported a crash on a Pixel ChromeBook in the TCP
    stack:

    https://code.google.com/p/chromium/issues/detail?id=182056

    commit a21d45726acac (tcp: avoid order-1 allocations on wifi and tx
    path) made a poor choice in adding an 'avail_size' field to the skb,
    while what we really needed was a 'reserved_tailroom' one.

    That would have avoided commit 22b4a4f22da (tcp: fix retransmit of
    partially acked frames) and this commit.

    The crash occurs because skb_split() is not aware of the 'avail_size'
    management (and should not be aware).
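
    A standalone model of why 'reserved_tailroom' is the better primitive
    (illustrative only; the reservation is subtracted in a single helper,
    so generic code such as skb_split() never has to know about it):

    struct buf {
        unsigned int end, tail;
        unsigned int reserved_tailroom; /* kept free, e.g. for trailers */
    };

    static unsigned int availroom(const struct buf *b)
    {
        return b->end - b->tail - b->reserved_tailroom;
    }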

    Signed-off-by: Eric Dumazet
    Reported-by: Mukesh Agrawal
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Mar, 2013

1 commit

  • This adds generic tunneling offload support for IPv4-UDP based
    tunnels.
    A GSO type is added to request this offload for an skb.
    A netdev feature, NETIF_F_UDP_TUNNEL, is added for hardware-offloaded
    udp-tunnel support. Currently no device supports this feature,
    so software offload is used.

    This can be used by tunneling protocols like VXLAN.
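
    A fragment of how a tunnel driver requests this offload (hedged;
    surrounding code omitted):

    /* mark the skb so skb_gso_segment() or NETIF_F_UDP_TUNNEL-capable
     * hardware treats it as a UDP-encapsulated tunnel frame */
    skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;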

    CC: Jesse Gross
    Signed-off-by: Pravin B Shelar
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

27 Feb, 2013

1 commit

  • Pull slave-dmaengine updates from Vinod Koul:
    "This is fairly big pull by my standards as I had missed last merge
    window. So we have the support for device tree for slave-dmaengine,
    large updates to dw_dmac driver from Andy for reusing on different
    architectures. Along with this we have fixes on bunch of the drivers"

    Fix up trivial conflicts, usually due to #include line movement next to
    each other.

    * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (111 commits)
    Revert "ARM: SPEAr13xx: Pass DW DMAC platform data from DT"
    ARM: dts: pl330: Add #dma-cells for generic dma binding support
    DMA: PL330: Register the DMA controller with the generic DMA helpers
    DMA: PL330: Add xlate function
    DMA: PL330: Add new pl330 filter for DT case.
    dma: tegra20-apb-dma: remove unnecessary assignment
    edma: do not waste memory for dma_mask
    dma: coh901318: set residue only if dma is in progress
    dma: coh901318: avoid unbalanced locking
    dmaengine.h: remove redundant else keyword
    dma: of-dma: protect list write operation by spin_lock
    dmaengine: ste_dma40: do not remove descriptors for cyclic transfers
    dma: of-dma.c: fix memory leakage
    dw_dmac: apply default dma_mask if needed
    dmaengine: ioat - fix spare sparse complain
    dmaengine: move drivers/of/dma.c -> drivers/dma/of-dma.c
    ioatdma: fix race between updating ioat->head and IOAT_COMPLETION_PENDING
    dw_dmac: add support for Lynxpoint DMA controllers
    dw_dmac: return proper residue value
    dw_dmac: fill individual length of descriptor
    ...

    Linus Torvalds
     

16 Feb, 2013

1 commit

  • The following patch adds a GRE protocol offload handler so that
    skb_gso_segment() can segment GRE packets.
    An SKB GSO CB is added to keep track of the total header length so
    that skb_segment can push the entire header, e.g. in the case of GRE,
    skb_segment needs to push the inner and outer headers to every
    segment.
    A new NETIF_F_GRE_GSO feature is added for devices which support HW
    GRE TSO offload. Currently no device supports it, therefore GRE GSO
    always falls back to software GSO.

    [ Compute pkt_len before ip_local_out() invocation. -DaveM ]

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

14 Feb, 2013

3 commits

  • Patch cef401de7be8c4e (net: fix possible wrong checksum
    generation) fixed wrong checksum calculation but broke TSO by
    defining a new GSO type without a netdev feature for that type.
    net_gso_ok() would not allow hardware checksum/segmentation
    offload of such packets without the feature.

    The following patch fixes both TSO and the wrong checksum. It uses
    the same logic that Eric Dumazet used, introducing a new flag,
    SKBTX_SHARED_FRAG, set if at least one frag can be modified by the
    user; but the SKBTX_SHARED_FRAG flag is kept in the skb shared info
    tx_flags rather than in gso_type.

    tx_flags is a better fit than gso_type, since an skb can have a
    shared frag without being a GSO packet. It does not link SHARED_FRAG
    to GSO, so there is no need to define a netdev feature for this.
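
    A fragment of the resulting producer/consumer convention (hedged;
    context omitted, and the helper in the consumer branch is
    hypothetical):

    /* producer: user memory backs at least one frag */
    skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;

    /* consumer: a set flag means frags can change under us, so the
     * checksum must be computed in software before offloading */
    if (skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG)
        compute_checksum_in_software(skb); /* hypothetical helper */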

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • A timestamp can be set only if a socket is in repair mode.

    This patch adds a new socket option, TCP_TIMESTAMP, which allows
    getting and setting the current tcp timestamp.
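
    A minimal user-space sketch of the new option for checkpoint/restore
    (hedged: error handling trimmed; the fallback constant is from
    <linux/tcp.h>, and the set only succeeds in repair mode):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_TIMESTAMP
    #define TCP_TIMESTAMP 24   /* from <linux/tcp.h> */
    #endif

    static int copy_timestamp(int src_fd, int dst_fd)
    {
        unsigned int ts;
        socklen_t len = sizeof(ts);

        if (getsockopt(src_fd, IPPROTO_TCP, TCP_TIMESTAMP, &ts, &len))
            return -1;
        /* dst_fd must be in repair mode for this to succeed */
        return setsockopt(dst_fd, IPPROTO_TCP, TCP_TIMESTAMP,
                          &ts, sizeof(ts));
    }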

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     
  • This functionality is used for restoring tcp sockets. A tcp timestamp
    depends on how long a system has been running, so it differs for each
    host. The solution is to set a per-socket offset.

    A per-socket offset for a TIME_WAIT socket is inherited from the
    corresponding tcp socket.

    tcp_request_sock doesn't have a timestamp offset, because repair
    mode is not implemented for request sockets.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

06 Feb, 2013

1 commit

  • TCP Appropriate Byte Count was added by me, but later disabled.
    There is no point in maintaining it since it is a potential source
    of bugs and Linux already implements other better window protection
    heuristics.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

28 Jan, 2013

1 commit

  • Pravin Shelar mentioned that GSO could potentially generate a wrong
    TX checksum if an skb has fragments that are overwritten by the user
    between the checksum computation and transmit.

    He suggested linearizing skbs, but this extra copy can be avoided
    for normal tcp skbs cooked by tcp_sendmsg().

    This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in
    skb_shinfo(skb)->gso_type if at least one frag can be modified by
    the user.

    Typical sources of such possible overwrites are {vm}splice(),
    sendfile(), and the macvtap/tun/virtio_net drivers.

    Tested:

    $ netperf -H 7.7.8.84
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    7.7.8.84 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3959.52

    $ netperf -H 7.7.8.84 -t TCP_SENDFILE
    TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
    port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3216.80

    Performance of SENDFILE is impacted by the extra allocation and
    copy, and because we use order-0 pages for it, while TCP_STREAM uses
    bigger pages.

    Reported-by: Pravin Shelar
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jan, 2013

2 commits

  • Under unusual circumstances, TCP collapse can split a big GRO TCP
    packet while it is being used in a splice(socket->pipe) operation.

    skb_splice_bits() releases the socket lock before calling
    splice_to_pipe().

    [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
    [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
    [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13

    To fix this problem, we must eat skbs in tcp_recv_skb().

    Remove the inline keyword from tcp_recv_skb() definition since
    it has three call sites.

    Reported-by: Christian Becker
    Cc: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
    added a regression.

    [ 83.843570] INFO: rcu_sched self-detected stall on CPU
    [ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
    [ 83.844582] Task dump for CPU 6:
    [ 83.844584] netperf R running task 0 8966 8952 0x0000000c
    [ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
    [ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
    [ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
    [ 83.844594] Call Trace:
    [ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0
    [ 83.844601] [] ? schedule+0x29/0x70
    [ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50
    [ 83.844610] [] ? tcp_read_sock+0xda/0x260
    [ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0
    [ 83.844615] [] ? tcp_splice_read+0xc0/0x250
    [ 83.844618] [] ? sock_splice_read+0x22/0x30
    [ 83.844622] [] ? do_splice_to+0x7b/0xa0
    [ 83.844627] [] ? sys_splice+0x59c/0x5d0
    [ 83.844630] [] ? putname+0x2b/0x40
    [ 83.844633] [] ? do_sys_open+0x174/0x1e0
    [ 83.844636] [] ? system_call_fastpath+0x16/0x1b

    If recv_actor() returns 0, we should stop immediately, because
    looping won't give a chance to drain the pipe.
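
    In code, the fix amounts to something like this hedged sketch of the
    loop body in tcp_read_sock() (heavily simplified):

    int used = recv_actor(desc, skb, offset, len);

    if (used <= 0) {          /* 0 now stops the loop too */
        if (!copied)
            copied = used;
        break;                /* spinning here cannot drain the pipe */
    }
    copied += used;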

    Signed-off-by: Eric Dumazet
    Cc: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

03 Dec, 2012

1 commit

  • TCP coalescing added a regression in splice(socket->pipe) performance
    for some workloads, because of the way tcp_read_sock() is implemented.

    The reason for this is the break when (offset + 1 != skb->len).

    As we released the socket lock, this condition is possible if the TCP
    stack added a fragment to the skb, which can happen with TCP
    coalescing.

    So let's go back to the beginning of the loop when this happens,
    to give a chance to splice more frags per system call.

    Doing so fixes the issue and makes GRO 10% faster than LRO
    on CPU-bound splice() workloads instead of the opposite.

    Signed-off-by: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willy Tarreau
     

02 Dec, 2012

2 commits

  • Recent network changes allowed high order pages being used
    for skb fragments.

    This uncovered a bug in do_tcp_sendpages() which was assuming its caller
    provided an array of order-0 page pointers.

    We only have to deal with a single page in this function, and its order
    is irrelevant.

    Reported-by: Willy Tarreau
    Tested-by: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • As time passed, available memory increased faster than the number of
    concurrent tcp sockets.

    As a result, a machine with 4 GB of RAM gets a hash table
    with 524288 slots, using 8388608 bytes of memory.

    Let's change that by a 16x factor (one slot per 128 KB of RAM).

    Even if a small machine needs a _lot_ of sockets, tcp lookups are now
    very efficient, using one cache line per socket.
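
    For scale: at the old ratio of one slot per 8 KB, 4 GiB / 8 KiB =
    524288 slots, which is the quoted 8388608 bytes at 16 bytes per
    slot; at one slot per 128 KB, the same machine gets 32768 slots,
    roughly 512 KB of table.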

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace, to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
    CAP_NET_ADMIN) and ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device, where
    restrictions make no difference, or the network device is a hardware
    NIC that has been explicitly moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.
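
    The recurring pattern of the conversion, shown as a short fragment
    (context omitted):

    /* before: only init_user_ns root passed this check */
    if (!capable(CAP_NET_ADMIN))
        return -EPERM;

    /* after: root in the user namespace owning this netns passes too */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;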

    Allow creating raw sockets.
    Allow the SIOCSARP ioctl to control the arp cache.
    Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
    Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
    Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
    Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
    Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
    Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting gre tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipip tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipsec virtual tunnel interfaces.

    Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
    MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
    sockets.

    Allow setting and receiving IPOPT_CIPSO, IPOPT_SEC, IPOPT_SID and
    arbitrary ip options.

    Allow setting the IP_IPSEC_POLICY/IP_XFRM_POLICY ipv4 socket options.
    Allow setting the IP_TRANSPARENT ipv4 socket option.
    Allow setting the TCP_REPAIR socket option.
    Allow setting the TCP_CONGESTION socket option.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

16 Nov, 2012

1 commit

  • Currently, if a socket was repaired with a few packets in the write
    queue, a kernel bug may be triggered:

    kernel BUG at net/ipv4/tcp_output.c:2330!
    RIP: 0010:[] tcp_retransmit_skb+0x5ff/0x610

    According to the initial implementation (v3.4-rc2-963-gc0e88ff),
    all skbs should look as if they were already posted. This patch fixes
    the code accordingly.

    Three points were not handled in the initial patch:
    1. The tcp send head should not be changed.
    2. The TSO state of an skb must be initialized.
    3. The retransmission time must be reset.

    This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
    passes through the usual path, but isn't sent to the network. This
    patch solves all the described problems and also handles
    tcp_sendpages.

    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrew Vagin
     

23 Oct, 2012

2 commits

  • Add a bit, TCPI_OPT_SYN_DATA (32), to the socket option
    TCP_INFO:tcpi_options. It's set if the data in a SYN (sent or
    received) is acked by the SYN-ACK. A server or client application can
    use this information to check the Fast Open success rate.
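
    A minimal user-space check (hedged: error handling trimmed, and the
    TCPI_OPT_SYN_DATA fallback value is taken from the text above):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>   /* struct tcp_info, TCP_INFO */

    #ifndef TCPI_OPT_SYN_DATA
    #define TCPI_OPT_SYN_DATA 32
    #endif

    static int syn_data_acked(int fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len))
            return -1;
        return !!(info.tcpi_options & TCPI_OPT_SYN_DATA);
    }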

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • SIOCINQ can use the lock_sock_fast() version to avoid double acquisition
    of socket lock.
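
    The pattern in question, as a fragment (illustrative; the byte count
    shown stands in for the real SIOCINQ computation):

    bool slow = lock_sock_fast(sk); /* cheap path when no backlog work */

    answ = tp->rcv_nxt - tp->copied_seq;
    unlock_sock_fast(sk, slow);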

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2012

1 commit

  • tcp_ioctl() tries to take into account whether a tcp socket received
    a FIN, to report the correct number of bytes in the receive queue.

    But it's flaky, because if the application ate the last skb, we
    return 1 instead of 0.

    The correct way to detect that a FIN was received is to test
    SOCK_DONE.
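
    A standalone model of the corrected accounting (illustrative only,
    not the kernel code):

    /* The FIN consumes one sequence number but is not readable data,
     * so once the connection is done it must be excluded from the
     * count instead of being guessed from the last queued skb. */
    static unsigned int readable_bytes(unsigned int rcv_nxt,
                                       unsigned int copied_seq, int done)
    {
        unsigned int answ = rcv_nxt - copied_seq;

        if (done && answ)
            answ--;   /* discount the FIN's sequence number */
        return answ;
    }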

    Reported-by: Elliot Hughes
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Sep, 2012

1 commit

  • We currently use a per-socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
    page allocator more than wanted.

    This patch adds a per-task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on the loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE=65536.

    Successfully tested on various ethernet devices
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4).
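
    A fragment showing how the allocator is consumed after this change
    (helper names per the patch; error paths and context omitted):

    struct page_frag *pfrag = sk_page_frag(sk); /* per-task frag, usually */

    if (!sk_page_frag_refill(sk, pfrag))
        goto wait_for_memory;   /* memory pressure: fall back and wait */

    /* copy user data into pfrag->page at pfrag->offset, attach that
     * region to the skb as a frag, then advance pfrag->offset */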

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2012

2 commits

  • rcv_wscale is a parameter symmetric with snd_wscale.

    Both of these parameters are set during the connection handshake.

    Without this value, a remote window size cannot be interpreted
    correctly, because a value from a packet must be shifted by
    rcv_wscale.

    One more thing: wscale_ok should be set too.

    This patch doesn't break backward compatibility. If someone uses the
    old scheme, the receive window will be restored with the same bug
    (rcv_wscale = 0).

    v2: Preserve backward compatibility on big-endian systems. Before,
    the first two bytes were snd_wscale and the second two bytes were
    rcv_wscale. Now snd_wscale is opt_val & 0xFFFF and rcv_wscale is
    opt_val >> 16. This approach is independent of byte ordering.
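
    A user-space sketch of restoring the scales with the new packing
    (hedged: error handling trimmed, the socket must be in repair mode,
    and TCPOPT_WINDOW is simply TCP option kind 3, defined locally):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>  /* TCP_REPAIR_OPTIONS, struct tcp_repair_opt */

    #ifndef TCPOPT_WINDOW
    #define TCPOPT_WINDOW 3
    #endif

    static int restore_wscale(int fd, unsigned snd_wscale,
                              unsigned rcv_wscale)
    {
        struct tcp_repair_opt opt = {
            .opt_code = TCPOPT_WINDOW,
            .opt_val  = snd_wscale | (rcv_wscale << 16),
        };

        return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_OPTIONS,
                          &opt, sizeof(opt));
    }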

    Cc: David S. Miller
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    CC: Pavel Emelyanov
    Signed-off-by: Andrew Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrey Vagin
     
  • Signed-off-by: Christoph Paasch
    Acked-by: H.K. Jerry Chu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     

20 Sep, 2012

1 commit

  • If the recv() syscall is called on a TCP socket such that
    - IOAT DMA is used
    - the MSG_WAITALL flag is used
    - the requested length is bigger than sk_rcvbuf
    - enough data has already arrived to bring rcv_wnd to zero
    then when tcp_recvmsg() gets to calling sk_wait_data(), the receive
    window can still be zero while sk_async_wait_queue holds enough data
    to keep it zero. As this queue isn't cleaned until the
    tcp_service_net_dma() call, sk_wait_data() cannot receive any data
    and blocks forever.

    If a zero receive window and a non-empty sk_async_wait_queue are
    detected before calling sk_wait_data(), process the queue first.

    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubeček
     

01 Sep, 2012

1 commit

  • This patch builds on top of the previous patch to add support
    for TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler that uses the timer
    directly off the TFO socket rather than the listener socket. Note
    that the TFO server side will not retransmit anything other than a
    SYN-ACK until the 3WHS is completed.

    The patch also contains an important function,
    "reqsk_fastopen_remove()", to manage the somewhat complex
    relationship between a listener, its request_sock, and the
    corresponding child socket. See the comment above the function for
    the details.
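
    A minimal user-space sketch of the server-side knob (hedged: error
    handling trimmed; the value is the maximum number of pending TFO
    requests, i.e. child sockets accepted before their 3WHS completes):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23   /* from <linux/tcp.h> */
    #endif

    static int enable_tfo(int listen_fd)
    {
        int qlen = 16;   /* queue length for not-yet-completed TFO reqs */

        return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }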

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
