25 Aug, 2012

1 commit

  • The cwnd reduction in fast recovery is based on the number of packets
    newly delivered per ACK. For non-SACK connections every DUPACK signifies
    that a packet has been delivered, but the sender mistakenly skips
    counting them for the cwnd reduction.

    The fix is to compute newly_acked_sacked after DUPACKs are accounted
    in sacked_out for non-SACK connections.
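
    A minimal userspace model (not the kernel code) of the ordering the fix
    establishes: the DUPACK is counted into sacked_out first, so that the
    per-ACK delivery count used for the cwnd reduction reflects it. All
    structure and variable names here are illustrative.

    #include <stdio.h>

    struct conn {
        int sacked_out;   /* packets presumed delivered via DUPACKs */
    };

    static int newly_delivered(struct conn *c, int pkts_acked, int is_dupack)
    {
        int prior_sacked = c->sacked_out;

        if (is_dupack)
            c->sacked_out++;    /* account the DUPACK first ... */

        /* ... so the delta below reflects it (the bug computed this earlier) */
        return pkts_acked + c->sacked_out - prior_sacked;
    }

    int main(void)
    {
        struct conn c = { .sacked_out = 0 };

        /* A pure DUPACK acks no new data but still signals one delivery. */
        printf("delivered on DUPACK: %d\n", newly_delivered(&c, 0, 1)); /* 1 */
        return 0;
    }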

    Signed-off-by: Yuchung Cheng
    Acked-by: Nandita Dukkipati
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in the dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2012

2 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: fix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the form
    factor or maintenance costs do not allow the use of disks, and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PF_MEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 bolts filesystem-specific swapfile support
    onto the side and that the default handlers have different information
    available to them than the filesystem does. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device is
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens reclaimed to avoid
    accounting errors until the bug is fixed.
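
    A minimal sketch in simplified userspace C of the receive-side exemption
    described above: a socket flagged for memalloc use keeps receiving even
    when ordinary sockets would be dropped for exceeding the rmem limit. The
    struct and field names are stand-ins, not the kernel's.

    #include <stdbool.h>

    struct toy_sock {
        long rmem_alloc;   /* receive memory currently charged to the socket */
        long rcvbuf;       /* per-socket receive buffer limit */
        bool memalloc;     /* stands in for the SOCK_MEMALLOC flag */
    };

    /* Should an incoming skb of 'size' bytes be dropped for memory reasons? */
    static bool toy_rmem_should_drop(const struct toy_sock *sk, long size)
    {
        /* SOCK_MEMALLOC sockets are exempted from the rmem limit so that
         * swap traffic can still make progress and eventually free memory. */
        if (sk->memalloc)
            return false;

        return sk->rmem_alloc + size > sk->rcvbuf;
    }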

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    crashes happen on tcp workloads if routes are added/deleted at the same
    time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    We need regular dst refcounting (dst_release()) instead, and must make
    sure dst_release() is aware of RCU grace periods:

    Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
    before dst destruction for cached dst

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we don't increase a zero refcount (on a dst currently
    waiting for an RCU grace period before destruction)
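
    The "take a reference only if the refcount is still non-zero" pattern
    mentioned above can be sketched with C11 atomics; this models the idea
    behind atomic_inc_not_zero(), it is not the kernel's dst code.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Succeed only while the object still has references, i.e. it is not
     * already on its way to destruction after an RCU grace period. */
    static bool ref_inc_not_zero(atomic_int *refcnt)
    {
        int old = atomic_load(refcnt);

        while (old != 0) {
            if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
                return true;   /* reference taken */
            /* a failed CAS refreshed 'old'; retry */
        }
        return false;          /* refcount already hit zero: hands off */
    }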

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jul, 2012

2 commits

  • Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
    added support for receive offloading to IOAT dma engine if available.

    The code in tcp_rcv_established() tries to perform early DMA copy if
    applicable. It however does so without checking whether the userspace
    task is actually expecting the data in the buffer.

    This is not a problem under normal circumstances, but there is a corner
    case where this doesn't work -- and that's when MSG_TRUNC flag to
    recvmsg() is used.

    If the IOAT dma engine is not used, the code properly checks whether
    there is a valid ucopy.task and the socket is owned by userspace, but
    misses the check in the dmaengine case.

    This problem can be trivially observed in practice; 'tbench', for
    example, is a good reproducer, as it makes heavy use of MSG_TRUNC. On
    systems utilizing IOAT, you will soon find tbench waiting indefinitely in
    sk_wait_data(), because the data has already been early-copied in
    tcp_rcv_established() using the dma engine.

    This patch introduces the same check we are performing in the simple
    iovec copy case to the IOAT case as well. It fixes the indefinite
    recvmsg(MSG_TRUNC) hangs.
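
    The eligibility test described above can be written as a standalone
    predicate; the field names mirror the description (ucopy.task, socket
    owned by user) but this is an illustrative model, not the kernel code.

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_ucopy {
        void *task;    /* the user task waiting in recvmsg(), if any */
        size_t len;    /* room left in the user-supplied buffer */
    };

    struct toy_tcp_sock {
        struct toy_ucopy ucopy;
        bool owned_by_user;    /* stands in for sock_owned_by_user(sk) */
    };

    /* Only attempt the early copy (iovec or DMA) when userspace is really
     * expecting the data; otherwise fall back to normal receive queueing. */
    static bool toy_early_copy_ok(const struct toy_tcp_sock *tp, size_t payload)
    {
        return tp->ucopy.task != NULL &&
               payload <= tp->ucopy.len &&
               tp->owned_by_user;
    }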

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     
  • commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.)
    invalidated TCP early demux, because rx_dst_ifindex is not properly
    initialized and checked.

    Also remove the use of inet_iif(skb) in favor of skb->skb_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Jul, 2012

1 commit

  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the ipv4
    specific code, and rightly so, since doing the route caching for ipv6 is
    pointless at the moment: it is not inspected in the ipv6 input paths yet.

    Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jul, 2012

4 commits

  • In trusted networks, e.g., intranet, data-center, the client does not
    need to use a Fast Open cookie to mitigate DoS attacks. In cookie-less
    mode, sendmsg() with the MSG_FASTOPEN flag will send SYN-data regardless
    of cookie availability.
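
    A minimal client-side usage example: with MSG_FASTOPEN, sendto() on an
    unconnected socket replaces the usual connect() + send() pair and lets
    the kernel carry the data in the SYN when possible. The address is a
    placeholder and error handling is reduced to the bare minimum.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000    /* older libc headers may lack it */
    #endif

    int main(void)
    {
        const char req[] = "GET / HTTP/1.0\r\n\r\n";
        struct sockaddr_in srv = { 0 };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        srv.sin_family = AF_INET;
        srv.sin_port = htons(80);
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);   /* example address */

        /* No connect(): the payload rides in the SYN if Fast Open is
         * possible, otherwise the kernel falls back to a normal handshake. */
        if (sendto(fd, req, strlen(req), MSG_FASTOPEN,
                   (struct sockaddr *)&srv, sizeof(srv)) < 0)
            perror("sendto(MSG_FASTOPEN)");

        close(fd);
        return 0;
    }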

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On paths with firewalls dropping SYN with data or experimental TCP options,
    Fast Open connections will experience SYN timeouts and bad performance.
    The solution is to track such incidents in the cookie cache and disable
    Fast Open temporarily.

    Since only the original SYN includes data and/or the Fast Open option, the
    SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
    such drops. If a path has recurring Fast Open SYN drops, Fast Open is
    disabled for 2^(recurring_losses) minutes, starting from four minutes up
    to roughly one and a half days. sendmsg with the MSG_FASTOPEN flag will
    still succeed, but it behaves as connect() followed by write().
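
    The backoff schedule above works out as follows (a sketch only: 2^n
    minutes per recurring loss, with four minutes corresponding to n=2 and
    roughly a day and a half to n=11; the mapping of n to the stored loss
    counter and the exact cap are assumptions, not taken from the patch).

    #include <stdio.h>

    int main(void)
    {
        for (int n = 2; n <= 11; n++) {
            long minutes = 1L << n;
            printf("n=%2d  disable Fast Open for %4ld min (%.1f h)\n",
                   n, minutes, minutes / 60.0);
        }
        return 0;
    }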

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On receiving the SYN-ACK after SYN-data, the client needs to
    a) update the cached MSS and cookie (if included in SYN-ACK)
    b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
    the handshake.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce (only two) experimental option
    codepoints available for TCP (the on-wire layout is sketched after
    this list).

    When the draft status becomes standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers during the transition.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function
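
    A sketch of the on-wire layout described in item 1: kind 254, a length
    byte, the 16-bit magic 0xF989, then the cookie bytes (real TCP options
    are additionally padded to a 32-bit boundary). This encoder is
    illustrative and is not taken from the patch.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TCPOPT_EXP            254       /* shared experimental option kind */
    #define TCPOPT_FASTOPEN_MAGIC 0xF989

    /* Write kind/len/magic/cookie into 'buf'; returns the option length. */
    static size_t encode_tfo_exp_option(uint8_t *buf, const uint8_t *cookie,
                                        size_t cookie_len)
    {
        size_t len = 2 + 2 + cookie_len;    /* kind + len + magic + cookie */

        buf[0] = TCPOPT_EXP;
        buf[1] = (uint8_t)len;
        buf[2] = TCPOPT_FASTOPEN_MAGIC >> 8;      /* 0xF9 */
        buf[3] = TCPOPT_FASTOPEN_MAGIC & 0xff;    /* 0x89 */
        memcpy(buf + 4, cookie, cookie_len);
        return len;
    }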

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

19 Jul, 2012

1 commit

  • Followup of commit 0c24604b68fc (tcp: implement RFC 5961 4.2)

    As reported by Vijay Subramanian, we should send a challenge ACK
    instead of a dup ack if a SYN flag is set on a packet received out of
    window.

    This permits the ratelimiting to work as intended, and the correct SNMP
    counters to be incremented.

    Suggested-by: Vijay Subramanian
    Signed-off-by: Eric Dumazet
    Acked-by: Vijay Subramanian
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jul, 2012

3 commits

  • Implement the RFC 5961 mitigation against Blind
    Reset attacks using the SYN bit.

    Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
    incoming packet, instead of resetting the session.

    Add a new SNMP counter to count the number of challenge ACKs sent
    in response to SYN packets.
    (netstat -s | grep TCPSYNChallenge)

    Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session
    because of a SYN flag.

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the RFC 5961 mitigation against Blind
    Reset attacks using the RST bit.

    The idea is to validate the incoming RST sequence number against the
    exact RCV.NXT value, instead of the previously accepted window:
    (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

    If the sequence is in the window but not an exact match, send
    a "challenge ACK", so that the other peer can resend an
    RST with the appropriate sequence.
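
    The acceptance logic can be sketched as a standalone predicate using
    32-bit sequence arithmetic. This models the description above (exact
    RCV.NXT match resets, in-window but inexact triggers a challenge ACK,
    everything else is dropped); it does not quote the kernel code.

    #include <stdint.h>
    #include <stdio.h>

    /* Standard mod-2^32 sequence comparison, as used throughout TCP. */
    static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }

    enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_ACCEPT };

    static enum rst_action classify_rst(uint32_t seq, uint32_t rcv_nxt,
                                        uint32_t rcv_wnd)
    {
        if (seq == rcv_nxt)
            return RST_ACCEPT;           /* exact match: genuine reset */
        if (!seq_before(seq, rcv_nxt) && seq_before(seq, rcv_nxt + rcv_wnd))
            return RST_CHALLENGE_ACK;    /* in window but not exact */
        return RST_DROP;                 /* out of window */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               classify_rst(1000, 1000, 65535),   /* accept    */
               classify_rst(2000, 1000, 65535),   /* challenge */
               classify_rst(900,  1000, 65535));  /* drop      */
        return 0;
    }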

    Add a new sysctl, tcp_challenge_ack_limit, to limit the
    number of challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Add three SNMP TCP counters, to better track TCP behavior
    at the global level (netstat -s), when packets are received
    Out Of Order (OFO)

    TCPOFOQueue : Number of packets queued in OFO queue

    TCPOFODrop : Number of packets meant to be queued in OFO
    but dropped because the socket rcvbuf limit was hit.

    TCPOFOMerge : Number of packets in OFO that were merged with
    other packets.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Jul, 2012

1 commit

  • When a dst_confirm() happens, mark the confirmation as pending in the
    dst. Then on the next packet out, when we have the neigh in-hand, do
    the update.

    This removes the dependency in dst_confirm() of dst's having an
    attached neigh.

    While we're here, remove the explicit 'dst' NULL check; all except 2
    or 3 call sites ensure it's not NULL, so just fix those cases up.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.

    Signed-off-by: David S. Miller

    David S. Miller
     

24 May, 2012

1 commit

  • Sergio Correia reported the following warning:

    WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()

    WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
    "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
    tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

    It appears TCP coalescing, and more specifically commit b081f85c297
    (net: implement tcp coalescing in tcp_queue_rcv()) should take care of
    possible segment overlaps in the receive queue. This was properly done in
    the case of the out_of_order_queue by the caller.

    For example, the segment at the tail of the queue has sequence 1000-2000,
    and we add a segment with sequence 1500-2500.
    This can happen in the case of retransmits.

    In this case, just don't do the coalescing.
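
    With the numbers from the example, the required check is simply that the
    new segment must start at or after the end of the tail segment. A small
    standalone model (not the kernel code):

    #include <stdint.h>
    #include <stdio.h>

    static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }

    /* Coalesce only if the new segment does not overlap the tail segment. */
    static int can_coalesce(uint32_t tail_end_seq, uint32_t new_seq)
    {
        return !seq_before(new_seq, tail_end_seq);
    }

    int main(void)
    {
        /* Tail covers 1000-2000; a retransmit covering 1500-2500 overlaps,
         * so it must not be coalesced onto the tail. */
        printf("coalesce seq 1500 onto tail ending 2000? %d\n",
               can_coalesce(2000, 1500));   /* 0: skip coalescing */
        printf("coalesce seq 2000 onto tail ending 2000? %d\n",
               can_coalesce(2000, 2000));   /* 1: safe */
        return 0;
    }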

    Reported-by: Sergio Correia
    Signed-off-by: Eric Dumazet
    Tested-by: Sergio Correia
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 May, 2012

1 commit

  • Move tcp_try_coalesce() protocol independent part to
    skb_try_coalesce().

    skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
    to build optimized skbs (less sk_buff, and possibly less 'headers')

    skb_try_coalesce() is zero copy, unless the copy can fit in the
    destination header (it's a rare case)

    kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
    because IPv6 will need it in patch (ipv6: use skb coalescing in
    reassembly).

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2012

3 commits

  • As proposed by Eric, make the tcp_input.o thinner.

    add/remove: 1/1 grow/shrink: 1/4 up/down: 868/-1329 (-461)
    function                  old     new   delta
    tcp_try_rmem_schedule       -     864    +864
    tcp_ack                  4811    4815      +4
    tcp_validate_incoming     817     815      -2
    tcp_collapse              860     858      -2
    tcp_send_rcvq             555     353    -202
    tcp_data_queue           3435    3033    -402
    tcp_prune_queue           721       -    -721

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • As noted by Eric, no checks are performed on the data size we're
    putting in the read queue during repair. Thus, validate the given
    data size with the common rmem management routine.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • It actually works on the input queue and will use its read mem
    routines, thus it's better to have it in the tcp_input.c file.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

08 May, 2012

1 commit

  • Conflicts:
    drivers/net/ethernet/intel/e1000e/param.c
    drivers/net/wireless/iwlwifi/iwl-agn-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans.h

    Resolved the iwlwifi conflict with mainline using 3-way diff posted
    by John Linville and Stephen Rothwell. In 'net' we added a bug
    fix to make iwlwifi report a more accurate skb->truesize but this
    conflicted with RX path changes that happened meanwhile in net-next.

    In e1000e a conflict arose in the validation code for settings of
    adapter->itr. 'net-next' had more sophisticated logic so that
    logic was used.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 May, 2012

1 commit

  • This patch adds support for a skb_head_is_locked helper function. It is
    meant to be used any time we are considering transferring the head from
    skb->head to a paged frag. If the head is locked it means we cannot remove
    the head from the skb so it must be copied or we must take the skb as a
    whole.
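
    A sketch of what such a helper plausibly checks, based on the description
    above: the head is "locked" when it is not a page-backed head fragment or
    when other clones share it. Simplified stand-in types are used; this is
    an assumption-labelled sketch, not necessarily the exact kernel code.

    #include <stdbool.h>

    struct toy_skb {
        bool head_frag;    /* head was allocated from a page fragment */
        bool cloned;       /* other clones share this head via dataref */
    };

    /* The head cannot be stolen and turned into a paged frag unless it is a
     * head fragment and is unshared. */
    static bool toy_skb_head_is_locked(const struct toy_skb *skb)
    {
        return !skb->head_frag || skb->cloned;
    }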

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

03 May, 2012

10 commits

  • This change cleans up the last bits of tcp_try_coalesce so that we only
    need one goto which jumps to the end of the function. The idea is to make
    the code more readable by putting things in a linear order so that we start
    execution at the top of the function, and end it at the bottom.

    I also made a slight tweak to the code for handling frags when we are a
    clone. Instead of an "if (clone)" branch containing the loop and an else
    that sets nr_frags to 0, I changed the logic so that if (!clone) we just
    set the number of frags to 0, which disables the for loop anyway.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This change reorders the code related to the use of an skb->head_frag so it
    is placed before we check the rest of the frags. This allows the code to
    read more linearly instead of like some sort of loop.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch addresses several issues in the way we were tracking the
    truesize in tcp_try_coalesce.

    First it was using ksize which prevents us from having a 0 sized head frag
    and getting a usable result. To resolve that this patch uses the end
    pointer which is set based off either ksize, or the frag_size supplied in
    build_skb. This allows us to compute the original truesize of the entire
    buffer and remove that value leaving us with just what was added as pages.

    The second issue was the use of skb->len if there is a mergeable head frag.
    We should only need to remove the size of a data-aligned sk_buff from our
    current skb->truesize to compute the delta for a buffer with a reused head.
    By using skb->len the value of truesize was being artificially reduced,
    which means that head frags could use more memory than buffers using
    standard allocations.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This change is meant to prevent stealing the skb->head to use as a page in
    the event that the skb->head was cloned. This allows the other clones to
    track each other via shinfo->dataref.

    Without this we break down to two methods for tracking the reference count,
    one being dataref, the other being the page count. As a result it becomes
    difficult to track how many references there are to skb->head.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Extend tcp coalescing by implementing it from tcp_queue_rcv(), the main
    receiver function when the application is not blocked in recvmsg().

    Function tcp_queue_rcv() is moved a bit to allow its call from
    tcp_data_queue()

    This gives good results, especially if GRO could not kick in and if the
    skb head is a fragment.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Before stealing fragments or skb head, we must make sure skbs are not
    cloned.

    Alexander was worried about the destination skb being cloned: in bridge
    setups, a driver could be fooled if skb->data_len did not match the skb
    nr_frags.

    If source skb is cloned, we must take references on pages instead.

    Bug happened using tcpdump (if not using mmap())

    Introduce kfree_skb_partial() helper to cleanup code.
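
    The semantics of such a partial free can be sketched with a toy model
    (assumed behaviour, not the kernel function): when the head has been
    stolen its buffer lives on in the destination skb, so only the skb shell
    is released; otherwise both the shell and its data are freed as usual.

    #include <stdbool.h>
    #include <stdlib.h>

    struct toy_skb {
        void *head;    /* the data buffer, possibly stolen by coalescing */
    };

    static void toy_kfree_skb_partial(struct toy_skb *skb, bool head_stolen)
    {
        if (!head_stolen)
            free(skb->head);    /* nobody took the buffer: free it too */
        free(skb);              /* the shell is always released */
    }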

    Reported-by: Alexander Duyck
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • tcp_adv_win_scale default value is 2, meaning we expect a good citizen
    skb to have a skb->len / skb->truesize ratio of 75% (3/4)

    In 2.6 kernels we (mis)accounted for a typical MSS=1460 frame:
    1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
    So these skbs were considered as not bloated.

    With recent truesize fixes, a typical MSS=1460 frame truesize is now the
    more precise:
    2048 + 256 = 2304. But 2304 * 3/4 = 1728.
    So these skbs are not good citizens anymore, because 1460 < 1728

    (GRO can escape this problem because it builds skbs with a too low
    truesize.)

    This also means tcp advertises a too optimistic window for a given
    allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
    sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
    especially when application is slow to drain its receive queue or in
    case of losses (netperf is fast, scp is slow). This is a major latency
    source.

    We should adjust the len/truesize ratio to 50% instead of 75%

    This patch:

    1) changes tcp_adv_win_scale default to 1 instead of 2

    2) increases tcp_rmem[2] limit from 4MB to 6MB to take into account
    better truesize tracking and to allow the autotuned tcp receive window to
    reach the same value as before. Note that the same amount of kernel
    memory is consumed compared to 2.6 kernels.
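
    The effect of the default change can be checked with the usual
    window-from-space arithmetic, where a positive tcp_adv_win_scale reserves
    1/2^scale of the receive space for overhead (scale 2 advertises 3/4 of
    the space, scale 1 advertises 1/2). A minimal standalone demo of that
    arithmetic applied to the truesize numbers quoted above:

    #include <stdio.h>

    /* Sketch of the win-from-space arithmetic for positive scale values. */
    static int win_from_space(int space, int adv_win_scale)
    {
        return space - (space >> adv_win_scale);
    }

    int main(void)
    {
        /* Old estimate vs. current truesize of an MSS=1460 frame, and what
         * a 3/4 (scale 2) vs 1/2 (scale 1) ratio expects from each. */
        printf("scale 2: 1856 -> %d, 2304 -> %d\n",
               win_from_space(1856, 2), win_from_space(2304, 2)); /* 1392, 1728 */
        printf("scale 1: 2304 -> %d\n", win_from_space(2304, 1)); /* 1152 */
        return 0;
    }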

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the advanced early retransmit (sysctl_tcp_early_retrans==2),
    which delays the fast retransmit by an interval of RTT/4. We borrow the
    RTO timer to implement the delay. If we receive another ACK or send
    a new packet, the timer is cancelled and restored to original RTO
    value offset by time elapsed. When the delayed-ER timer fires,
    we enter fast recovery and perform fast retransmit.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) if fewer than 4 packets are
    outstanding, to recover losses by fast recovery instead of a timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK is
    enabled. I would be happy to merge the THIN_DUPACK feature with ER if
    people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
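
    A small standalone model of the dupthresh reduction (mode 1) described
    above; it follows the text rather than quoting the kernel, and the guard
    for a single outstanding packet is a modelling choice.

    #include <stdio.h>

    #define TCP_FASTRETRANS_THRESH 3    /* classic DUPACK threshold */

    static int dupthresh(int packets_out, int early_retrans_enabled)
    {
        /* With fewer than four packets in flight the classic threshold of
         * three DUPACKs may never be reached, so lower it to packets_out - 1. */
        if (early_retrans_enabled && packets_out < 4)
            return packets_out > 1 ? packets_out - 1 : 1;
        return TCP_FASTRETRANS_THRESH;
    }

    int main(void)
    {
        for (int out = 1; out <= 5; out++)
            printf("packets_out=%d -> dupthresh=%d\n", out, dupthresh(out, 1));
        return 0;
    }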

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This is a preparation patch that refactors the code to enter recovery
    into a new function tcp_enter_recovery(). It's needed to implement
    the delayed fast retransmit in ER.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng