06 Oct, 2014
1 commit
-
This adds a device level support for Geneve -- Generic Network
Virtualization Encapsulation. The protocol is documented at
http://tools.ietf.org/html/draft-gross-geneve-01Only protocol layer Geneve support is provided by this driver.
Openvswitch can be used for configuring, set up and tear down
functional Geneve tunnels.Signed-off-by: Jesse Gross
Signed-off-by: Andy Zhou
Signed-off-by: David S. Miller
29 Sep, 2014
1 commit
-
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switchesThe basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constantThe resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:W := (1 - (alpha / 2)) * W
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.The algorithm itself has already seen deployments in large production
data centers since then.We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.The senders ran iperf -c -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.1) Timeouts (total over all flows, and per flow summaries):
CUBIC DCTCP
Total 3227 25
Mean 169.842 1.316
Median 183 1
Max 207 5
Min 123 0
Stddev 28.991 1.600Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.2) Throughput (per flow in Mbps):
CUBIC DCTCP
Mean 521.684 521.895
Median 464 523
Max 776 527
Min 403 519
Stddev 105.891 2.601
Fairness 0.962 0.999Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.3) Latency (in ms):
CUBIC DCTCP
Mean 4.0088 0.04219
Median 4.055 0.0395
Max 4.2 0.085
Min 3.32 0.028
Stddev 0.1666 0.01064Latency for each protocol was computed by running "ping -i 0.2
" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 " was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.CUBIC DCTCP
Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2
0 9.93 0 0 9.92 0
0.5 9.87 0 0.5 9.86 0
1 8.73 2.25 1 6.46 4.88
1.5 7.29 2.8 1.5 4.9 4.99
2 6.96 3.1 2 4.92 4.94
2.5 6.67 3.34 2.5 4.93 5
3 6.39 3.57 3 4.92 4.99
3.5 6.24 3.75 3.5 4.94 4.74
4 6 3.94 4 5.34 4.71
4.5 5.88 4.09 4.5 4.99 4.97
5 5.27 4.98 5 4.83 5.01
5.5 4.93 5.04 5.5 4.89 4.99
6 4.9 4.99 6 4.92 5.04
6.5 4.93 5.1 6.5 4.91 4.97
7 4.28 5.8 7 4.97 4.97
7.5 4.62 4.91 7.5 4.99 4.82
8 5.05 4.45 8 5.16 4.76
8.5 5.93 4.09 8.5 4.94 4.98
9 5.73 4.2 9 4.92 5.02
9.5 5.62 4.32 9.5 4.87 5.03
10 6.12 3.2 10 4.91 5.01
10.5 6.91 3.11 10.5 4.87 5.04
11 8.48 0 11 8.49 4.94
11.5 9.87 0 11.5 9.9 0SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.Competing Flows 1 | 2 | 4 | 8 | 16
------------------------------
Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0
Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0As the number of competing flows moves beyond 1, the connection
probability drops rapidly.Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
More details however, we refer you to the paper [2] under section 3).There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.Changes in the CA framework code are minimal, and DCTCP algorithm
operates on mechanisms that are already available in most Silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann
Signed-off-by: Florian Westphal
Signed-off-by: Glenn Judd
Acked-by: Stephen Hemminger
Signed-off-by: David S. Miller
20 Sep, 2014
1 commit
-
This patch provides a receive path for foo-over-udp. This allows
direct encapsulation of IP protocols over UDP. The bound destination
port is used to map to an IP protocol, and the XFRM framework
(udp_encap_rcv) is used to receive encapsulated packets. Upon
reception, the encapsulation header is logically removed (pointer
to transport header is advanced) and the packet is reinjected into
the receive path with the IP protocol indicated by the mapping.Netlink is used to configure FOU ports. The configuration information
includes the port number to bind to and the IP protocol corresponding
to that port.This should support GRE/UDP
(http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
as will as the other IP tunneling protocols (IPIP, SIT).Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
15 Jul, 2014
1 commit
-
Added udp_tunnel.c which can contain some common functions for UDP
tunnels. The first function in this is udp_sock_create which is used
to open the listener port for a UDP tunnel.Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
25 Feb, 2014
1 commit
-
This patch add an IPsec protocol multiplexer. With this
it is possible to add alternative protocol handlers as
needed for IPsec virtual tunnel interfaces.Signed-off-by: Steffen Klassert
07 Jan, 2014
1 commit
-
GRO/GSO layers can be enabled on a node, even if said
node is only forwarding packets.This patch permits GSO (and upcoming GRO) support for GRE
encapsulated packets, even if the host has no GRE tunnel setup.Signed-off-by: Eric Dumazet
Cc: H.K. Jerry Chu
Signed-off-by: David S. Miller
04 Jul, 2013
1 commit
-
Similarly to TCP/UDP offloading, move all related GRE functions to
gre_offload.c to make things more explicit and similar to the rest
of the code.Suggested-by: Eric Dumazet
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller
20 Jun, 2013
1 commit
-
Refactor various ip tunnels xmit functions and extend iptunnel_xmit()
so that there is more code sharing.Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller
12 Jun, 2013
1 commit
-
Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller
08 Jun, 2013
1 commit
-
Would be good to make things explicit and move those functions to
a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c.
While moving all related functions into tcp_offload.c, we can also
make some of them static, since they are only used there. Also, add
an explicit registration function.Suggested-by: Eric Dumazet
Signed-off-by: Daniel Borkmann
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
27 Mar, 2013
1 commit
-
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.ipip.h header is renamed to ip_tunnel.h
Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller
01 Aug, 2012
1 commit
-
Sanity:
CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Hugh Dickins
Cc: Tejun Heo
Cc: Aneesh Kumar K.V
Cc: David Rientjes
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Jul, 2012
1 commit
-
This patch impelements the common code for both the client and server.
1. TCP Fast Open option processing. Since Fast Open does not have an
option number assigned by IANA yet, it shares the experiment option
code 254 by implementing draft-ietf-tcpm-experimental-options
with a 16 bits magic number 0xF989. This enables global experiments
without clashing the scarce(2) experimental options available for TCP.When the draft status becomes standard (maybe), the client should
switch to the new option number assigned while the server supports
both numbers for transistion.2. The new sysctl tcp_fastopen
3. A place holder init function
Signed-off-by: Yuchung Cheng
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
19 Jul, 2012
1 commit
-
New VTI tunnel kernel module, Kconfig and Makefile changes.
Signed-off-by: Saurabh Mohan
Reviewed-by: Stephen Hemminger
Signed-off-by: David S. Miller
11 Jul, 2012
1 commit
-
Signed-off-by: David S. Miller
13 Dec, 2011
1 commit
-
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.Signed-off-by: Glauber Costa
Reviewed-by: KAMEZAWA Hiroyuki
CC: Eric W. Biederman
Signed-off-by: David S. Miller
10 Dec, 2011
1 commit
-
Copy-s/tcp/udp/-paste from TCP bits.
Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller
14 May, 2011
1 commit
-
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/pingFor Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).Signed-off-by: Vasiliy Kulikov
Signed-off-by: David S. Miller
02 Feb, 2011
1 commit
-
The time has finally come to remove the hash based routing table
implementation in ipv4.FIB Trie is mature, well tested, and I've done an audit of it's code
to confirm that it implements insert, delete, and lookup with the same
identical semantics as fib_hash did.If there are any semantic differences found in fib_trie, we should
simply fix them.I've placed the trie statistic config option under advanced router
configuration.Signed-off-by: David S. Miller
Acked-by: Stephen Hemminger
22 Aug, 2010
1 commit
-
PPP: introduce "pptp" module which implements point-to-point tunneling protocol using pppox framework
NET: introduce the "gre" module for demultiplexing GRE packets on version criteria
(required to pptp and ip_gre may coexists)
NET: ip_gre: update to use the "gre" moduleThis patch introduces then pptp support to the linux kernel which
dramatically speeds up pptp vpn connections and decreases cpu usage in
comparison of existing user-space implementation
(poptop/pptpclient). There is accel-pptp project
(https://sourceforge.net/projects/accel-pptp/) to utilize this module,
it contains plugin for pppd to use pptp in client-mode and modified
pptpd (poptop) to build high-performance pptp NAS.There was many changes from initial submitted patch, most important are:
1. using rcu instead of read-write locks
2. using static bitmap instead of dynamically allocated
3. using vmalloc for memory allocation instead of BITS_PER_LONG + __get_free_pages
4. fixed many coding style issues
Thanks to Eric Dumazet.Signed-off-by: Dmitry Kozlov
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
07 Oct, 2008
1 commit
-
Since IPVS now has partial IPv6 support, this patch moves IPVS from
net/ipv4/ipvs to net/netfilter/ipvs. It's a result of:$ git mv net/ipv4/ipvs net/netfilter
and adapting the relevant Kconfigs/Makefiles to the new path.
Signed-off-by: Julius Volz
Signed-off-by: Simon Horman
29 Jan, 2008
1 commit
-
This includes several cleanups:
* tune Makefile to compile out this file when SYSCTL=n. Now
it looks like net/core/sysctl_net_core.c one;
* move the ipv4_config to af_inet.c to exist all the time;
* remove additional sysctl_ip_nonlocal_bind declaration
(it is already declared in net/ip.h);
* remove no nonger needed ifdefs from this file.This is a preparation for using ctl paths for net/ipv4/
sysctl table.Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller
16 Oct, 2007
1 commit
-
There are some objects that are common in all the places
which are used to keep track of frag queues, they are:* hash table
* LRU list
* rw lock
* rnd number for hash function
* the number of queues
* the amount of memory occupied by queues
* secret timerMove all this stuff into one structure (struct inet_frags)
to make it possible use them uniformly in the future. Like
with the previous patch this mostly consists of hunks like- write_lock(&ipfrag_lock);
+ write_lock(&ip4_frags.lock);To address the issue with exporting the number of queues and
the amount of memory occupied by queues outside the .c file
they are declared in, I introduce a couple of helpers.Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller
11 Oct, 2007
1 commit
-
This patch provides generic Large Receive Offload (LRO) functionality
for IPv4/TCP traffic.LRO combines received tcp packets to a single larger tcp packet and
passes them then to the network stack in order to increase performance
(throughput). The interface supports two modes: Drivers can either
pass SKBs or fragment lists to the LRO engine.Signed-off-by: Jan-Bernd Themann
Signed-off-by: David S. Miller
11 Jul, 2007
1 commit
-
With help from Chris Wedgwood.
Signed-off-by: David S. Miller
26 Apr, 2007
4 commits
-
This patch moves the SNMP code shared between IPv4/IPv6 from proc.c
into net/ipv4/af_inet.c. This makes sense because these functions
aren't specific to /proc.As a result we can again skip proc.o if /proc is disabled.
Signed-off-by: Herbert Xu
Acked-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller -
Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller -
This is an implementation of TCP Illinois invented by Shao Liu
at University of Illinois. It is a another variant of Reno which adapts
the alpha and beta parameters based on RTT. The basic idea is to increase
window less rapidly as delay approaches the maximum. See the papers
and talks to get a more complete description.Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller -
YeAH-TCP is a sender-side high-speed enabled TCP congestion control
algorithm, which uses a mixed loss/delay approach to compute the
congestion window. It's design goals target high efficiency, internal,
RTT and Reno fairness, resilience to link loss while keeping network
elements load as low as possible.For further details look here:
http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdfSigned-off-by: Angelo P. Castellani
Signed-off-by: David S. Miller
03 Dec, 2006
1 commit
-
This is a revision of the previously submitted patch, which alters
the way files are organized and compiled in the following manner:* UDP and UDP-Lite now use separate object files
* source file dependencies resolved via header files
net/ipv{4,6}/udp_impl.h
* order of inclusion files in udp.c/udplite.c adapted
accordingly[NET/IPv4]: Support for the UDP-Lite protocol (RFC 3828)
This patch adds support for UDP-Lite to the IPv4 stack, provided as an
extension to the existing UDPv4 code:
* generic routines are all located in net/ipv4/udp.c
* UDP-Lite specific routines are in net/ipv4/udplite.c
* MIB/statistics support in /proc/net/snmp and /proc/net/udplite
* shared API with extensions for partial checksum coverage[NET/IPv6]: Extension for UDP-Lite over IPv6
It extends the existing UDPv6 code base with support for UDP-Lite
in the same manner as per UDPv4. In particular,
* UDPv6 generic and shared code is in net/ipv6/udp.c
* UDP-Litev6 specific extensions are in net/ipv6/udplite.c
* MIB/statistics support in /proc/net/snmp6 and /proc/net/udplite6
* support for IPV6_ADDRFORM
* aligned the coding style of protocol initialisation with af_inet6.c
* made the error handling in udpv6_queue_rcv_skb consistent;
to return `-1' on error on all error cases
* consolidation of shared code[NET]: UDP-Lite Documentation and basic XFRM/Netfilter support
The UDP-Lite patch further provides
* API documentation for UDP-Lite
* basic xfrm support
* basic netfilter support for IPv4 and IPv6 (LOG target)Signed-off-by: Gerrit Renker
Signed-off-by: David S. Miller
04 Oct, 2006
1 commit
-
This patch introduces the BEET mode (Bound End-to-End Tunnel) with as
specified by the ietf draft at the following link:http://www.ietf.org/internet-drafts/draft-nikander-esp-beet-mode-06.txt
The patch provides only single family support (i.e. inner family =
outer family).Signed-off-by: Diego Beltrami
Signed-off-by: Miika Komu
Signed-off-by: Herbert Xu
Signed-off-by: Abhinav Pathak
Signed-off-by: Jeff Ahrenholz
Signed-off-by: David S. Miller
23 Sep, 2006
1 commit
-
Add support for the Commercial IP Security Option (CIPSO) to the IPv4
network stack. CIPSO has become a de-facto standard for
trusted/labeled networking amongst existing Trusted Operating Systems
such as Trusted Solaris, HP-UX CMW, etc. This implementation is
designed to be used with the NetLabel subsystem to provide explicit
packet labeling to LSM developers.The CIPSO/IPv4 packet labeling works by the LSM calling a NetLabel API
function which attaches a CIPSO label (IPv4 option) to a given socket;
this in turn attaches the CIPSO label to every packet leaving the
socket without any extra processing on the outbound side. On the
inbound side the individual packet's sk_buff is examined through a
call to a NetLabel API function to determine if a CIPSO/IPv4 label is
present and if so the security attributes of the CIPSO label are
returned to the caller of the NetLabel API function.Signed-off-by: Paul Moore
Signed-off-by: David S. Miller
11 Jul, 2006
1 commit
-
This reverts: f890f921040fef6a35e39d15b729af1fd1a35f29
The inclusion of TCP Compound needs to be reverted at this time
because it is not 100% certain that this code conforms to the
requirements of Developer's Certificate of Origin 1.1 paragraph (b).Signed-off-by: David S. Miller
18 Jun, 2006
5 commits
-
This adds a new module for tracking TCP state variables non-intrusively
using kprobes. It has a simple /proc interface that outputs one line
for each packet received. A sample usage is to collect congestion
window and ssthresh over time graphs.Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller -
TCP Compound is a sender-side only change to TCP that uses
a mixed Reno/Vegas approach to calculate the cwnd.For further details look here:
ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdfSigned-off-by: Angelo P. Castellani
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller -
TCP Veno module is a new congestion control module to improve TCP
performance over wireless networks. The key innovation in TCP Veno is
the enhancement of TCP Reno/Sack congestion control algorithm by using
the estimated state of a connection based on TCP Vegas. This scheme
significantly reduces "blind" reduction of TCP window regardless of
the cause of packet loss.This work is based on the research paper "TCP Veno: TCP Enhancement
for Transmission over Wireless Access Networks." C. P. Fu, S. C. Liew,
IEEE Journal on Selected Areas in Communication, Feb. 2003.Original paper and many latest research works on veno:
http://www.ntu.edu.sg/home/ascpfu/veno/veno.htmlSigned-off-by: Bin Zhou
Cheng Peng Fu
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller -
TCP Low Priority is a distributed algorithm whose goal is to utilize only
the excess network bandwidth as compared to the ``fair share`` of
bandwidth as targeted by TCP. Available from:
http://www.ece.rice.edu/~akuzma/Doc/akuzma/TCP-LP.pdfOriginal Author:
Aleksandar KuzmanovicSee http://www-ece.rice.edu/networks/TCP-LP/ for their implementation.
As of 2.6.13, Linux supports pluggable congestion control algorithms.
Due to the limitation of the API, we take the following changes from
the original TCP-LP implementation:
o We use newReno in most core CA handling. Only add some checking
within cong_avoid.
o Error correcting in remote HZ, therefore remote HZ will be keeped
on checking and updating.
o Handling calculation of One-Way-Delay (OWD) within rtt_sample, sicne
OWD have a similar meaning as RTT. Also correct the buggy formular.
o Handle reaction for Early Congestion Indication (ECI) within
pkts_acked, as mentioned within pseudo code.
o OWD is handled in relative format, where local time stamp will in
tcp_time_stamp format.Port from 2.4.19 to 2.6.16 as module by:
Wong Hoi Sing Edison
Hung Hing LunSigned-off-by: Wong Hoi Sing Edison
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller -
This patch adds the structure xfrm_mode. It is meant to represent
the operations carried out by transport/tunnel modes.By doing this we allow additional encapsulation modes to be added
without clogging up the xfrm_input/xfrm_output paths.Candidate modes include 4-to-6 tunnel mode, 6-to-4 tunnel mode, and
BEET modes.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
29 Mar, 2006
1 commit
-
Basically this patch moves the generic tunnel protocol stuff out of
xfrm4_tunnel/xfrm6_tunnel and moves it into the new files of tunnel4.c
and tunnel6 respectively.The reason for this is that the problem that Hugo uncovered is only
the tip of the iceberg. The real problem is that when we removed the
dependency of ipip on xfrm4_tunnel we didn't really consider the module
case at all.For instance, as it is it's possible to build both ipip and xfrm4_tunnel
as modules and if the latter is loaded then ipip simply won't load.After considering the alternatives I've decided that the best way out of
this is to restore the dependency of ipip on the non-xfrm-specific part
of xfrm4_tunnel. This is acceptable IMHO because the intention of the
removal was really to be able to use ipip without the xfrm subsystem.
This is still preserved by this patch.So now both ipip/xfrm4_tunnel depend on the new tunnel4.c which handles
the arbitration between the two. The order of processing is determined
by a simple integer which ensures that ipip gets processed before
xfrm4_tunnel.The situation for ICMP handling is a little bit more complicated since
we may not have enough information to determine who it's for. It's not
a big deal at the moment since the xfrm ICMP handlers are basically
no-ops. In future we can deal with this when we look at ICMP caching
in general.The user-visible change to this is the removal of the TUNNEL Kconfig
prompts. This makes sense because it can only be used through IPCOMP
as it stands.The addition of the new modules shouldn't introduce any problems since
module dependency will cause them to be loaded.Oh and I also turned some unnecessary pskb's in IPv6 related to this
patch to skb's.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
11 Jan, 2006
1 commit
-
Don't wrap entire file in #ifdef CONFIG_NETFILTER, remove a few
unneccessary includes.Signed-off-by: Patrick McHardy
Signed-off-by: David S. Miller