23 Apr, 2013
1 commit
-
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c

The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.

The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.

An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.

Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.

Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.

Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.

Signed-off-by: David S. Miller
17 Apr, 2013
1 commit
-
Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path)
added a bug in IP defragmentation handling, as non refcounted
dst could escape an RCU protected section.

Commit 64f3b9e203bd068 (net: ip_expire() must revalidate route) fixed
the case of timeouts, but not the general problem.

Tom Parkin noticed crashes in UDP stack and provided a patch,
but further analysis permitted us to pinpoint the root cause.

Before queueing a packet into a frag list, we must drop its dst,
as this dst has limited lifetime (RCU protected).

When/if a packet is finally reassembled, we use the dst of the very
last skb, still protected by RCU and valid, as the dst of the
reassembled packet.

Use same logic in IPv6, as there is no need to hold dst references.

Reported-by: Tom Parkin
Tested-by: Tom Parkin
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
25 Mar, 2013
1 commit
-
This patch just moves some code around to make the ip4_frag_ecn_table
and IPFRAG_ECN_* constants accessible from the other reassembly engines. I
also renamed ip4_frag_ecn_table to ip_frag_ecn_table.

Cc: Eric Dumazet
Cc: Jesper Dangaard Brouer
Cc: YOSHIFUJI Hideaki
Signed-off-by: Hannes Frederic Sowa
Acked-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller
19 Mar, 2013
1 commit
-
This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is chosen somewhat
arbitrarily and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.

If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.

I dropped the out of memory warning in the ipv4 fragment lookup path,
because we already get a warning by the slab allocator.

Cc: Eric Dumazet
Cc: Jesper Dangaard Brouer
Signed-off-by: Hannes Frederic Sowa
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
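
A minimal sketch of the bucket depth limit described in this commit message. It is not the kernel code: `FragTable`, `MAX_DEPTH`, and `find` are hypothetical names chosen for illustration, and only the limit of 128 and the idea that the caller decides how to warn are taken from the message.

```python
MAX_DEPTH = 128  # bucket list length limit quoted in the commit message


class FragTable:
    """Toy hash table of fragment queues with a per-bucket depth limit."""

    def __init__(self, buckets=64):
        self.buckets = [[] for _ in range(buckets)]

    def find(self, key):
        """Return (queue, depth_exceeded) for the given lookup key.

        An existing queue is returned if present. Otherwise a new queue is
        created, unless the bucket already holds MAX_DEPTH entries, in
        which case (None, True) is returned and the caller may warn.
        """
        chain = self.buckets[hash(key) % len(self.buckets)]
        for queue in chain:
            if queue["key"] == key:
                return queue, False
        if len(chain) >= MAX_DEPTH:
            return None, True
        queue = {"key": key, "fragments": []}
        chain.append(queue)
        return queue, False
```

Because `find` only reports that the limit was hit, each caller (IPv4, IPv6, and so on) can print its own warning, which mirrors the caller-side choice the message describes.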
16 Feb, 2013
1 commit
-
This function will be used in next GRE_GSO patch. This patch does
not change any functionality.

Signed-off-by: Pravin B Shelar
Acked-by: Eric Dumazet
30 Jan, 2013
2 commits
-
Updating the fragmentation queues LRU (Least-Recently-Used) list,
required taking the hash writer lock. However, the LRU list isn't
tied to the hash at all, so we can use a separate lock for it.

Original-idea-by: Florian Westphal
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: David S. Miller
-
This change is primarily a preparation to ease the extension of memory
limit tracking.

The change does reduce the number of atomic operations during freeing of
a frag queue. This introduces some performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: David S. Miller
18 Jan, 2013
1 commit
-
Increase the amount of memory usage limits for incomplete
IP fragments.

Arguing for new thresh high/low values:

High threshold = 4 MBytes
Low threshold = 3 MBytes

The fragmentation memory accounting code tries to account for the
real memory usage, by measuring both the size of frag queue struct
(inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's truesize.

We want to be able to handle/hold-on-to enough fragments, to ensure
good performance, without causing incomplete fragments to hurt
scalability, by causing the number of inet_frag_queue to grow too much
(resulting in longer searches for frag queues).

For IPv4, how much memory does the largest frag consume?

Maximum size fragment is 64K, which is approx 44 fragments with
MTU(1500) sized packets. Sizeof(struct ipq) is 200. A 1500 byte
packet results in a truesize of 2944 (not 2048 as I first assumed).

(44*2944)+200 = 129736 bytes

The current default high thresh of 262144 bytes is obviously
problematic, as only two 64K fragments can fit in the queue at the
same time.

How many 64K fragments can we fit into 4 MBytes?

4*2^20/((44*2944)+200) = 32.33 fragments in queues

An attacker could send separate/distinct fake fragment packets per
queue, causing us to allocate one inet_frag_queue per packet, and thus
attacking the hash table and its lists.

How many frag queues do we need to store, and given a current hash size
of 64, what is the average list length?

Using one MTU sized fragment per inet_frag_queue, each consuming
(2944+200) 3144 bytes.

4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length

An attacker could send small fragments; the smallest packet I could send
resulted in a truesize of 896 bytes (I'm a little surprised by this).

4*2^20/(896+200) = 3827 frag queues -> 59 avg list length

When increasing these numbers, we also need to follow up with
improvements that are going to help scalability. Simply increasing
the hash size is not enough, as the current implementation does not
have per hash bucket locking.

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: David S. Miller
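
The arithmetic in this commit message can be reproduced with a short script. The constants are the figures quoted in the message (struct size 200, truesize 2944 or 896, 44 fragments per 64K datagram, hash size 64); the function and variable names are hypothetical and exist only for this sketch.

```python
HIGH_THRESH = 4 * 2**20   # proposed high threshold in bytes
OLD_THRESH = 262144       # previous default high threshold in bytes
QUEUE_STRUCT = 200        # sizeof(struct ipq) as quoted in the message
HASH_BUCKETS = 64         # hash size quoted in the message


def queue_cost(fragments, truesize):
    """Bytes accounted for one frag queue holding `fragments` skbs."""
    return fragments * truesize + QUEUE_STRUCT


def queues_that_fit(cost, limit=HIGH_THRESH):
    """How many queues of the given cost fit under the limit."""
    return limit / cost


full_datagram = queue_cost(44, 2944)   # one complete 64K datagram
one_mtu_frag = queue_cost(1, 2944)     # one MTU sized fragment per queue
one_small_frag = queue_cost(1, 896)    # smallest fragment per queue

for label, cost in [("64K datagram", full_datagram),
                    ("one MTU fragment", one_mtu_frag),
                    ("one small fragment", one_small_frag)]:
    queues = queues_that_fit(cost)
    print(f"{label}: {cost} bytes/queue, {queues:.2f} queues, "
          f"{queues / HASH_BUCKETS:.1f} avg list length")
```

The average list length only matters for the one-fragment-per-queue cases, where an attacker forces one queue allocation per packet.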
13 Dec, 2012
1 commit
-
Pull networking changes from David Miller:
1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.

2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.

3) Networking user namespace support from Eric W. Biederman.

4) tuntap/virtio-net multiqueue support by Jason Wang.

5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.

6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.

7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.

8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.

9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.

10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heuristics were chosen a decade ago.
From Eric Dumazet.

11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.

12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.

13) Add "oops_only" mode to netconsole, from Amerigo Wang.

14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.

15) Support PTP in the Tigon3 driver, from Matt Carlson.

16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.

17) Support per-association statistics in SCTP, from Michele
Baldessari.

And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...
11 Dec, 2012
1 commit
-
ip_check_defrag() might be called from af_packet within the
RX path where shared SKBs are used, so it must not modify
the input SKB before it has unshared it for defragmentation.
Use skb_copy_bits() to get the IP header and only pull in
everything later.

The same is true for the other caller in macvlan as it is
called from dev->rx_handler which can also get a shared SKB.

Reported-by: Eric Leblond
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller
19 Nov, 2012
1 commit
-
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.

This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller
20 Sep, 2012
1 commit
-
Cc: Herbert Xu
Cc: Michal Kubeček
Cc: David Miller
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
27 Aug, 2012
1 commit
-
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
(in case of forwarded packets) refragments them at POST_ROUTING
independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
the local route without caring about the original fragment sizes,
thereby breaking PMTUD.

This patch fixes this by keeping track of the largest received fragment
with IP_DF set and generates an ICMP fragmentation required error during
refragmentation if that size exceeds the MTU.

Signed-off-by: Patrick McHardy
Acked-by: Eric Dumazet
Acked-by: David S. Miller
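
The bookkeeping described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the netfilter or IP stack code: `ReassemblyQueue`, `max_df_size`, and `refragment_action` are hypothetical names, and the two return strings stand in for "send ICMP fragmentation required" and "refragment normally".

```python
class ReassemblyQueue:
    """Tracks the largest fragment received with the DF flag set."""

    def __init__(self):
        self.max_df_size = 0  # 0 means no DF fragment has been seen

    def add_fragment(self, size, df):
        # Only fragments carrying DF constrain later refragmentation.
        if df and size > self.max_df_size:
            self.max_df_size = size


def refragment_action(queue, mtu):
    """Decide what to do when the reassembled packet is forwarded."""
    if queue.max_df_size > mtu:
        # The sender asked not to fragment at this size, so report the
        # smaller MTU instead of silently refragmenting.
        return "icmp_frag_needed"
    return "fragment"
```

For example, a queue that saw a 1500-byte DF fragment and is forwarded over a link with MTU 1400 yields `icmp_frag_needed`, which lets path MTU discovery at the sender keep working.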
27 Jul, 2012
1 commit
-
With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck
Signed-off-by: David S. Miller
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
21 Jul, 2012
1 commit
-
The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller
28 Jun, 2012
2 commits
-
This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.
This change has several unwanted side effects:
1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
cache at all.

Signed-off-by: David S. Miller
-
DDOS synflood attacks hit the IP route cache badly.
On typical machines, this cache is allowed to hold up to 8 million dst
entries, 256 bytes for each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.
Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE).

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet
Cc: Hans Schillstrom
Signed-off-by: David S. Miller
10 Jun, 2012
1 commit
-
Otherwise we reference potentially non-existing members when
ipv6 is disabled.

Signed-off-by: David S. Miller
09 Jun, 2012
1 commit
-
Add struct net as a parameter of inet_getpeer_v[4,6], and
use net to replace &init_net.

Also modify some places to provide net for inet_getpeer_v[4,6].

Signed-off-by: Gao feng
Signed-off-by: David S. Miller
20 May, 2012
1 commit
-
ip_frag_reasm() can use skb_try_coalesce() to build optimized skb,
reducing memory used by them (truesize), and reducing number of cache
line misses and overhead for the consumer.

Signed-off-by: Eric Dumazet
Cc: Alexander Duyck
Signed-off-by: David S. Miller
18 May, 2012
1 commit
-
- match() method returns a boolean
- return (A && B && C && D) -> return A && B && C && D
- fix indentation

Signed-off-by: Eric Dumazet
16 May, 2012
1 commit
-
Standardize the net core ratelimited logging functions.
Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller
21 Apr, 2012
2 commits
-
This results in code with less boiler plate that is a bit easier
to read.

Additionally stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.

Signed-off-by: Eric W. Biederman
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller
-
register_sysctl_rotable never caught on as an interesting way to
register sysctls. My take on the situation is that what we want are
sysctls that we can only see in the initial network namespace. What we
have implemented with register_sysctl_rotable are sysctls that we can
see in all of the network namespaces and can only change in the initial
network namespace.

That is a very silly way to go. Just register the network sysctls
in the initial network namespace and we don't have any weird special
cases to deal with.

The sysctls affected are:
/proc/sys/net/ipv4/ipfrag_secret_interval
/proc/sys/net/ipv4/ipfrag_max_dist
/proc/sys/net/ipv6/ip6frag_secret_interval
/proc/sys/net/ipv6/mld_max_msf

I really don't expect anyone will miss them if they can't read them in a
child user namespace.

CC: Pavel Emelyanov
Signed-off-by: Eric W. Biederman
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller
20 Apr, 2012
1 commit
-
When defragmentation is finalized, we clone a packet and kfree_skb() it.
Call consume_skb() to not confuse dropwatch, since it's not a drop.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
13 Mar, 2012
1 commit
-
Add #define pr_fmt(fmt) as appropriate.
Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller
12 Mar, 2012
1 commit
-
Use a more current kernel messaging style.
Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with _close are
now prefixed with _fini. Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
   text    data     bss     dec     hex filename
 887888   31558  249696 1169142  11d6f6 net/ipv4/built-in.o.new
 887934   31558  249800 1169292  11d78c net/ipv4/built-in.o.old

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller
02 Dec, 2011
1 commit
-
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.

Signed-off-by: Justin P. Mattock
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina
19 Oct, 2011
2 commits
-
To ease skb->truesize sanitization, it's better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Fragmented multicast frames are delivered to a single macvlan port,
because ip defrag logic considers other samples are redundant.

Implement a defrag step before trying to send the multicast frame.

Reported-by: Ben Greear
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
06 Jul, 2011
1 commit
-
Elide the ICMP on frag queue timeouts unconditionally for
this user.

Signed-off-by: David S. Miller
18 May, 2011
1 commit
-
Noticed by Joe Perches.
Signed-off-by: David S. Miller
17 May, 2011
1 commit
-
Commit 6623e3b24a5e (ipv4: IP defragmentation must be ECN aware) was an
attempt to not lose "Congestion Experienced" (CE) indications when
performing datagram defragmentation.

Stefanos Harhalakis raised the point that RFC 3168 requirements were not
completely met by this commit.

In particular, we MUST detect invalid combinations and eventually drop
illegal frames.

Reported-by: Stefanos Harhalakis
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
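
One way to detect an invalid combination is to OR together a flag for every ECN codepoint seen across the fragments and then classify the result. The sketch below assumes that a datagram mixing CE with Not-ECT fragments is the invalid case, which follows RFC 3168 section 5.3; the flag names, `summarize`, and `verdict` are hypothetical and are not the kernel's constants or table.

```python
# ECN field values from RFC 3168 (two low-order bits of the TOS byte).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

# Hypothetical "seen" flags, one bit per codepoint.
SEEN = {NOT_ECT: 1, ECT_1: 2, ECT_0: 4, CE: 8}


def summarize(fragment_ecns):
    """OR together one flag for each codepoint present in the fragments."""
    seen = 0
    for ecn in fragment_ecns:
        seen |= SEEN[ecn]
    return seen


def verdict(seen):
    """Classify the combination of codepoints seen during reassembly."""
    if seen & SEEN[CE] and seen & SEEN[NOT_ECT]:
        return "drop"      # CE must not be set when a fragment was Not-ECT
    if seen & SEEN[CE]:
        return "set_ce"    # preserve the congestion indication
    return "keep"          # nothing to propagate
```

Accumulating flags per fragment keeps the per-packet work to a single OR, and the final decision is made once when the datagram is complete.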
05 May, 2011
1 commit
-
Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path)
added a bug in IP defragmentation handling, in case timeout is fired.

When a frame is defragmented, we use last skb dst field when building
final skb. Its dst is valid, since we are in rcu read section.

But if a timeout occurs, we take first queued fragment to build one ICMP
TIME EXCEEDED message. Problem is all queued skb have weak dst pointers,
since we escaped RCU critical section after their queueing. icmp_send()
might dereference a now freed (and possibly reused) part of memory.

Calling skb_dst_drop() and ip_route_input_noref() to revalidate route is
the only possible choice.

Reported-by: Denys Fedoryshchenko
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
07 Jan, 2011
1 commit
-
RFC3168 (The Addition of Explicit Congestion Notification to IP)
states :

5.3. Fragmentation

ECN-capable packets MAY have the DF (Don't Fragment) bit set.
Reassembly of a fragmented packet MUST NOT lose indications of
congestion. In other words, if any fragment of an IP packet to be
reassembled has the CE codepoint set, then one of two actions MUST be
taken:

* Set the CE codepoint on the reassembled packet. However, this
  MUST NOT occur if any of the other fragments contributing to
  this reassembly carries the Not-ECT codepoint.

* The packet is dropped, instead of being reassembled, for any
  other reason.

This patch implements this requirement for IPv4, choosing the first
action :

If one fragment had NO-ECT codepoint
    reassembled frame has NO-ECT
ElIf one fragment had CE codepoint
    reassembled frame has CE

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
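
The two-branch rule stated in this commit message translates directly into a small function. The codepoint values are those defined by RFC 3168; `reassembled_ecn` is a hypothetical name, and the fallback of keeping the first fragment's codepoint when neither branch applies is an assumption of this sketch, since the message does not cover that case.

```python
# ECN field values from RFC 3168 (two low-order bits of the TOS byte).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def reassembled_ecn(fragment_ecns):
    """Pick the ECN codepoint for a datagram rebuilt from fragments."""
    ecns = list(fragment_ecns)
    if not ecns:
        raise ValueError("at least one fragment is required")
    if NOT_ECT in ecns:
        return NOT_ECT   # first branch: any Not-ECT fragment wins
    if CE in ecns:
        return CE        # second branch: keep the congestion indication
    return ecns[0]       # assumption: otherwise keep the first codepoint
```

For example, fragments marked ECT(0), CE, ECT(0) reassemble to CE, so a congestion mark on a single fragment is not lost.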
01 Dec, 2010
1 commit
-
And make an inet_getpeer_v4() helper, update callers.
Signed-off-by: David S. Miller
24 Sep, 2010
1 commit
-
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
23 Aug, 2010
1 commit
-
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, its name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller
13 Jul, 2010
1 commit
-
CodingStyle cleanups
EXPORT_SYMBOL should immediately follow the symbol declaration.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
01 Jul, 2010
1 commit
-
add fast path for in-order fragments
As the fragments are sent in order in most of OSes, such as Windows, Darwin and
FreeBSD, it is likely the new fragments are at the end of the inet_frag_queue.
In the fast path, we check if the skb at the end of the inet_frag_queue is the
prev we expect.

Signed-off-by: Changli Gao
----
include/net/inet_frag.h | 1 +
net/ipv4/ip_fragment.c | 12 ++++++++++++
net/ipv6/reassembly.c | 11 +++++++++++
3 files changed, 24 insertions(+)
Signed-off-by: David S. Miller
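
The fast path described in this commit message can be sketched with a list kept sorted by fragment offset. This is a simplified illustration, not the code in ip_fragment.c: `insert_fragment` is a hypothetical name, the queue is a plain Python list of `(offset, payload)` tuples, and overlap handling is left out.

```python
def insert_fragment(queue, offset, payload):
    """Insert a fragment into a queue sorted by offset.

    Returns True when the fast path was taken, meaning the fragment
    belongs after the current tail and no list walk was needed.
    """
    # Fast path: empty queue, or the new fragment starts after the tail.
    if not queue or queue[-1][0] < offset:
        queue.append((offset, payload))
        return True

    # Slow path: walk from the head to find the insertion point.
    index = 0
    while index < len(queue) and queue[index][0] < offset:
        index += 1
    queue.insert(index, (offset, payload))
    return False
```

With in-order senders every call takes the fast path, so inserting a fragment costs one comparison against the tail instead of a walk over the whole queue.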