Eric Lee / smarc-fsl-linux-kernel

31 Jan, 2019

1 commit

66a011d15 net: Fix usage of pskb_trim_rcsum

[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]

In certain cases, pskb_trim_rcsum() may change skb pointers.
Reinitialize header pointers afterwards to avoid potential
use-after-frees. Add a note in the documentation of
pskb_trim_rcsum(). Found by KASAN.

Signed-off-by: Ross Lagerwall
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Ross Lagerwall
2019-01-31 15:13:41 +0800
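
The hazard described in 66a011d15 can be illustrated outside the kernel. The following is a minimal userspace sketch, not kernel code: `struct buf`, `buf_hdr()`, `buf_trim()` and `trim_and_read_first()` are hypothetical stand-ins for an skb, a header accessor such as `ip_hdr()`, and `pskb_trim_rcsum()`. The trim helper deliberately moves the data to a new allocation, so a header pointer taken before the call is stale and has to be derived again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: owns a heap buffer of `len` bytes. */
struct buf {
    unsigned char *data;
    size_t len;
};

/* Hypothetical header accessor, analogous to ip_hdr(skb). */
unsigned char *buf_hdr(struct buf *b)
{
    return b->data;
}

/*
 * Hypothetical trim helper. Like pskb_trim_rcsum(), it may move the
 * data; here it always does, to make the hazard visible.
 * Returns 0 on success, -1 on a bad length or allocation failure.
 */
int buf_trim(struct buf *b, size_t new_len)
{
    unsigned char *copy;

    assert(b != NULL);
    if (new_len == 0 || new_len > b->len)
        return -1;
    copy = malloc(new_len);
    if (!copy)
        return -1;
    memcpy(copy, b->data, new_len);
    free(b->data);              /* pointers into the old buffer now dangle */
    b->data = copy;
    b->len = new_len;
    return 0;
}

/* Trims the buffer, then reads the first header byte via a reloaded pointer. */
int trim_and_read_first(struct buf *b, size_t new_len)
{
    unsigned char *hdr = buf_hdr(b);    /* valid only until buf_trim() */

    if (buf_trim(b, new_len) < 0)
        return -1;
    hdr = buf_hdr(b);                   /* reinitialize after the trim */
    return hdr[0];
}
```

Dropping the second `buf_hdr()` call would leave `hdr` pointing at freed memory, which is the class of bug KASAN reports as a use-after-free.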

01 Oct, 2017

1 commit

7487449c8 IPv4: early demux can return an error code

Currently no error is emitted, but this infrastructure will be
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2017-10-01 10:55:47 +0800

25 Mar, 2017

1 commit

dddb64bcb net: Add sysctl to toggle early demux for tcp and udp

Certain systems process significant unconnected UDP workloads.
It would be preferable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system:
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecture and cache
sizes. There will not be much difference if the entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan
Suggested-by: Eric Dumazet
Cc: Stephen Hemminger
Cc: Tom Herbert
Cc: David Miller
Signed-off-by: David S. Miller

subashab@codeaurora.org
2017-03-25 04:17:07 +0800
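
The v1->v4 notes in dddb64bcb describe two ideas: toggle a function pointer rather than test a flag per packet, and read that pointer once in the caller. A minimal userspace sketch of that pattern follows. All names (`struct proto_handler`, `set_early_demux()`, `receive()`) are hypothetical, and the volatile read is only an assumed stand-in for the kernel's read-once helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet type; only what the sketch needs. */
struct pkt {
    int demuxed;
};

typedef void (*early_demux_fn)(struct pkt *);

/* Hypothetical per-protocol handler entry. */
struct proto_handler {
    early_demux_fn early_demux;          /* pointer the sysctl rewrites */
    early_demux_fn early_demux_handler;  /* the real implementation */
};

/* Hypothetical UDP early demux implementation. */
void udp_early_demux_sketch(struct pkt *p)
{
    p->demuxed = 1;
}

/* Sysctl write path: swap the pointer instead of adding a conditional. */
void set_early_demux(struct proto_handler *h, int enabled)
{
    assert(h != NULL);
    h->early_demux = enabled ? h->early_demux_handler : NULL;
}

/*
 * Receive path: read the pointer once into a local, then test and call
 * that same value. Re-reading h->early_demux after the NULL test could
 * observe a concurrent sysctl write and call through NULL.
 */
void receive(struct proto_handler *h, struct pkt *p)
{
    early_demux_fn edemux = *(early_demux_fn volatile *)&h->early_demux;

    if (edemux)
        edemux(p);
}
```

Storing the result in `edemux` and using only that local mirrors the v3->v4 note about not querying the pointer a second time.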

16 Sep, 2016

1 commit

d6f64d725 net: VRF: Pass original iif to ip_route_input()

The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb->dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson
Acked-by: David Ahern
Signed-off-by: David S. Miller

Mark Tomlinson
2016-09-16 16:24:07 +0800
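
The fix in d6f64d725 amounts to saving the ingress device before the VRF switch and using the saved value for later checks. A minimal userspace sketch, assuming hypothetical types and helpers (`struct dev`, `l3mdev_switch()`, `may_forward()`, `rcv_finish()`) that only model the behavior described in the commit message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device: `master` is non-NULL when enslaved to a VRF. */
struct dev {
    bool has_ip_addr;
    struct dev *master;
};

struct pkt {
    struct dev *dev;
};

/* Stand-in for l3mdev_ip_rcv(): switch the packet to the VRF master. */
void l3mdev_switch(struct pkt *p)
{
    if (p->dev->master)
        p->dev = p->dev->master;
}

/* Stand-in for the forwarding checks: no address on ingress means drop. */
bool may_forward(const struct dev *ingress)
{
    return ingress->has_ip_addr;
}

/* Remember the ingress device before the switch and check against it. */
bool rcv_finish(struct pkt *p)
{
    struct dev *orig;

    assert(p != NULL && p->dev != NULL);
    orig = p->dev;          /* saved before p->dev is replaced */
    l3mdev_switch(p);
    return may_forward(orig);
}
```

Passing `p->dev` to `may_forward()` after the switch would reproduce the bug: the check would see the VRF master rather than the unaddressed ingress interface.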

12 May, 2016

2 commits

0b922b7a8 net: original ingress device index in PKTINFO

Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.

The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-05-12 07:31:40 +0800
74b20582a net: l3mdev: Add hook in ip and ipv6

Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.

This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.

dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-05-12 07:31:40 +0800

28 Apr, 2016

4 commits

02a1d6e7a net: rename NET_{ADD|INC}_STATS_BH()

Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-28 10:48:24 +0800
b15084ec7 net: rename IP_UPD_PO_STATS_BH()

Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-28 10:48:24 +0800
98f619957 net: rename IP_ADD_STATS_BH()

Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-28 10:48:24 +0800
b45386efa net: rename IP_INC_STATS_BH()

Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-28 10:48:23 +0800

17 Feb, 2016

1 commit

e21145a98 ipv4: namespacify ip_early_demux sysctl knob

Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller

Nikolay Borisov
2016-02-17 09:42:54 +0800

11 Feb, 2016

1 commit

12b74dfad ipv4: add option to drop unicast encapsulated in L2 multicast

In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.

Reviewed-by: Julian Anastasov
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2016-02-11 17:27:35 +0800
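
The rule introduced by 12b74dfad can be written as a small predicate: drop when the frame arrived as link-layer multicast or broadcast but the IPv4 destination is unicast. The sketch below is a simplified userspace model. `enum l2_type` and both function names are hypothetical, and directed (subnet) broadcast addresses are deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-layer classification of the received frame. */
enum l2_type { L2_HOST, L2_BROADCAST, L2_MULTICAST };

/*
 * True for 224.0.0.0/4 (multicast) and 255.255.255.255 (limited
 * broadcast). The address is given in host byte order.
 */
bool ipv4_is_multi_or_bcast(uint32_t daddr)
{
    return (daddr >> 28) == 0xE || daddr == 0xFFFFFFFFu;
}

/* Decide whether the packet should be dropped under the new option. */
bool should_drop(enum l2_type l2, uint32_t daddr,
                 bool drop_unicast_in_l2_multicast)
{
    assert(l2 == L2_HOST || l2 == L2_BROADCAST || l2 == L2_MULTICAST);
    if (!drop_unicast_in_l2_multicast)
        return false;           /* option disabled */
    if (l2 == L2_HOST)
        return false;           /* ordinary unicast frame */
    return !ipv4_is_multi_or_bcast(daddr);
}
```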

30 Jan, 2016

1 commit

63e51b6a2 ipv4: early demux should be aware of fragments

We should not assume a valid protocol header is present,
as this is not the case for IPv4 fragments.

Let's avoid extra cache line misses and potential bugs
if we actually find a socket and incorrectly use its dst.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-01-30 04:14:20 +0800
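
The fragment test behind 63e51b6a2 can be sketched in userspace as below. An IPv4 packet is a fragment when the "more fragments" flag is set or the fragment offset is non-zero; only unfragmented packets are guaranteed to start with a transport header. The function names are hypothetical, and the sketch reads the flags/offset field directly from bytes 6 and 7 of the header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IP_MF      0x2000u  /* "more fragments" flag */
#define IP_OFFSET  0x1FFFu  /* fragment offset mask */

/* `iph` points at an IPv4 header of at least 20 bytes, in wire format. */
bool ipv4_is_fragment(const uint8_t *iph)
{
    uint16_t frag_off;

    assert(iph != NULL);
    /* Bytes 6..7 hold flags and offset in network (big-endian) order. */
    frag_off = (uint16_t)((iph[6] << 8) | iph[7]);
    return (frag_off & (IP_MF | IP_OFFSET)) != 0;
}

/* Early demux is only attempted when enabled and the packet is whole. */
bool may_early_demux(bool sysctl_enabled, const uint8_t *iph)
{
    return sysctl_enabled && !ipv4_is_fragment(iph);
}
```

The "don't fragment" bit (0x4000) is not part of the mask, so packets that only set DF still qualify for early demux.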

13 Oct, 2015

2 commits

19bcf9f20 ipv4: Pass struct net into ip_defrag and ip_check_defrag

The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.

So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.

Signed-off-by: "Eric W. Biederman"
Acked-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller

Eric W. Biederman
2015-10-13 10:44:16 +0800
37fcbab61 ipv4: Only compute net once in ip_call_ra_chain

ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.

Compute the network namespace and store it in a variable to make the
code clearer.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2015-10-13 10:44:14 +0800

18 Sep, 2015

4 commits

0c4b51f00 netfilter: Pass net into okfn

This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done its job, passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2015-09-18 08:18:37 +0800
29a26a568 netfilter: Pass struct net into the netfilter hooks

Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known, which allows the network namespace to
be determined easily and reliably.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in than the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2015-09-18 08:18:37 +0800
38184b3b0 ipv4: Only compute net once in ip_rcv_finish

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2015-09-18 08:18:34 +0800
e707766ce ipv4: Compute net once in ip_rcv

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2015-09-18 08:18:33 +0800

22 Jul, 2015

1 commit

f38a9eb1f dst: Metadata destinations

Introduces a new dst_metadata which makes it possible to carry per-packet
metadata between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.

Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller

Thomas Graf
2015-07-22 01:39:05 +0800

08 Apr, 2015

1 commit

7026b1ddb netfilter: Pass socket pointer down through okfn().

On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one that encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller

David Miller
2015-04-08 03:25:55 +0800

04 Apr, 2015

2 commits

00db41243 ipv4: coding style: comparison for inequality with NULL

The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: David S. Miller

Ian Morris
2015-04-04 00:11:15 +0800
51456b291 ipv4: coding style: comparison for equality with NULL

The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: David S. Miller

Ian Morris
2015-04-04 00:11:15 +0800

28 Jan, 2014

1 commit

a452ce345 net: Fix memory leak if TPROXY used with TCP early demux

I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

unreferenced object 0xffff88008cba4a40 (size 1696):
comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
hex dump (first 32 bytes):
0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] kmem_cache_alloc+0xad/0xb9
[] sk_prot_alloc+0x29/0xc5
[] sk_clone_lock+0x14/0x283
[] inet_csk_clone_lock+0xf/0x7b
[] netlink_broadcast+0x14/0x16
[] tcp_create_openreq_child+0x1b/0x4c3
[] tcp_v4_syn_recv_sock+0x38/0x25d
[] tcp_check_req+0x25c/0x3d0
[] tcp_v4_do_rcv+0x287/0x40e
[] ip_route_input_noref+0x843/0xa55
[] tcp_v4_rcv+0x4c9/0x725
[] ip_local_deliver_finish+0xe9/0x154
[] __netif_receive_skb+0x4b2/0x514
[] process_backlog+0xee/0x1c5
[] net_rx_action+0xa7/0x200
[] add_interrupt_randomness+0x39/0x157

But there are many more, resulting in the machine going OOM after some
days.

From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():

    void tcp_v4_early_demux(struct sk_buff *skb)
    {
        /* ... */

        iph = ip_hdr(skb);
        th = tcp_hdr(skb);

        if (th->doff < sizeof(struct tcphdr) / 4)
            return;

        sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                                       iph->saddr, th->source,
                                       iph->daddr, ntohs(th->dest),
                                       skb->skb_iif);
        if (sk) {
            skb->sk = sk;

where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it. This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target. This then results
in the leak I see.

The very same issue seems to exist with IPv6, but I haven't tested it.

Reviewed-by: Florian Westphal
Signed-off-by: Holger Eitzenberger
Signed-off-by: David S. Miller

Holger Eitzenberger
2014-01-28 08:22:11 +0800
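
The leak described in a452ce345 comes from overwriting a socket pointer that already holds a reference. The commit message does not spell out the fix, so the sketch below shows one assumed way to avoid the leak: skip early demux when the packet already carries a socket. `struct sock`, `struct pkt`, `lookup_established()` and `early_demux()` are hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical socket with a plain reference counter. */
struct sock {
    int refcnt;
};

struct pkt {
    struct sock *sk;
};

/* Stand-in for a lookup that returns a socket with a reference held. */
struct sock *lookup_established(struct sock *candidate)
{
    if (candidate)
        candidate->refcnt++;
    return candidate;
}

/* Early demux that leaves an already assigned socket alone. */
void early_demux(struct pkt *p, struct sock *candidate)
{
    struct sock *sk;

    assert(p != NULL);
    if (p->sk)
        return;     /* e.g. assigned earlier by the TPROXY target */
    sk = lookup_established(candidate);
    if (sk)
        p->sk = sk;
}
```

Without the `p->sk` check, the lookup would take a new reference and the assignment would discard the pointer to the earlier socket, so its reference could never be released.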

09 Aug, 2013

1 commit

1f07d03e2 net: add SNMP counters tracking incoming ECN bits

With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
counters do not count the number of frames, but number of aggregated
segments.

It's probably too late to change this now.

This patch adds four new counters, tracking number of frames, regardless
of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

Ip[6]NoECTPkts : Number of packets received with NOECT
Ip[6]ECT1Pkts : Number of packets received with ECT(1)
Ip[6]ECT0Pkts : Number of packets received with ECT(0)
Ip[6]CEPkts : Number of packets received with Congestion Experienced

lph37:~# nstat | egrep "Pkts|InReceive"
IpInReceives 1634137 0.0
Ip6InReceives 3714107 0.0
Ip6InNoECTPkts 19205 0.0
Ip6InECT0Pkts 52651828 0.0
IpExtInNoECTPkts 33630 0.0
IpExtInECT0Pkts 15581379 0.0
IpExtInCEPkts 6 0.0

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-08-09 13:24:59 +0800
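
The four counters added by 1f07d03e2 map directly onto the two ECN bits of the IPv4 TOS byte (00 Not-ECT, 01 ECT(1), 10 ECT(0), 11 CE). A minimal userspace sketch of per-codepoint accounting follows; `struct ecn_stats` and `ecn_account()` are hypothetical names, not the kernel's MIB interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints: the two low-order bits of the IPv4 TOS byte. */
enum ecn { ECN_NOECT = 0, ECN_ECT1 = 1, ECN_ECT0 = 2, ECN_CE = 3 };

#define ECN_MASK 0x3u

/* Hypothetical counter block, indexed by enum ecn. */
struct ecn_stats {
    uint64_t pkts[4];
};

/* Count one received frame under its ECN codepoint. */
void ecn_account(struct ecn_stats *s, uint8_t tos)
{
    assert(s != NULL);
    s->pkts[tos & ECN_MASK]++;
}
```

Because the codepoint is used as the array index, accounting costs one mask and one increment per frame and needs no branches.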

17 Jul, 2013

1 commit

21d1196a3 ipv4: set transport header earlier

commit 45f00f99d6e ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
performance regression for non GRO traffic, basically disabling
IP early demux.

IPv6 stack resets transport header in ip6_rcv() before calling
IP early demux in ip6_rcv_finish(), while IPv4 does this only in
ip_local_deliver_finish(), _after_ IP early demux.

GRO traffic happened to enable IP early demux because transport header
is also set in inet_gro_receive()

Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
same : transport_header should be set in ip_rcv() instead of
ip_local_deliver_finish()

ip_local_deliver_finish() can also use skb_network_header_len() which is
faster than ip_hdrlen()

Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Tom Herbert
Signed-off-by: David S. Miller

Eric Dumazet
2013-07-17 03:59:28 +0800
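
Setting the transport header early, as 21d1196a3 proposes, only requires the IPv4 header length. A minimal userspace sketch, assuming a hypothetical `struct pkt` that stores the transport offset relative to the start of the IPv4 header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet view; `data` points at the IPv4 header. */
struct pkt {
    const uint8_t *data;
    size_t len;
    size_t transport_off;   /* offset of the L4 header from `data` */
};

/*
 * Derive the transport header offset from the IHL field (low nibble of
 * the first byte, counted in 32-bit words).
 * Returns 0 on success, -1 if the header is malformed or truncated.
 */
int set_transport_header(struct pkt *p)
{
    size_t ihl_bytes;

    assert(p != NULL && p->data != NULL);
    if (p->len < 20)
        return -1;
    ihl_bytes = (size_t)(p->data[0] & 0x0F) * 4;
    if (ihl_bytes < 20 || ihl_bytes > p->len)
        return -1;
    p->transport_off = ihl_bytes;
    return 0;
}
```

Once the offset is stored, later stages can reuse it instead of reparsing the IHL field, which is the idea behind using skb_network_header_len() in place of ip_hdrlen().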

30 Apr, 2013

1 commit

6a5dc9e59 net: Add MIB counters for checksum errors

Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-04-30 03:14:03 +0800

02 Mar, 2013

1 commit

d8c6f4b9b ipv[4|6]: correct dropwatch false positive in local_deliver_finish

I had a report recently of a user trying to use dropwatch to localise some frame
loss, and they were getting false positives. It turned out they were using a user
space SCTP stack that used raw sockets to grab frames. When we don't have a
registered protocol for a given packet, we record it as a drop, even if a raw
socket receives the frame. We should only record the drop in the event a raw
socket doesn't exist to receive the frames.

Tested by the reporter successfully.

Signed-off-by: Neil Horman
Reported-by: William Reich
Tested-by: William Reich
CC: "David S. Miller"
CC: William Reich
CC: eric.dumazet@gmail.com
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Neil Horman
2013-03-02 04:56:29 +0800
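
The accounting rule from d8c6f4b9b can be modeled as a three-way decision. The sketch below is a userspace illustration with hypothetical names (`struct stats`, `local_deliver_finish_sketch()`); it assumes that a frame taken by a raw socket is counted as consumed rather than dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for the three possible outcomes. */
struct stats {
    unsigned long delivered;
    unsigned long consumed;
    unsigned long dropped;
};

/*
 * have_proto:    a registered transport protocol handles the packet.
 * raw_delivered: at least one raw socket received a copy.
 */
void local_deliver_finish_sketch(struct stats *st, bool have_proto,
                                 bool raw_delivered)
{
    assert(st != NULL);
    if (have_proto)
        st->delivered++;
    else if (raw_delivered)
        st->consumed++;     /* not a drop: a raw socket took the frame */
    else
        st->dropped++;      /* nobody wanted it: a genuine drop */
}
```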

06 Feb, 2013

1 commit

547472b8e ipv4: Disallow non-namespace aware protocols to register.

All in-tree ipv4 protocol implementations are now namespace
aware. Therefore all the run-time checks are superfluous.

Reject registry of any non-namespace aware ipv4 protocol.
Eventually we'll remove prot->netns_ok and this registry
time check as well.

Signed-off-by: David S. Miller

David S. Miller
2013-02-06 03:42:23 +0800

31 Jul, 2012

1 commit

cca32e4bf net: TCP early demux cleanup

early_demux() handlers should be called in RCU context, and as we
use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
before dst use (skb_dst(skb)) or release (skb_dst_drop(skb))

Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
->early_demux() are confusing and not needed :

Protocol handlers are already in an RCU read lock section.
(__netif_receive_skb() does the rcu_read_lock() )

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-31 05:53:21 +0800

27 Jul, 2012

2 commits

c7109986d ipv6: Early TCP socket demux

This is the IPv6 missing bits for infrastructure added in commit
41063e9dd1195 (ipv4: Early TCP socket demux.)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-27 06:50:39 +0800
c6cffba4f ipv4: Fix input route performance regression.

With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck
Signed-off-by: David S. Miller
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

David S. Miller
2012-07-27 06:50:39 +0800

25 Jul, 2012

1 commit

9cb429d69 tcp: early_demux fixes

1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux()
and fix a potential bug if skb->head was reallocated
(iph & th pointers were not reloaded)

TCP stack will pull/check headers anyway.

2) must reload iph in ip_rcv_finish() after early_demux()
call since skb->head might have changed.

3) skb->dev->ifindex can be now replaced by skb->skb_iif

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-25 04:54:15 +0800

21 Jul, 2012

1 commit

38a424e46 ipv4: Kill ip_route_input_noref().

The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller

David Miller
2012-07-21 04:30:59 +0800

28 Jun, 2012

3 commits

160eb5a6b ipv4: Kill early demux method return value.

It's completely unnecessary.

Signed-off-by: David S. Miller

David S. Miller
2012-06-28 13:01:22 +0800
c10237e07 Revert "ipv4: tcp: dont cache unconfirmed intput dst"

This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
cache at all.

Signed-off-by: David S. Miller

David S. Miller
2012-06-28 08:05:06 +0800
c074da281 ipv4: tcp: dont cache unconfirmed intput dst

DDoS SYN flood attacks hit the IP route cache badly.

On typical machines, this cache is allowed to hold up to 8 million dst
entries, 256 bytes each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.

Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet
Cc: Hans Schillstrom
Signed-off-by: David S. Miller

Eric Dumazet
2012-06-28 06:34:24 +0800

27 Jun, 2012

1 commit

251da4130 ipv4: Cache ip_error() routes even when not forwarding.

And account for the fact that, when we are not forwarding, we should
bump statistic counters rather than emit an ICMP response.

RP-filter rejected lookups are still not cached.

Since -EHOSTUNREACH and -ENETUNREACH can now no longer be seen in
ip_rcv_finish(), remove those checks.

Signed-off-by: David S. Miller

David S. Miller
2012-06-27 07:27:09 +0800

23 Jun, 2012

1 commit

6648bd7e0 ipv4: Add sysctl knob to control early socket demux

This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%. For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added. It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.

[ Move local variables in ip_rcv_finish() down into the basic
block in which they are actually used. -DaveM ]

Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2012-06-23 08:11:13 +0800

20 Jun, 2012

1 commit

41063e9dd ipv4: Early TCP socket demux.

Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then its identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.

Signed-off-by: David S. Miller

David S. Miller
2012-06-20 12:22:05 +0800
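
The cached input route described in 41063e9dd can be sketched as follows. This is a simplified userspace model with hypothetical types. It shows the validity test (an `obsolete` marker plus a `check` callback) and that an invalid route is ignored rather than cleared, since the fixup is deferred until the socket is locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical route entry. */
struct dst {
    int obsolete;                       /* non-zero: must be revalidated */
    bool (*check)(const struct dst *);  /* true if still usable */
};

/* Hypothetical established socket with a cached input route. */
struct sock {
    struct dst *rx_dst;
};

/* Example check callbacks for the sketch. */
bool always_valid(const struct dst *d) { (void)d; return true; }
bool never_valid(const struct dst *d)  { (void)d; return false; }

/*
 * Returns the cached input route if it can be used, otherwise NULL, in
 * which case the caller falls back to a normal route lookup. The socket
 * is not modified here because early demux runs without the socket lock.
 */
struct dst *early_demux_dst(const struct sock *sk)
{
    struct dst *dst;

    assert(sk != NULL);
    dst = sk->rx_dst;
    if (!dst)
        return NULL;
    if (dst->obsolete && !dst->check(dst))
        return NULL;
    return dst;
}
```

When a valid route is returned, the packet needs only the socket lookup; the separate route demux is skipped.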