07 Feb, 2019

1 commit

  • ade446403bfb ("net: ipv4: do not handle duplicate fragments as
    overlapping") was backported to many stable trees, but it had a problem
    that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
    fail fast on IP defrag errors")

    This is the fixup for that problem as we do not want the larger patch in
    the older stable trees.

    Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
    Reported-by: Ivan Babrou
    Reported-by: Eric Dumazet
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

10 Jan, 2019

1 commit

  • [ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

    Since commit 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping
    segments.") IPv4 reassembly code drops the whole queue whenever an
    overlapping fragment is received. However, the test is written in a way
    which detects duplicate fragments as overlapping so that in environments
    with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and, for a (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only the starting offset and length
are checked, not the contents of the fragments, as that would be too
expensive. For the same reason, the linear list ("run") of an rbtree node is
not iterated; we only check whether the new fragment is a subset of the
interval covered by existing consecutive fragments.

v2: instead of an exact check iterating through the linear list of an rbtree
node, only check whether the new fragment is a subset of the "run" (suggested
by Eric Dumazet)
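
    A minimal, self-contained sketch of the classification described above;
    the identifiers are illustrative rather than the exact upstream ones, with
    offset/end delimiting the new fragment and run_start/run_end the interval
    covered by the consecutive fragments of one rbtree node:

    enum frag_verdict { FRAG_GO_LEFT, FRAG_GO_RIGHT, FRAG_DUP, FRAG_OVERLAP };

    static enum frag_verdict classify_fragment(int offset, int end,
                                               int run_start, int run_end)
    {
            if (end <= run_start)
                    return FRAG_GO_LEFT;    /* keep walking the rbtree left  */
            if (offset >= run_end)
                    return FRAG_GO_RIGHT;   /* keep walking the rbtree right */
            if (offset >= run_start && end <= run_end)
                    return FRAG_DUP;        /* subset of data already queued */
            return FRAG_OVERLAP;            /* genuine overlap               */
    }

    Only FRAG_OVERLAP leads to dropping the whole queue; FRAG_DUP frees just
    the newly arrived skb and leaves the queue intact.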

    Fixes: 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping segments.")
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Kubecek
     

17 Dec, 2018

1 commit

  • [ Upstream commit ebaf39e6032faf77218220707fc3fa22487784e0 ]

    The *_frag_reasm() functions are susceptible to miscalculating the byte
    count of packet fragments in case the truesize of a head buffer changes.
    The truesize member may be changed by the call to skb_unclone(), leaving
    the fragment memory limit counter unbalanced even if all fragments are
    processed. This miscalculation goes unnoticed as long as the network
    namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter, the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in the running state with a stacktrace similar to:

    PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
    #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
    #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
    #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
    #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
    #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
    #10 [ffff880621563e38] process_one_work at ffffffff81096f14

    It is not possible to create new network namespaces, and processes
    that call unshare() end up being stuck in uninterruptible sleep state
    waiting to acquire the net_mutex.

    The bug was observed in the IPv6 netfilter code by Per Sundstrom.
    I thank him for his analysis of the problem. The parts of this patch
    that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
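
    A hedged sketch of the accounting pattern this implies, assuming the usual
    add_frag_mem_limit() helper; the function name below is illustrative and
    this is not the literal upstream diff:

    static int reasm_prepare_head(struct netns_frags *nf, struct sk_buff *head)
    {
            int delta = -head->truesize;

            /* skb_unclone() may reallocate head and change head->truesize */
            if (skb_unclone(head, GFP_ATOMIC))
                    return -ENOMEM;

            delta += head->truesize;        /* 0 if truesize did not change */
            if (delta)
                    add_frag_mem_limit(nf, delta);
            return 0;
    }

    Charging the delta keeps the per-netns counter balanced, so
    inet_frags_exit_net() can reach zero and namespace cleanup can finish.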

    Signed-off-by: Jiri Wiesner
    Reported-by: Per Sundstrom
    Acked-by: Peter Oskolkov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Wiesner
     

04 Nov, 2018

1 commit

  • [ Upstream commit 7de414a9dd91426318df7b63da024b2b07e53df5 ]

Most callers of pskb_trim_rcsum() simply drop the skb when
it fails; however, ip_check_defrag() still continues to pass
the skb up the stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments to be defragmented on success. So, dropping the skb when we
can't defrag it is reasonable.

Note that prior to commit 88078d98d1bb this was not a big problem, as
the checksum would be fixed up anyway. After it, the checksum is not
correct on failure.

    Found this during code review.
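
    A hedged, simplified sketch of the resulting pattern in ip_check_defrag();
    netoff and len stand for the network-header offset and IP total length
    computed earlier in the function, and the surrounding error handling is
    trimmed:

            /* Callers of ip_check_defrag() expect either a reassembled packet
             * or NULL; a fragment we cannot trim may carry a stale checksum,
             * so drop it rather than pass it up.
             */
            if (pskb_trim_rcsum(skb, netoff + len)) {
                    kfree_skb(skb);
                    return NULL;
            }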

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     

20 Sep, 2018

18 commits

  • commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented again
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set, but skb->sk and
skb->ip_defrag_offset are the same union member, so that
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
the defragmented packet is fragmented again.

    test commands:
    %iptables -t nat -I POSTROUTING -j MASQUERADE
    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

    splat looks like:
    [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636!
    [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
    [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
    [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
    [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
    [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
    [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
    [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
    [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
    [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
    [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
    [ 261.198158] Call Trace:
    [ 261.199018] ? dst_output+0x180/0x180
    [ 261.205011] ? save_trace+0x300/0x300
    [ 261.209018] ? ip_copy_metadata+0xb00/0xb00
    [ 261.213034] ? sched_clock_local+0xd4/0x140
    [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack]
    [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10
    [ 261.227014] ? find_held_lock+0x39/0x1c0
    [ 261.233008] ip_finish_output+0x51d/0xb50
    [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
    [ 261.250152] ? rcu_is_watching+0x77/0x120
    [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
    [ 261.261033] ? nf_hook_slow+0xb1/0x160
    [ 261.265007] ip_output+0x1c7/0x710
    [ 261.269005] ? ip_mc_output+0x13f0/0x13f0
    [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0
    [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.282996] ? nf_hook_slow+0xb1/0x160
    [ 261.287007] raw_sendmsg+0x21f9/0x4420
    [ 261.291008] ? dst_output+0x180/0x180
    [ 261.297003] ? sched_clock_cpu+0x126/0x170
    [ 261.301003] ? find_held_lock+0x39/0x1c0
    [ 261.306155] ? stop_critical_timings+0x420/0x420
    [ 261.311004] ? check_flags.part.36+0x450/0x450
    [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.326142] ? cyc2ns_read_end+0x10/0x10
    [ 261.330139] ? raw_bind+0x280/0x280
    [ 261.334138] ? sched_clock_cpu+0x126/0x170
    [ 261.338995] ? check_flags.part.36+0x450/0x450
    [ 261.342991] ? __lock_acquire+0x4500/0x4500
    [ 261.348994] ? inet_sendmsg+0x11c/0x500
    [ 261.352989] ? dst_output+0x180/0x180
    [ 261.357012] inet_sendmsg+0x11c/0x500
    [ ... ]

v2:
- clear skb->sk in the reassembly routine (Eric Dumazet)
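
    For context, the aliasing the message refers to looks roughly like this in
    struct sk_buff (paraphrased; the exact layout varies between kernel
    versions), and the v2 fix amounts to clearing the aliased field while the
    fragments are walked at reassembly time. frag_clear_sk() below is an
    illustrative name, not an upstream function:

    struct sk_buff {
            /* ... */
            union {
                    struct sock     *sk;
                    int             ip_defrag_offset;
            };
            /* ... */
    };

    static void frag_clear_sk(struct sk_buff *head)
    {
            struct sk_buff *fp;

            /* A leftover ip_defrag_offset must not be mistaken for a socket
             * pointer by later code such as the skb->sk check (BUG_ON) in
             * ip_do_fragment().
             */
            for (fp = head; fp; fp = fp->next)
                    fp->sk = NULL;
    }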

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Taehee Yoo
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • This patch changes the runtime behavior of IP defrag queue:
    incoming in-order fragments are added to the end of the current
    list/"run" of in-order fragments at the tail.

    On some workloads, UDP stream performance is substantially improved:

    RX: ./udp_stream -F 10 -T 2 -l 60
    TX: ./udp_stream -c -H -F 10 -T 5 -l 60

    with this patchset applied on a 10Gbps receiver:

    throughput=9524.18
    throughput_units=Mbit/s

    upstream (net-next):

    throughput=4608.93
    throughput_units=Mbit/s

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
it starts a new "tail" run and is inserted into the rb-tree
at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).
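
    A hedged sketch of the skb->cb layout and one of the helpers; the names
    follow the upstream IPv4 code as I recall them and may not match every
    kernel version exactly:

    struct ipfrag_skb_cb {
            struct inet_skb_parm    h;              /* existing IPv4 cb state */
            struct sk_buff          *next_frag;     /* next skb in this run   */
            int                     frag_run_len;   /* bytes covered by run   */
    };

    #define FRAG_CB(skb)    ((struct ipfrag_skb_cb *)((skb)->cb))

    /* Start a new run headed by skb (one rb-tree node per run). */
    static void ip4_frag_init_run(struct sk_buff *skb)
    {
            BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));

            FRAG_CB(skb)->next_frag = NULL;
            FRAG_CB(skb)->frag_run_len = skb->len;
    }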

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • We accidentally removed the parentheses here, but they are required
    because '!' has higher precedence than '&'.
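
    A tiny standalone demonstration of the precedence issue, with illustrative
    values rather than the kernel code:

    #include <stdio.h>

    int main(void)
    {
            unsigned int flags = 0x2;       /* some bit other than 0x1 is set */

            /* '!' binds tighter than '&', so this computes (!flags) & 0x1 */
            printf("%d\n", !flags & 0x1);   /* prints 0 */
            /* the intended test: "is bit 0x1 clear?" */
            printf("%d\n", !(flags & 0x1)); /* prints 1 */
            return 0;
    }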

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller
    (cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

Since we plan to use an RB tree for the TCP retransmit queue to speed up
SACK processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

    This saves some overhead in both TCP and netem.

    v2: removes the swtstamp field from struct tcp_skb_cb
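
    For reference, a paraphrased view of the relevant sk_buff layout after
    this change (not an exact copy of include/linux/skbuff.h): rbnode now
    overlays next/prev/dev, all of which are dead or constant while an skb
    sits in an rb-tree queue, so nothing has to be saved and restored around
    tree insertion.

    struct sk_buff {
            union {
                    struct {
                            struct sk_buff          *next;
                            struct sk_buff          *prev;
                            struct net_device       *dev;
                    };
                    struct rb_node  rbnode;         /* netem, TCP ofo queue */
            };
            /* ... tstamp now lives outside this union ... */
    };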

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • This behavior is required in IPv6, and there is little need
    to tolerate overlapping fragments in IPv4. This change
    simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran the ip_defrag selftest (not yet available upstream).

    Suggested-by: David S. Miller
    Signed-off-by: Peter Oskolkov
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller
    (cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since the linker might place a non-zero value next to it, preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematurely break user scripts wanting to change it.

    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.
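
    A hedged illustration of the hazard (identifiers are illustrative, and the
    ip6frag_low_thresh variable is assumed to be declared elsewhere as an
    unsigned long): proc_doulongvec_minmax() treats extra1/extra2 as pointers
    to unsigned long, so pointing extra1 at a 4-byte int makes it read 4 extra
    bytes of whatever the linker placed next to "zero".

    static int zero;

    static struct ctl_table ip6_frags_ctl_table[] = {
            {
                    .procname       = "ip6frag_low_thresh",
                    .data           = &ip6frag_low_thresh, /* unsigned long */
                    .maxlen         = sizeof(unsigned long),
                    .mode           = 0644,
                    .proc_handler   = proc_doulongvec_minmax,
                    .extra1         = &zero,  /* int read as unsigned long */
            },
            { }
    };

    The fix simply drops .extra1: a lower bound of 0 is meaningless for an
    unsigned value anyway.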

    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller
    (cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
    this integer is currently in a different cache line than skb->next,
    meaning that we use two cache lines per skb when finding the insertion point.

    By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
    in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change won't help the fast path, since we still need
to access prev->len (2nd cache line), but it will show great
benefits when the slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potentially long list is an attack vector;
we might consider also using an rb-tree there eventually.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
    spinlock before calling icmp_send()")

While it fixed the bug at that time, it also added a very high cost
for DDoS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, a copy, and freeing).

We can use skb_get(head) here; all we want is to make sure the skb won't
be freed by another CPU.
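
    A hedged sketch of the pattern, simplified from ip_expire(); it is not the
    literal diff:

            /* Hold a reference instead of cloning: another CPU freeing the
             * queue can no longer free head under us while the lock is
             * dropped for the (rate-limited) ICMP error.
             */
            skb_get(head);
            spin_unlock(&qp->q.lock);

            icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
            kfree_skb(head);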

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch avoids an overflow error if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true:

    if (... || frag_mem_limit(nf) > nf->high_thresh)

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0
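
    A minimal sketch of what the accounting helpers look like after the
    switch; this is close to, but not guaranteed to be byte-for-byte, the
    upstream include/net/inet_frag.h:

    static inline long frag_mem_limit(const struct netns_frags *nf)
    {
            return atomic_long_read(&nf->mem);
    }

    static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_sub(val, &nf->mem);
    }

    static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_add(val, &nf->mem);
    }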

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
This function is obsolete after the rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
This refactors ip_expire(), removing one indentation level.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDoS, the ICMP message won't be sent because of rate limits.

The fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 has the same
issue as the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also, since we use rhashtable, we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405ab ("inet: frag: don't account number
of fragment queues").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
Some applications still rely on IP fragmentation, and to be fair the Linux
reassembly unit does not work under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage-collect items when the host is under memory
pressure, and to do a hash rebuild, changing the seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if the host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speed up netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from the prior implementation and save
a couple of atomic operations.

Before this patch, 16 CPUs (16-RX-queue NIC) could not handle more
than 1 Mpps of frag DDoS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.
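
    A hedged sketch of the shape of the change: each netns gets its own table,
    initialised from parameters supplied by the protocol, and lookups run
    under RCU. Field and helper names below are illustrative, not the exact
    upstream ones.

    static int inet_frags_init_net(struct netns_frags *nf)
    {
            /* one rhashtable per namespace, per protocol */
            return rhashtable_init(&nf->rhashtable, &nf->f->rhash_params);
    }

    static struct inet_frag_queue *frag_find(struct netns_frags *nf,
                                             const void *key)
    {
            struct inet_frag_queue *q;

            rcu_read_lock();
            q = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
            if (q && !refcount_inc_not_zero(&q->refcnt))
                    q = NULL;
            rcu_read_unlock();
            return q;
    }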

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.
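
    A hedged before/after sketch of the conversion pattern for the frag-queue
    timer; frag_expire() and frag_arm_timer() are illustrative names and the
    real callbacks do more work:

    static void frag_expire(struct timer_list *t)
    {
            /* recover the queue that embeds this timer_list */
            struct inet_frag_queue *q = from_timer(q, t, timer);

            pr_debug("frag queue %p expired\n", q);
            /* ... kill and free the expired queue ... */
    }

    static void frag_arm_timer(struct inet_frag_queue *q, unsigned long timeout)
    {
            /* old: setup_timer(&q->timer, frag_expire, (unsigned long)q); */
            timer_setup(&q->timer, frag_expire, 0);
            mod_timer(&q->timer, jiffies + timeout);
    }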

    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Cc: linux-wpan@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: netfilter-devel@vger.kernel.org
    Cc: coreteam@netfilter.org
    Signed-off-by: Kees Cook
    Acked-by: Stefan Schmidt # for ieee802154
    Signed-off-by: David S. Miller
    (cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
We need to call inet_frags_init() before register_pernet_subsys(),
as a prerequisite for the following patch ("inet: frags: use rhashtables for reassembly units").
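
    A hedged sketch of the new ordering in ipfrag_init() (simplified, with
    error handling trimmed; the real function also registers sysctls):

    static int __init ipfrag_init(void)
    {
            int ret;

            ret = inet_frags_init(&ip4_frags);      /* shared state first */
            if (ret)
                    return ret;
            return register_pernet_subsys(&ip4_frags_ops);
    }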

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 483a6e4fa055123142d8956866fe2aa9c98d546d)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

06 Aug, 2018

1 commit

  • [ Upstream commit 4672694bd4f1aebdab0ad763ae4716e89cb15221 ]

    ip_frag_queue() might call pskb_pull() on one skb that
    is already in the fragment queue.

    We need to take care of possible truesize change, or we
    might have an imbalance of the netns frags memory usage.

IPv6 is immune to this bug because RFC 5722, Section 4,
amended by Errata ID 3089, states:

    When reassembling an IPv6 datagram, if
    one or more its constituent fragments is determined to be an
    overlapping fragment, the entire datagram (and any constituent
    fragments) MUST be silently discarded.

    Fixes: 158f323b9868 ("net: adjust skb->truesize in pskb_expand_head()")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
- file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Sep, 2017

1 commit

  • This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.

After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
for fragmentation mem accounting") there is no need for this
fix-up patch. As percpu_counter is no longer used, it cannot
leak memory any longer.

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

01 Jul, 2017

1 commit

The refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

23 Mar, 2017

1 commit

Dmitry reported a lockdep splat [1] (a false positive) that we can fix
by releasing the spinlock before calling icmp_send() from ip_expire().

This is a false positive because sending an ICMP message cannot
possibly re-enter the IP frag engine.

    [1]
    [ INFO: possible circular locking dependency detected ]
    4.10.0+ #29 Not tainted
    -------------------------------------------------------
    modprobe/12392 is trying to acquire lock:
    (_xmit_ETHER#2){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    (_xmit_ETHER#2){+.-...}, at: [] __netif_tx_lock
    include/linux/netdevice.h:3486 [inline]
    (_xmit_ETHER#2){+.-...}, at: []
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180

    but task is already holding lock:
    (&(&q->lock)->rlock){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    (&(&q->lock)->rlock){+.-...}, at: []
    ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&q->lock)->rlock){+.-...}:
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669
    ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713
    packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459
    deliver_skb net/core/dev.c:1834 [inline]
    dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890
    xmit_one net/core/dev.c:2903 [inline]
    dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923
    sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308
    neigh_output include/net/neighbour.h:478 [inline]
    ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228
    ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672
    ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545
    ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
    __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
    SYSC_sendmmsg net/socket.c:2106 [inline]
    SyS_sendmmsg+0x35/0x60 net/socket.c:2101
    do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
    return_from_SYSCALL_64+0x0/0x7a

    -> #0 (_xmit_ETHER#2){+.-...}:
    check_prev_add kernel/locking/lockdep.c:1830 [inline]
    check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    __netif_tx_lock include/linux/netdevice.h:3486 [inline]
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_hh_output include/net/neighbour.h:468 [inline]
    neigh_output include/net/neighbour.h:476 [inline]
    ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
    ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
    icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
    ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
    call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307 [inline]
    __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
    run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
    __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364 [inline]
    irq_exit+0x1cc/0x200 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:657 [inline]
    smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
    apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
    __read_once_size include/linux/compiler.h:254 [inline]
    atomic_read arch/x86/include/asm/atomic.h:26 [inline]
    rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
    __rcu_is_watching kernel/rcu/tree.c:1133 [inline]
    rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
    rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
    radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
    filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
    do_fault_around mm/memory.c:3231 [inline]
    do_read_fault mm/memory.c:3265 [inline]
    do_fault+0xbd5/0x2080 mm/memory.c:3370
    handle_pte_fault mm/memory.c:3600 [inline]
    __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
    handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
    __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
    do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
    page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011

    other info that might help us debug this:

    Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&q->lock)->rlock);
                               lock(_xmit_ETHER#2);
                               lock(&(&q->lock)->rlock);
  lock(_xmit_ETHER#2);

    *** DEADLOCK ***

    10 locks held by modprobe/12392:
    #0: (&mm->mmap_sem){++++++}, at: []
    __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336
    #1: (rcu_read_lock){......}, at: []
    filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    spin_lock include/linux/spinlock.h:299 [inline]
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    pte_alloc_one_map mm/memory.c:2944 [inline]
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072
    #3: (((&q->timer))){+.-...}, at: []
    lockdep_copy_map include/linux/lockdep.h:175 [inline]
    #3: (((&q->timer))){+.-...}, at: []
    call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258
    #4: (&(&q->lock)->rlock){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    #4: (&(&q->lock)->rlock){+.-...}, at: []
    ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201
    #5: (rcu_read_lock){......}, at: []
    ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216
    #6: (slock-AF_INET){+.-...}, at: [] spin_trylock
    include/linux/spinlock.h:309 [inline]
    #6: (slock-AF_INET){+.-...}, at: [] icmp_xmit_lock
    net/ipv4/icmp.c:219 [inline]
    #6: (slock-AF_INET){+.-...}, at: []
    icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681
    #7: (rcu_read_lock_bh){......}, at: []
    ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198
    #8: (rcu_read_lock_bh){......}, at: []
    __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324
    #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at:
    [] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423

    stack backtrace:
    CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x2ee/0x3ef lib/dump_stack.c:52
    print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204
    check_prev_add kernel/locking/lockdep.c:1830 [inline]
    check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    __netif_tx_lock include/linux/netdevice.h:3486 [inline]
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_hh_output include/net/neighbour.h:468 [inline]
    neigh_output include/net/neighbour.h:476 [inline]
    ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
    ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
    icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
    ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
    call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307 [inline]
    __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
    run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
    __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364 [inline]
    irq_exit+0x1cc/0x200 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:657 [inline]
    smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
    apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
    RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
    RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
    RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
    RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline]
    RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
    RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10
    RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c
    RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000
    R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25
    R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000

    rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
    radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
    filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
    do_fault_around mm/memory.c:3231 [inline]
    do_read_fault mm/memory.c:3265 [inline]
    do_fault+0xbd5/0x2080 mm/memory.c:3370
    handle_pte_fault mm/memory.c:3600 [inline]
    __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
    handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
    __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
    do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
    page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011
    RIP: 0033:0x7f83172f2786
    RSP: 002b:00007fffe859ae80 EFLAGS: 00010293
    RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970
    RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000
    R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040
    R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Jan, 2016

1 commit

Later parts of the stack (including fragmentation) expect that there is
never a socket attached to a frag in a frag_list; however, this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the skb_orphan() call could be added to openvswitch to fix this
particular error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.
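
    A hedged sketch of the resulting invariant enforcement at the top of
    ip_defrag() (heavily simplified; the real function goes on to find or
    create the queue and returns its result):

    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
    {
            /* Never let a socket ride along into the frag queue: later code,
             * e.g. the BUG_ON(skb->sk) in ip_do_fragment(), relies on this.
             */
            skb_orphan(skb);

            /* ... find or create the queue, lock it, enqueue the fragment ... */
            return 0;       /* placeholder for the real queueing result */
    }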

    kernel BUG at net/ipv4/ip_output.c:586!
    [...]
    Call Trace:

    [] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
    [] ovs_fragment+0xcc/0x214 [openvswitch]
    [] ? dst_discard_out+0x20/0x20
    [] ? dst_ifdown+0x80/0x80
    [] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
    [] ? mod_timer_pending+0x65/0x210
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
    [] ? __lock_is_held+0x54/0x70
    [] do_output.isra.29+0xe3/0x1b0 [openvswitch]
    [] do_execute_actions+0xe11/0x11f0 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0x32/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0xb2/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
    [] ovs_vport_receive+0x5d/0xa0 [openvswitch]
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] internal_dev_xmit+0x6c/0x140 [openvswitch]
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] dev_hard_start_xmit+0x2b9/0x5e0
    [] ? netif_skb_features+0xd1/0x1f0
    [] __dev_queue_xmit+0x800/0x930
    [] ? __dev_queue_xmit+0x50/0x930
    [] ? mark_held_locks+0x71/0x90
    [] ? neigh_resolve_output+0x106/0x220
    [] dev_queue_xmit+0x10/0x20
    [] neigh_resolve_output+0x178/0x220
    [] ? ip_finish_output2+0x1ff/0x590
    [] ip_finish_output2+0x1ff/0x590
    [] ? ip_finish_output2+0x7e/0x590
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] icmp_push_reply+0xea/0x120
    [] icmp_reply.constprop.23+0x1ed/0x230
    [] icmp_echo.part.21+0x4e/0x50
    [] ? __lock_is_held+0x54/0x70
    [] ? rcu_read_lock_held+0x5e/0x70
    [] icmp_echo+0x36/0x70
    [] icmp_rcv+0x271/0x450
    [] ip_local_deliver_finish+0x127/0x3a0
    [] ? ip_local_deliver_finish+0x41/0x3a0
    [] ip_local_deliver+0x60/0xd0
    [] ? ip_rcv_finish+0x560/0x560
    [] ip_rcv_finish+0xdd/0x560
    [] ip_rcv+0x283/0x3e0
    [] ? match_held_lock+0x192/0x200
    [] ? inet_del_offload+0x40/0x40
    [] __netif_receive_skb_core+0x392/0xae0
    [] ? process_backlog+0x8e/0x230
    [] ? mark_held_locks+0x71/0x90
    [] __netif_receive_skb+0x18/0x60
    [] process_backlog+0x78/0x230
    [] ? process_backlog+0xdd/0x230
    [] net_rx_action+0x155/0x400
    [] __do_softirq+0xcc/0x420
    [] ? ip_finish_output2+0x217/0x590
    [] do_softirq_own_stack+0x1c/0x30

    [] do_softirq+0x4e/0x60
    [] __local_bh_enable_ip+0xa8/0xb0
    [] ip_finish_output2+0x240/0x590
    [] ? ip_do_fragment+0x831/0x8a0
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] raw_sendmsg+0x7d3/0xc30
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? inet_sendmsg+0xc7/0x1d0
    [] ? __lock_is_held+0x54/0x70
    [] inet_sendmsg+0x10a/0x1d0
    [] ? inet_sendmsg+0x5/0x1d0
    [] sock_sendmsg+0x38/0x50
    [] ___sys_sendmsg+0x25f/0x270
    [] ? handle_mm_fault+0x8dd/0x1320
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __do_page_fault+0x1e2/0x460
    [] ? __fget_light+0x66/0x90
    [] __sys_sendmsg+0x42/0x80
    [] SyS_sendmsg+0x12/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x6f
    Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
    66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
    RIP [] ip_do_fragment+0x892/0x8a0
    RSP

    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Signed-off-by: Joe Stringer
    Signed-off-by: David S. Miller

    Joe Stringer
     

03 Nov, 2015

1 commit

This patch fixes the following problems:

1) percpu_counter_init() can return an error, therefore
init_frag_mem_limit() must propagate this error so that
inet_frags_init_net() can do the same up to its callers.

2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
properly and free the percpu_counter.

Without this fix, we leave a freed object on the percpu_counters
global list (if CONFIG_HOTPLUG_CPU), leading to crashes.

    This bug was detected by KASAN and syzkaller tool
    (http://github.com/google/syzkaller)
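
    A hedged sketch of the two fixes described above (simplified; the exact
    function layout differs between IPv4 and IPv6):

    static int inet_frags_init_net(struct netns_frags *nf)
    {
            /* 1) propagate percpu_counter_init() failure instead of ignoring it */
            return percpu_counter_init(&nf->mem, 0, GFP_KERNEL);
    }

    static int __net_init ipv4_frags_init_net(struct net *net)
    {
            int res;

            res = inet_frags_init_net(&net->ipv4.frags);
            if (res)
                    return res;
            res = ip4_frags_ns_ctl_register(net);
            if (res)
                    /* 2) unwind, so no freed object is left on the global
                     *    percpu_counters list
                     */
                    percpu_counter_destroy(&net->ipv4.frags.mem);
            return res;
    }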

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Hannes Frederic Sowa
    Cc: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit

The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular, conntrack calls ip_defrag
when it is tracking outbound packets from the local machine.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.
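
    The interface change, in a hedged sketch (prototypes as I recall them):

    /* before: int ip_defrag(struct sk_buff *skb, u32 user);
     * (the namespace was derived inside ip_defrag() from the skb)
     */
    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);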

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

29 Aug, 2015

1 commit

  • inetpeer caches based on address only, so duplicate IP addresses within
    a namespace return the same cached entry. Enhance the ipv4 address key
    to contain both the IPv4 address and VRF device index.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

14 Aug, 2015

1 commit

  • Fragmentation cache uses information from the IP header to reassemble
    packets. That information can be duplicated across VRFs -- same source
    and destination addresses, protocol and id. Handle fragmentation with
    VRFs by adding the VRF device index to entries in the cache and the
    lookup arg.
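
    A hedged sketch of the idea; struct and field names are illustrative, not
    the exact upstream ones. The lookup key for a reassembly queue grows a VRF
    device index so identical tuples on different VRFs no longer collide:

    struct frag_lookup_key {
            __be32  saddr;
            __be32  daddr;
            __be16  id;
            u8      protocol;
            u32     user;
            int     vif;    /* L3 master (VRF) device index, 0 if none */
    };

    static bool frag_key_match(const struct frag_lookup_key *a,
                               const struct frag_lookup_key *b)
    {
            return  a->saddr == b->saddr && a->daddr == b->daddr &&
                    a->id == b->id && a->protocol == b->protocol &&
                    a->user == b->user && a->vif == b->vif;
    }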

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

27 Jul, 2015

2 commits

We can simply remove the INET_FRAG_EVICTED flag to avoid all the flag
race conditions with the evictor and use a participation test for the
evictor list. When we're at that point (after inet_frag_kill) in the
timer, there are 2 possible cases:

1. The evictor added the entry to its evictor list while the timer was
waiting for the chainlock
or
2. The timer unchained the entry and the evictor won't see it

In both cases we should be able to see list_evictor correctly due
to the sync on the chainlock.

    Joint work with Florian Westphal.
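
    A hedged sketch of the participation test; I believe the helper looked
    essentially like this, but treat it as illustrative:

    static inline bool inet_frag_evicting(struct inet_frag_queue *q)
    {
            /* instead of a racy flag, ask whether the entry already sits on
             * the evictor's private list
             */
            return !hlist_unhashed(&q->list_evictor);
    }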

    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
A followup patch will call it after the inet_frag_queue has been freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).

    Tested-by: Frank Schreuder
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal