16 May, 2018

1 commit

  • commit b6a37e5e25414df4b8e9140a5c6f5ee0ec6f3b90 upstream.

    syzbot/KMSAN reported that p->dtime was read while it was
    not yet initialized in :

    delta = (__u32)jiffies - p->dtime;
    if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
            gc_stack[i] = NULL;

    This is a false positive, because the inetpeer won't be erased
    from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not
    succeed, and that cannot happen before the first inet_putpeer() call
    for this inetpeer has been done; the ->dtime field is written
    exactly before that call's refcount_dec_and_test(&p->refcnt).
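
    The upstream fix is simply to initialize ->dtime when the peer is
    allocated, so the gc read above is always defined. A sketch of the
    allocation path, not the verbatim patch:

    p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
    if (p) {
            p->daddr = *daddr;
            p->dtime = (__u32)jiffies;      /* defined before any gc read */
            refcount_set(&p->refcnt, 2);    /* 1 for caller, 1 for tree */
    }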

    The KMSAN report was :

    BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
    BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
    CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
    inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
    inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
    icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline]
    icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725
    ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472
    ip_rcv_options net/ipv4/ip_input.c:284 [inline]
    ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
    netif_receive_skb+0x230/0x240 net/core/dev.c:4725
    tun_rx_batched drivers/net/tun.c:1555 [inline]
    tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
    tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
    do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
    do_iter_write+0x30d/0xd40 fs/read_write.c:932
    vfs_writev fs/read_write.c:977 [inline]
    do_writev+0x3c9/0x830 fs/read_write.c:1012
    SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
    SyS_writev+0x56/0x80 fs/read_write.c:1082
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x455111
    RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111
    RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc
    RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000
    R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff
    R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
    inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210
    inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
    ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153
    inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline]
    inet_frag_create net/ipv4/inet_fragment.c:385 [inline]
    inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418
    ip_find net/ipv4/ip_fragment.c:275 [inline]
    ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676
    ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724
    packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447
    deliver_skb net/core/dev.c:1897 [inline]
    deliver_ptype_list_skb net/core/dev.c:1912 [inline]
    __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545
    __netif_receive_skb net/core/dev.c:4627 [inline]
    netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
    netif_receive_skb+0x230/0x240 net/core/dev.c:4725
    tun_rx_batched drivers/net/tun.c:1555 [inline]
    tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
    tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
    do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
    do_iter_write+0x30d/0xd40 fs/read_write.c:932
    vfs_writev fs/read_write.c:977 [inline]
    do_writev+0x3c9/0x830 fs/read_write.c:1012
    SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
    SyS_writev+0x56/0x80 fs/read_write.c:1082
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

29 Sep, 2017

1 commit

  • My prior fix was not complete, as we were dereferencing a pointer
    three times per node, not twice as I initially thought.
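
    The resulting loop reads the RCU-protected child pointer exactly once
    per iteration, into a local, and touches only that local afterwards.
    A simplified sketch of the pattern (per net/ipv4/inetpeer.c, details
    trimmed):

    struct rb_node **pp, *parent, *next;
    struct inet_peer *p;
    int cmp;

    pp = &base->rb_root.rb_node;
    parent = NULL;
    while (1) {
            next = rcu_dereference_raw(*pp);        /* single read per node */
            if (!next)
                    break;
            parent = next;
            p = rb_entry(parent, struct inet_peer, rb_node);
            cmp = inetpeer_addr_cmp(daddr, &p->daddr);
            if (cmp == 0)
                    return p;       /* real code also takes a refcount */
            pp = (cmp == -1) ? &next->rb_left : &next->rb_right;
    }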

    Fixes: 4cc5b44b29a9 ("inetpeer: fix RCU lookup()")
    Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2017

1 commit

  • Excess of seafood or something happened while I cooked the commit
    adding RB tree to inetpeer.

    Of course, RCU rules need to be respected or bad things can happen.

    In this particular loop, we need to read *pp once per iteration, not
    twice.
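
    Reduced to the essential lines, the bug and the fix look like this
    (a sketch; *pp points into an RCU-managed tree that a writer may
    rebalance at any time):

    /* bad: two reads of *pp per iteration can observe two different
     * nodes if a writer rebalances the tree in between */
    if (rcu_dereference_raw(*pp) == NULL)
            break;
    parent = rcu_dereference_raw(*pp);

    /* good: read once, then use only the local copy */
    parent = rcu_dereference_raw(*pp);
    if (!parent)
            break;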

    Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
    Reported-by: John Sperbeck
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jul, 2017

1 commit

  • As discussed in Faro during Netfilter Workshop 2017, RB trees can be
    used with RCU, using a seqlock.

    Note that net/rxrpc/conn_service.c is already using this.

    This patch converts inetpeer from an AVL tree to an RB tree, since it
    allows us to remove the private AVL implementation in favor of the
    shared RB code.

    $ size net/ipv4/inetpeer.before net/ipv4/inetpeer.after
       text    data    bss     dec     hex  filename
       3195      40    128    3363     d23  net/ipv4/inetpeer.before
       1562      24      0    1586     632  net/ipv4/inetpeer.after

    The same technique can be used to speed up
    net/netfilter/nft_set_rbtree.c (removing rwlock contention in fast path)
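
    The read side pairs RCU with a seqlock retry: walk the tree
    locklessly, and trust a miss only if no writer rebalanced the tree
    during the walk. A minimal sketch of the pattern (helper names
    assumed):

    struct inet_peer *p;
    unsigned int seq;

    rcu_read_lock();
    seq = read_seqbegin(&base->lock);
    p = rb_lookup_rcu(daddr, base);         /* lockless walk; a concurrent
                                             * rebalance may mislead it */
    if (!p && read_seqretry(&base->lock, seq)) {
            /* a writer interfered: redo the lookup under the lock */
            write_seqlock_bh(&base->lock);
            p = rb_lookup(daddr, base);
            write_sequnlock_bh(&base->lock);
    }
    rcu_read_unlock();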

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2017

1 commit

    refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
    This conversion requires an overall +1 shift of the whole
    refcounting scheme.
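
    A generic before/after sketch of such a conversion (not the exact
    inetpeer diff): refcount_t refuses to increment from zero, so a
    scheme that used 0 for a live-but-unreferenced object must shift
    every state up by one:

    /* before: atomic_t, 0 could mean "in tree, no users" */
    atomic_set(&p->refcnt, 1);

    /* after: refcount_t, 0 strictly means "dead"; all states are +1 */
    refcount_set(&p->refcnt, 2);            /* 1 for caller, 1 for tree */
    if (!refcount_inc_not_zero(&p->refcnt)) /* lookup-side increment */
            p = NULL;                       /* object already dying */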

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

29 Aug, 2015

1 commit


09 Sep, 2014

1 commit

    inetpeer sequence numbers are no longer incremented, so there is no
    need to check and flush the tree. The function that increments the
    sequence number was already dead code and was removed in "ipv4: remove
    unused function" (068a6e18). Remove the code that checks for a change,
    too.

    Verifying that v4_seq and v6_seq are never incremented and thus that
    flush_check compares bp->flush_seq to 0 is trivial.

    The second part of the change removes flush_check completely, even
    though bp->flush_seq is exactly !0 once, at initialization. This
    change is correct because the only time this branch is true is when
    bp->root == peer_avl_empty_rcu, in which case both the branch and
    inetpeer_invalidate_tree are a NOOP.

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

03 Jun, 2014

1 commit

  • Ideally, we would need to generate IP ID using a per destination IP
    generator.

    Linux kernels used the inet_peer cache for this purpose, but this had a
    huge cost on servers with MTU discovery disabled.

    1) each inet_peer struct consumes 192 bytes

    2) inetpeer cache uses a binary tree of inet_peer structs,
    with a nominal size of ~66000 elements under load.

    3) lookups in this tree are hitting a lot of cache lines, as tree depth
    is about 20.

    4) If the server deals with many tcp flows, we have a high probability
    of not finding the inet_peer, allocating a fresh one, and inserting it
    in the tree with the same initial ip_id_count (cf. secure_ip_id())

    5) We garbage collect inet_peer aggressively.

    IP ID generation does not have to be 'perfect'.

    The goal is to avoid duplicates in a short period of time,
    so that reassembly units have a chance to complete reassembly of
    fragments belonging to one message before receiving other fragments
    with a recycled ID.

    We simply use an array of generators, and a Jenkins hash (jhash) of the
    dst IP as the key.

    ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
    belongs (it is only used from this file)

    secure_ip_id() and secure_ipv6_id() are no longer needed.

    Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
    unnecessary decrement/increment of the number of segments.
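
    A simplified sketch of the replacement (bucket count and helper name
    assumed):

    #define IP_IDENTS_SZ 2048u

    static atomic_t ip_idents[IP_IDENTS_SZ];
    static u32 ip_idents_hashrnd;           /* random seed, set once */

    static void ip_select_ident_segs_sketch(struct iphdr *iph, int segs)
    {
            /* pick a generator from a hash of the destination address */
            u32 bucket = jhash_1word((__force u32)iph->daddr,
                                     ip_idents_hashrnd) % IP_IDENTS_SZ;

            /* reserve 'segs' consecutive IDs so TSO sub-segments stay
             * unique without touching the counter once per segment */
            iph->id = htons(atomic_add_return(segs, &ip_idents[bucket])
                            - segs);
    }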

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2014

1 commit

  • Do not initialize list twice.
    list_replace_init() already takes care of initializing list.
    We don't need to initialize it with LIST_HEAD() beforehand.
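
    For reference, the pattern in question, sketched in its likely
    gc-worker context: list_replace_init() takes over the old list and
    fully re-initializes the destination head itself:

    struct list_head list;  /* no LIST_HEAD()/INIT_LIST_HEAD() needed */

    spin_lock_bh(&gc_lock);
    list_replace_init(&gc_list, &list);     /* 'list' initialized here */
    spin_unlock_bh(&gc_lock);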

    Signed-off-by: xiao jin
    Reviewed-by: David Cohen
    Signed-off-by: David S. Miller

    xiao jin
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Dec, 2013

1 commit


27 Dec, 2013

1 commit


20 Sep, 2013

1 commit

  • If local fragmentation is allowed, then ip_select_ident() and
    ip_select_ident_more() need to generate unique IDs to ensure
    correct defragmentation on the peer.

    For example, if IPsec (tunnel mode) has to encrypt large skbs
    that have the local_df bit set, then all IP fragments that belonged
    to different ESP datagrams would have used the same identifier.
    If one of these IP fragments were lost or reordered, the peer
    could stitch together the wrong IP fragments, ones that did
    not belong to the same datagram. This would lead to packet loss
    or data corruption.
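
    A sketch of the corrected helper (simplified; skb->local_df is the
    field name of that era, later renamed ignore_df):

    static inline void ip_select_ident(struct sk_buff *skb,
                                       struct dst_entry *dst,
                                       struct sock *sk)
    {
            struct iphdr *iph = ip_hdr(skb);

            /* the constant-ID fast path is only safe when DF is set
             * AND we are guaranteed never to fragment locally */
            if ((iph->frag_off & htons(IP_DF)) && !skb->local_df)
                    iph->id = 0;    /* per-socket counter path omitted */
            else
                    __ip_select_ident(iph, dst, 0); /* unique, per peer */
    }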

    Signed-off-by: Ansis Atteka
    Signed-off-by: David S. Miller

    Ansis Atteka
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • When jiffies wraps around (for example, 5 minutes after the boot, see
    INITIAL_JIFFIES) and a peer has just been created, now - peer->rate_last
    can be < XRLIM_BURST_FACTOR * timeout, so the token is not set to the
    maximum value, thus some icmp packets can be unexpectedly dropped.

    Fix this case by initializing rate_last to 60 seconds in the past.
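
    The idea in one line, sketched: pretend the last rate-limited event
    happened a minute ago, so a fresh peer starts with a full token
    bucket even right after a jiffies wrap:

    p->rate_last = jiffies - 60*HZ;  /* 60s in the past: full burst */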

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

22 Aug, 2012

1 commit


11 Jul, 2012

2 commits


21 Jun, 2012

1 commit

  • No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold
    base lock.

    Also use correct rcu annotations to remove sparse errors
    (CONFIG_SPARSE_RCU_POINTER=y)

    net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison expression (different address spaces)
    net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison expression (different address spaces)
    net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison expression (different address spaces)
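
    Since the writer path already holds base->lock, a plain annotated
    store is enough; a sketch of the lock-held accessor implied by the
    sparse fixes (macro shape assumed):

    #define rcu_deref_locked(X, BASE) \
            rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))

    /* inetpeer_invalidate_tree(), with base->lock held */
    root = rcu_deref_locked(base->root, base);
    if (root != peer_avl_empty) {
            base->root = peer_avl_empty_rcu;    /* plain store, no cmpxchg */
            base->total = 0;
    }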

    Signed-off-by: Eric Dumazet
    Cc: Steffen Klassert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jun, 2012

1 commit

  • This implementation can deal with having many inetpeer roots, which is
    a necessary prerequisite for per-FIB table rooted peer tables.

    Each family (AF_INET, AF_INET6) has a sequence number which we bump
    when we get a family invalidation request.

    Each peer lookup cheaply checks whether the flush sequence of the
    root we are using is out of date, and if so flushes it and updates
    the sequence number.
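
    A sketch of the cheap per-lookup check (field and variable names
    assumed from the description above):

    static atomic_t v4_seq, v6_seq;     /* bumped on family invalidation */

    static void flush_check(struct inet_peer_base *base, int family)
    {
            int seq = atomic_read(family == AF_INET ? &v4_seq : &v6_seq);

            if (base->flush_seq != seq) {   /* root is stale: flush lazily */
                    inetpeer_invalidate_tree(base);
                    base->flush_seq = seq;
            }
    }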

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

3 commits


09 Jun, 2012

1 commit

  • inetpeer currently doesn't support namespaces, so information can
    leak across namespaces.

    This patch moves the global vars v4_peers and v6_peers into
    netns_ipv4 and netns_ipv6 as a 'peers' field,

    adds a struct pernet_operations inetpeer_ops to initialize the
    pernet inetpeer data,

    and changes family_to_base() and inet_getpeer() to support namespaces.
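
    A sketch of the pernet wiring (simplified to the v4 side; the real
    patch handles v6 too, and inet_peer_base_init() is an assumed
    helper):

    static int __net_init inetpeer_net_init(struct net *net)
    {
            net->ipv4.peers = kzalloc(sizeof(struct inet_peer_base),
                                      GFP_KERNEL);
            if (!net->ipv4.peers)
                    return -ENOMEM;
            inet_peer_base_init(net->ipv4.peers);
            return 0;
    }

    static void __net_exit inetpeer_net_exit(struct net *net)
    {
            inetpeer_invalidate_tree(net->ipv4.peers);
            kfree(net->ipv4.peers);
    }

    static struct pernet_operations inetpeer_ops = {
            .init = inetpeer_net_init,
            .exit = inetpeer_net_exit,
    };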

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

07 Jun, 2012

1 commit

  • commit 5faa5df1fa2024 (inetpeer: Invalidate the inetpeer tree along with
    the routing cache) added a race :

    Before freeing an inetpeer, we must respect an RCU grace period, and make
    sure no user will attempt to increase its refcnt.

    inetpeer_invalidate_tree() now waits for an RCU grace period before
    inserting the inetpeer tree into gc_list and waking the worker. At that
    point, no concurrent lookup can find an inetpeer in this tree.
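
    In outline, the corrected sequence (a sketch, not the verbatim
    patch; the rcu head and callback names are assumed):

    write_seqlock_bh(&base->lock);
    root = rcu_deref_locked(base->root, base);  /* lock-held accessor,
                                                 * sketched earlier */
    base->root = peer_avl_empty_rcu;            /* detach from new lookups */
    write_sequnlock_bh(&base->lock);

    /* after a full grace period, no reader can still be walking the
     * detached tree; only then queue it for the gc worker */
    call_rcu(&root->gc_rcu, inetpeer_gc_rcu_cb);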

    Signed-off-by: Eric Dumazet
    Cc: Steffen Klassert
    Acked-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2012

2 commits

  • As we invalidate the inetpeer tree along with the routing cache now,
    we don't need a genid to reset the redirect handling when the routing
    cache is flushed.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • We initialize the routing metrics with the values cached on the
    inetpeer in rt_init_metrics(). So if we have the metrics cached on the
    inetpeer, we ignore the user configured fib_metrics.

    To fix this issue, we replace the old tree with a fresh initialized
    inet_peer_base. The old tree is removed later with a delayed work queue.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     

18 Jan, 2012

1 commit


17 Jan, 2012

1 commit


07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC 1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, only having 24-bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For ipv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43-bits) and that is fixed here as well.
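
    The resulting generator, roughly (a sketch close to the code this
    change introduced; seq_scale() adds a fine-grained clock so ISNs
    still advance over time, per RFC 1948):

    static u32 net_secret[MD5_MESSAGE_BYTES / 4];   /* boot-time secret */

    __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
                                     __be16 sport, __be16 dport)
    {
            u32 hash[MD5_DIGEST_WORDS];

            hash[0] = (__force u32)saddr;
            hash[1] = (__force u32)daddr;
            hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
            hash[3] = net_secret[15];

            md5_transform(hash, net_secret);        /* keyed one-way mix */

            return seq_scale(hash[0]);      /* full 32 unpredictable bits */
    }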

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

22 Jul, 2011

1 commit

  • IPv6 fragment identification generation is way beyond what we use for
    IPv4 : it uses a single generator. It's not scalable and allows DoS
    attacks.

    Now that inetpeer is IPv6 aware, we can use it to provide a more secure
    and scalable frag ident generator (per destination, instead of system
    wide).

    This patch :
    1) defines a new secure_ipv6_id() helper
    2) extends inet_getid() to provide 32bit results
    3) extends ipv6_select_ident() with a new dest parameter
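
    For point 2), the change is essentially a type widening; a sketch:

    /* before: returned __u16, enough only for IPv4's 16-bit ID field */
    static inline __u32 inet_getid(struct inet_peer *p, int more)
    {
            more++;
            return atomic_add_return(more, &p->ip_id_count) - more;
    }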

    Reported-by: Fernando Gont
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2011

1 commit

  • We currently can free inetpeer entries too early :

    [ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
    [ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
    [ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
    [ 782.636694] ^
    [ 782.636696]
    [ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
    [ 782.636702] EIP: 0060:[] EFLAGS: 00010286 CPU: 0
    [ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0
    [ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
    [ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
    [ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    [ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
    [ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 782.636717] DR6: ffff4ff0 DR7: 00000400
    [ 782.636718] [] rt_set_nexthop.clone.45+0x56/0x220
    [ 782.636722] [] __ip_route_output_key+0x309/0x860
    [ 782.636724] [] tcp_v4_connect+0x124/0x450
    [ 782.636728] [] inet_stream_connect+0xa3/0x270
    [ 782.636731] [] sys_connect+0xa1/0xb0
    [ 782.636733] [] sys_socketcall+0x25d/0x2a0
    [ 782.636736] [] sysenter_do_call+0x12/0x28
    [ 782.636738] [] 0xffffffff

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • Andi Kleen and Tim Chen reported huge contention on the inetpeer
    unused_peers.lock, on a memcached workload on a 40 core machine, with
    the route cache disabled.

    It appears we constantly flip peers refcnt between 0 and 1 values, and
    we must insert/remove peers from unused_peers.list, holding a contended
    spinlock.

    Remove this list completely and perform a garbage collection on-the-fly,
    at lookup time, using the expired nodes we met during the tree
    traversal.

    This removes a lot of code, makes locking more standard, and obsoletes
    two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
    removes two pointers in inet_peer structure.

    There is still a false sharing effect because refcnt is in the first
    cache line of the object [where the links and keys used by lookups are
    located]; we might move it to the end of the inet_peer structure to
    keep this first cache line mostly read by cpus.
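
    The eventual shape of this lookup-time collection is the loop quoted
    at the top of this log; in sketch form (modern refcount_t spelling,
    the 2011 code used atomic_t equivalents):

    /* nodes met during the tree walk were pushed onto gc_stack[] */
    for (i = 0; i < gc_cnt; i++) {
            struct inet_peer *p = gc_stack[i];
            __u32 delta = (__u32)jiffies - p->dtime;

            /* keep entries still referenced or used too recently */
            if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
                    gc_stack[i] = NULL;
    }
    /* whatever survives in gc_stack[] is then unlinked and freed */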

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 May, 2011

1 commit

  • Several crashes in cleanup_once() were reported in recent kernels.

    Commit d6cc1d642de9 (inetpeer: various changes) added a race in
    unlink_from_unused().

    One way to avoid taking unused_peers.lock before doing the list_empty()
    test is to catch 0->1 refcnt transitions, using full barrier atomic
    operations variants (atomic_cmpxchg() and atomic_inc_return()) instead
    of previous atomic_inc() and atomic_add_unless() variants.

    We then call unlink_from_unused() only for the owner of the 0->1
    transition.

    Add a new atomic_add_unless_return() static helper.
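
    The helper is a plain cmpxchg loop that returns the value it
    observed, so the caller can detect exactly the 0 -> 1 transition;
    sketched:

    static int atomic_add_unless_return(atomic_t *ptr, int a, int u)
    {
            int c, old;

            c = atomic_read(ptr);
            for (;;) {
                    if (unlikely(c == u))
                            break;          /* refuse the add at u */
                    old = atomic_cmpxchg(ptr, c, c + a);
                    if (likely(old == c))
                            break;          /* we own this transition */
                    c = old;
            }
            return c;       /* the value seen before our add */
    }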

    With help from Arun Sharma.

    Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772

    Reported-by: Arun Sharma
    Reported-by: Maximilian Engelhardt
    Reported-by: Yann Dupont
    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2011

1 commit

  • On 64bit arches, we use 752 bytes of stack when cleanup_once() is called
    from inet_getpeer().

    Let's share the avl stack to save ~376 bytes.
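
    The shape of the change, sketched (signatures assumed): instead of
    declaring a second PEER_MAXDEPTH-sized array inside the cleanup
    path, the caller's avl path stack is passed down:

    /* before: unlink_from_pool()/cleanup_once() each had their own
     *         struct inet_peer __rcu **stack[PEER_MAXDEPTH]; array */
    static void unlink_from_pool(struct inet_peer *p,
                                 struct inet_peer_base *base,
                                 struct inet_peer __rcu **stack[PEER_MAXDEPTH]);
    static int cleanup_once(unsigned long ttl,
                            struct inet_peer __rcu **stack[PEER_MAXDEPTH]);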

    Before patch :

    # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl

    0x000006c3 unlink_from_pool [inetpeer.o]: 376
    0x00000721 unlink_from_pool [inetpeer.o]: 376
    0x00000cb1 inet_getpeer [inetpeer.o]: 376
    0x00000e6d inet_getpeer [inetpeer.o]: 376
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5320 432 21 5773 168d net/ipv4/inetpeer.o

    After patch :

    objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl
    0x00000c11 inet_getpeer [inetpeer.o]: 376
    0x00000dcd inet_getpeer [inetpeer.o]: 376
    0x00000ab9 peer_check_expire [inetpeer.o]: 328
    0x00000b7f peer_check_expire [inetpeer.o]: 328
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5163 432 21 5616 15f0 net/ipv4/inetpeer.o

    Signed-off-by: Eric Dumazet
    Cc: Scot Doyle
    Cc: Stephen Hemminger
    Cc: Hiroaki SHIMODA
    Reviewed-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Mar, 2011

2 commits

  • After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial
    fast RCU lookup.), we should use call_rcu() to wait for a proper RCU
    grace period.
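
    The substance of the fix is a one-call-site change, sketched:
    lookups now run under rcu_read_lock() rather than
    rcu_read_lock_bh(), so frees must wait for the matching grace
    period:

    /* was: call_rcu_bh(&p->rcu, inetpeer_free_rcu); */
    call_rcu(&p->rcu, inetpeer_free_rcu);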

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4
    (Destination unreachable (Fragmentation needed)),

    icmp_unreach
       -> ip_rt_frag_needed
          (peer->pmtu_expires is set here)
       -> tcp_v4_err
          -> do_pmtu_discovery
             -> ip_rt_update_pmtu
                (peer->pmtu_expires is already set,
                 so check_peer_pmtu is skipped.)
             -> check_peer_pmtu

    check_peer_pmtu is skipped and MTU is not updated.

    To fix this, let check_peer_pmtu execute unconditionally.
    And some minor fixes:
    1) Avoid peer->pmtu_expires potentially being set to zero.
    2) In check_peer_pmtu, the arguments of time_before() were reversed.
    3) check_peer_pmtu expects peer->pmtu_orig to be initialized to zero,
    but it was not initialized.

    Signed-off-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Hiroaki SHIMODA
     

09 Mar, 2011

1 commit


05 Mar, 2011

1 commit

  • David noticed :

    ------------------
    Eric, I was profiling the non-routing-cache case and something that
    stuck out is the case of calling inet_getpeer() with create==0.

    If an entry is not found, we have to redo the lookup under a spinlock
    to make certain that a concurrent writer rebalancing the tree does
    not "hide" an existing entry from us.

    This makes the case of a create==0 lookup for a not-present entry
    really expensive. It is on the order of 600 cpu cycles on my
    Niagara2.

    I added a hack to not do the relookup under the lock when create==0
    and it now costs less than 300 cycles.

    This is now a pretty common operation with the way we handle COW'd
    metrics, so I think it's definitely worth optimizing.
    -----------------

    One solution is to use a seqlock instead of a spinlock to protect struct
    inet_peer_base.

    After a failed avl tree lookup, we can easily detect whether a writer
    made changes during our lookup. Taking the lock and redoing the lookup
    is only necessary in this case.

    Note: Add one private rcu_deref_locked() macro to keep in one spot
    the access to the spinlock embedded in the seqlock.
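
    A sketch of the result: the base lock becomes a seqlock, a create==0
    miss is trusted unless a writer raced with us, and the macro keeps
    the spinlock-inside-seqlock access in one spot (lookup_rcu() is the
    lockless walk, name assumed):

    #define rcu_deref_locked(X, BASE) \
            rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))

    struct inet_peer *p;
    unsigned int seq;
    int invalidated;

    rcu_read_lock();
    seq = read_seqbegin(&base->lock);
    p = lookup_rcu(daddr, base);
    invalidated = read_seqretry(&base->lock, seq);
    rcu_read_unlock();

    if (p)
            return p;

    /* a miss is only trustworthy if no writer changed the tree */
    if (!create && !invalidated)
            return NULL;

    /* otherwise fall back to the lookup under the lock */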

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet