16 May, 2018

1 commit

  • commit b6a37e5e25414df4b8e9140a5c6f5ee0ec6f3b90 upstream.

    syzbot/KMSAN reported that p->dtime was read while it was
    not yet initialized in :

    delta = (__u32)jiffies - p->dtime;
    if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
            gc_stack[i] = NULL;

    This is a false positive, because the inetpeer won't be erased
    from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not
    succeed, and that cannot happen before the first inet_putpeer() call
    for this inetpeer has been done; the ->dtime field is written
    exactly before that call's refcount_dec_and_test(&p->refcnt).
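
    The upstream fix is simply to initialize ->dtime when the peer is
    allocated, so the gc read above is always defined. A sketch of the
    allocation path, not the verbatim patch:

    p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
    if (p) {
            p->daddr = *daddr;
            p->dtime = (__u32)jiffies;      /* defined before any gc read */
            refcount_set(&p->refcnt, 2);    /* 1 for caller, 1 for tree */
    }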

    The KMSAN report was :

    BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
    BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
    CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
    inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
    inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
    icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline]
    icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725
    ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472
    ip_rcv_options net/ipv4/ip_input.c:284 [inline]
    ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
    netif_receive_skb+0x230/0x240 net/core/dev.c:4725
    tun_rx_batched drivers/net/tun.c:1555 [inline]
    tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
    tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
    do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
    do_iter_write+0x30d/0xd40 fs/read_write.c:932
    vfs_writev fs/read_write.c:977 [inline]
    do_writev+0x3c9/0x830 fs/read_write.c:1012
    SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
    SyS_writev+0x56/0x80 fs/read_write.c:1082
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x455111
    RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111
    RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc
    RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000
    R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff
    R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
    inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210
    inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
    ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153
    inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline]
    inet_frag_create net/ipv4/inet_fragment.c:385 [inline]
    inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418
    ip_find net/ipv4/ip_fragment.c:275 [inline]
    ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676
    ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724
    packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447
    deliver_skb net/core/dev.c:1897 [inline]
    deliver_ptype_list_skb net/core/dev.c:1912 [inline]
    __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545
    __netif_receive_skb net/core/dev.c:4627 [inline]
    netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
    netif_receive_skb+0x230/0x240 net/core/dev.c:4725
    tun_rx_batched drivers/net/tun.c:1555 [inline]
    tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
    tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
    do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
    do_iter_write+0x30d/0xd40 fs/read_write.c:932
    vfs_writev fs/read_write.c:977 [inline]
    do_writev+0x3c9/0x830 fs/read_write.c:1012
    SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
    SyS_writev+0x56/0x80 fs/read_write.c:1082
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

29 Sep, 2017

1 commit

  • My prior fix was not complete, as we were dereferencing a pointer
    three times per node, not twice as I initially thought.
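
    The resulting loop reads the RCU-protected child pointer exactly once
    per iteration, into a local, and touches only that local afterwards.
    A simplified sketch of the pattern (per net/ipv4/inetpeer.c, details
    trimmed):

    struct rb_node **pp, *parent, *next;
    struct inet_peer *p;
    int cmp;

    pp = &base->rb_root.rb_node;
    parent = NULL;
    while (1) {
            next = rcu_dereference_raw(*pp);        /* single read per node */
            if (!next)
                    break;
            parent = next;
            p = rb_entry(parent, struct inet_peer, rb_node);
            cmp = inetpeer_addr_cmp(daddr, &p->daddr);
            if (cmp == 0)
                    return p;       /* real code also takes a refcount */
            pp = (cmp == -1) ? &next->rb_left : &next->rb_right;
    }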

    Fixes: 4cc5b44b29a9 ("inetpeer: fix RCU lookup()")
    Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2017

1 commit

  • Excess of seafood or something happened while I cooked the commit
    adding RB tree to inetpeer.

    Of course, RCU rules need to be respected or bad things can happen.

    In this particular loop, we need to read *pp once per iteration, not
    twice.
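
    Reduced to the essential lines, the bug and the fix look like this
    (a sketch; *pp points into an RCU-managed tree that a writer may
    rebalance at any time):

    /* bad: two reads of *pp per iteration can observe two different
     * nodes if a writer rebalances the tree in between */
    if (rcu_dereference_raw(*pp) == NULL)
            break;
    parent = rcu_dereference_raw(*pp);

    /* good: read once, then use only the local copy */
    parent = rcu_dereference_raw(*pp);
    if (!parent)
            break;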

    Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
    Reported-by: John Sperbeck
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jul, 2017

1 commit

  • As discussed in Faro during Netfilter Workshop 2017, RB trees can be
    used with RCU, using a seqlock.

    Note that net/rxrpc/conn_service.c is already using this.

    This patch converts inetpeer from an AVL tree to an RB tree, since it
    allows us to remove the private AVL implementation in favor of the
    shared RB code.

    $ size net/ipv4/inetpeer.before net/ipv4/inetpeer.after
       text    data    bss     dec     hex  filename
       3195      40    128    3363     d23  net/ipv4/inetpeer.before
       1562      24      0    1586     632  net/ipv4/inetpeer.after

    The same technique can be used to speed up
    net/netfilter/nft_set_rbtree.c (removing rwlock contention in fast path)
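
    The read side pairs RCU with a seqlock retry: walk the tree
    locklessly, and trust a miss only if no writer rebalanced the tree
    during the walk. A minimal sketch of the pattern (helper names
    assumed):

    struct inet_peer *p;
    unsigned int seq;

    rcu_read_lock();
    seq = read_seqbegin(&base->lock);
    p = rb_lookup_rcu(daddr, base);         /* lockless walk; a concurrent
                                             * rebalance may mislead it */
    if (!p && read_seqretry(&base->lock, seq)) {
            /* a writer interfered: redo the lookup under the lock */
            write_seqlock_bh(&base->lock);
            p = rb_lookup(daddr, base);
            write_sequnlock_bh(&base->lock);
    }
    rcu_read_unlock();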

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2017

1 commit

    refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
    This conversion requires an overall +1 shift of the whole
    refcounting scheme.
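
    A generic before/after sketch of such a conversion (not the exact
    inetpeer diff): refcount_t refuses to increment from zero, so a
    scheme that used 0 for a live-but-unreferenced object must shift
    every state up by one:

    /* before: atomic_t, 0 could mean "in tree, no users" */
    atomic_set(&p->refcnt, 1);

    /* after: refcount_t, 0 strictly means "dead"; all states are +1 */
    refcount_set(&p->refcnt, 2);            /* 1 for caller, 1 for tree */
    if (!refcount_inc_not_zero(&p->refcnt)) /* lookup-side increment */
            p = NULL;                       /* object already dying */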

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

29 Aug, 2015

1 commit


09 Sep, 2014

1 commit

    inetpeer sequence numbers are no longer incremented, so there is no
    need to check and flush the tree. The function that increments the
    sequence number was already dead code and was removed in "ipv4: remove
    unused function" (068a6e18). Remove the code that checks for a change,
    too.

    Verifying that v4_seq and v6_seq are never incremented and thus that
    flush_check compares bp->flush_seq to 0 is trivial.

    The second part of the change removes flush_check completely, even
    though bp->flush_seq is exactly !0 once, at initialization. This
    change is correct because the only time this branch is true is when
    bp->root == peer_avl_empty_rcu, in which case both the branch and
    inetpeer_invalidate_tree are a NOOP.

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

03 Jun, 2014

1 commit

  • Ideally, we would need to generate IP ID using a per destination IP
    generator.

    Linux kernels used the inet_peer cache for this purpose, but this had a
    huge cost on servers with MTU discovery disabled.

    1) each inet_peer struct consumes 192 bytes

    2) inetpeer cache uses a binary tree of inet_peer structs,
    with a nominal size of ~66000 elements under load.

    3) lookups in this tree are hitting a lot of cache lines, as tree depth
    is about 20.

    4) If the server deals with many tcp flows, we have a high probability
    of not finding the inet_peer, allocating a fresh one, and inserting it
    in the tree with the same initial ip_id_count (cf. secure_ip_id())

    5) We garbage collect inet_peer aggressively.

    IP ID generation does not have to be 'perfect'.

    The goal is to avoid duplicates in a short period of time,
    so that reassembly units have a chance to complete reassembly of
    fragments belonging to one message before receiving other fragments
    with a recycled ID.

    We simply use an array of generators, and a Jenkins hash (jhash) of the
    dst IP as the key.

    ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
    belongs (it is only used from this file)

    secure_ip_id() and secure_ipv6_id() are no longer needed.

    Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
    unnecessary decrement/increment of the number of segments.
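
    A simplified sketch of the replacement (bucket count and helper name
    assumed):

    #define IP_IDENTS_SZ 2048u

    static atomic_t ip_idents[IP_IDENTS_SZ];
    static u32 ip_idents_hashrnd;           /* random seed, set once */

    static void ip_select_ident_segs_sketch(struct iphdr *iph, int segs)
    {
            /* pick a generator from a hash of the destination address */
            u32 bucket = jhash_1word((__force u32)iph->daddr,
                                     ip_idents_hashrnd) % IP_IDENTS_SZ;

            /* reserve 'segs' consecutive IDs so TSO sub-segments stay
             * unique without touching the counter once per segment */
            iph->id = htons(atomic_add_return(segs, &ip_idents[bucket])
                            - segs);
    }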

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2014

1 commit

  • Do not initialize list twice.
    list_replace_init() already takes care of initializing list.
    We don't need to initialize it with LIST_HEAD() beforehand.
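
    For reference, the pattern in question, sketched in its likely
    gc-worker context: list_replace_init() takes over the old list and
    fully re-initializes the destination head itself:

    struct list_head list;  /* no LIST_HEAD()/INIT_LIST_HEAD() needed */

    spin_lock_bh(&gc_lock);
    list_replace_init(&gc_list, &list);     /* 'list' initialized here */
    spin_unlock_bh(&gc_lock);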

    Signed-off-by: xiao jin
    Reviewed-by: David Cohen
    Signed-off-by: David S. Miller

    xiao jin
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Dec, 2013

1 commit


27 Dec, 2013

1 commit


20 Sep, 2013

1 commit

  • If local fragmentation is allowed, then ip_select_ident() and
    ip_select_ident_more() need to generate unique IDs to ensure
    correct defragmentation on the peer.

    For example, if IPsec (tunnel mode) has to encrypt large skbs
    that have the local_df bit set, then all IP fragments that belonged
    to different ESP datagrams would have used the same identifier.
    If one of these IP fragments were lost or reordered, the peer
    could stitch together the wrong IP fragments, ones that did
    not belong to the same datagram. This would lead to packet loss
    or data corruption.
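
    A sketch of the corrected helper (simplified; skb->local_df is the
    field name of that era, later renamed ignore_df):

    static inline void ip_select_ident(struct sk_buff *skb,
                                       struct dst_entry *dst,
                                       struct sock *sk)
    {
            struct iphdr *iph = ip_hdr(skb);

            /* the constant-ID fast path is only safe when DF is set
             * AND we are guaranteed never to fragment locally */
            if ((iph->frag_off & htons(IP_DF)) && !skb->local_df)
                    iph->id = 0;    /* per-socket counter path omitted */
            else
                    __ip_select_ident(iph, dst, 0); /* unique, per peer */
    }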

    Signed-off-by: Ansis Atteka
    Signed-off-by: David S. Miller

    Ansis Atteka
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • When jiffies wraps around (for example, 5 minutes after the boot, see
    INITIAL_JIFFIES) and a peer has just been created, now - peer->rate_last
    can be < XRLIM_BURST_FACTOR * timeout, so the token is not set to the
    maximum value, thus some icmp packets can be unexpectedly dropped.

    Fix this case by initializing rate_last to 60 seconds in the past.
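
    The idea in one line, sketched: pretend the last rate-limited event
    happened a minute ago, so a fresh peer starts with a full token
    bucket even right after a jiffies wrap:

    p->rate_last = jiffies - 60*HZ;  /* 60s in the past: full burst */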

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

22 Aug, 2012

1 commit


11 Jul, 2012

2 commits


21 Jun, 2012

1 commit

  • No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold
    base lock.

    Also use correct rcu annotations to remove sparse errors
    (CONFIG_SPARSE_RCU_POINTER=y)

    net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison expression (different address spaces)
    net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison expression (different address spaces)
    net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison expression (different address spaces)
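
    Since the writer path already holds base->lock, a plain annotated
    store is enough; a sketch of the lock-held accessor implied by the
    sparse fixes (macro shape assumed):

    #define rcu_deref_locked(X, BASE) \
            rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))

    /* inetpeer_invalidate_tree(), with base->lock held */
    root = rcu_deref_locked(base->root, base);
    if (root != peer_avl_empty) {
            base->root = peer_avl_empty_rcu;    /* plain store, no cmpxchg */
            base->total = 0;
    }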

    Signed-off-by: Eric Dumazet
    Cc: Steffen Klassert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jun, 2012

1 commit

  • This implementation can deal with having many inetpeer roots, which is
    a necessary prerequisite for per-FIB table rooted peer tables.

    Each family (AF_INET, AF_INET6) has a sequence number which we bump
    when we get a family invalidation request.

    Each peer lookup cheaply checks whether the flush sequence of the
    root we are using is out of date, and if so flushes it and updates
    the sequence number.
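
    A sketch of the cheap per-lookup check (field and variable names
    assumed from the description above):

    static atomic_t v4_seq, v6_seq;     /* bumped on family invalidation */

    static void flush_check(struct inet_peer_base *base, int family)
    {
            int seq = atomic_read(family == AF_INET ? &v4_seq : &v6_seq);

            if (base->flush_seq != seq) {   /* root is stale: flush lazily */
                    inetpeer_invalidate_tree(base);
                    base->flush_seq = seq;
            }
    }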

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

3 commits


09 Jun, 2012

1 commit

  • inetpeer currently doesn't support namespaces, so information can
    leak across namespaces.

    This patch moves the global vars v4_peers and v6_peers into
    netns_ipv4 and netns_ipv6 as a 'peers' field,

    adds a struct pernet_operations inetpeer_ops to initialize the
    pernet inetpeer data,

    and changes family_to_base() and inet_getpeer() to support namespaces.
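
    A sketch of the pernet wiring (simplified to the v4 side; the real
    patch handles v6 too, and inet_peer_base_init() is an assumed
    helper):

    static int __net_init inetpeer_net_init(struct net *net)
    {
            net->ipv4.peers = kzalloc(sizeof(struct inet_peer_base),
                                      GFP_KERNEL);
            if (!net->ipv4.peers)
                    return -ENOMEM;
            inet_peer_base_init(net->ipv4.peers);
            return 0;
    }

    static void __net_exit inetpeer_net_exit(struct net *net)
    {
            inetpeer_invalidate_tree(net->ipv4.peers);
            kfree(net->ipv4.peers);
    }

    static struct pernet_operations inetpeer_ops = {
            .init = inetpeer_net_init,
            .exit = inetpeer_net_exit,
    };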

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

07 Jun, 2012

1 commit

  • commit 5faa5df1fa2024 (inetpeer: Invalidate the inetpeer tree along with
    the routing cache) added a race :

    Before freeing an inetpeer, we must respect an RCU grace period, and make
    sure no user will attempt to increase its refcnt.

    inetpeer_invalidate_tree() now waits for an RCU grace period before
    inserting the inetpeer tree into gc_list and waking the worker. At that
    point, no concurrent lookup can find an inetpeer in this tree.
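
    In outline, the corrected sequence (a sketch, not the verbatim
    patch; the rcu head and callback names are assumed):

    write_seqlock_bh(&base->lock);
    root = rcu_deref_locked(base->root, base);  /* lock-held accessor,
                                                 * sketched earlier */
    base->root = peer_avl_empty_rcu;            /* detach from new lookups */
    write_sequnlock_bh(&base->lock);

    /* after a full grace period, no reader can still be walking the
     * detached tree; only then queue it for the gc worker */
    call_rcu(&root->gc_rcu, inetpeer_gc_rcu_cb);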

    Signed-off-by: Eric Dumazet
    Cc: Steffen Klassert
    Acked-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2012

2 commits

  • As we invalidate the inetpeer tree along with the routing cache now,
    we don't need a genid to reset the redirect handling when the routing
    cache is flushed.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • We initialize the routing metrics with the values cached on the
    inetpeer in rt_init_metrics(). So if we have the metrics cached on the
    inetpeer, we ignore the user configured fib_metrics.

    To fix this issue, we replace the old tree with a fresh initialized
    inet_peer_base. The old tree is removed later with a delayed work queue.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     

18 Jan, 2012

1 commit


17 Jan, 2012

1 commit


07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC 1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, only having 24-bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For ipv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43-bits) and that is fixed here as well.
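
    The resulting generator, roughly (a sketch close to the code this
    change introduced; seq_scale() adds a fine-grained clock so ISNs
    still advance over time, per RFC 1948):

    static u32 net_secret[MD5_MESSAGE_BYTES / 4];   /* boot-time secret */

    __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
                                     __be16 sport, __be16 dport)
    {
            u32 hash[MD5_DIGEST_WORDS];

            hash[0] = (__force u32)saddr;
            hash[1] = (__force u32)daddr;
            hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
            hash[3] = net_secret[15];

            md5_transform(hash, net_secret);        /* keyed one-way mix */

            return seq_scale(hash[0]);      /* full 32 unpredictable bits */
    }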

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

22 Jul, 2011

1 commit

  • IPv6 fragment identification generation is way beyond what we use for
    IPv4 : it uses a single generator. It's not scalable and allows DoS
    attacks.

    Now that inetpeer is IPv6 aware, we can use it to provide a more secure
    and scalable frag ident generator (per destination, instead of system
    wide).

    This patch :
    1) defines a new secure_ipv6_id() helper
    2) extends inet_getid() to provide 32bit results
    3) extends ipv6_select_ident() with a new dest parameter
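
    For point 2), the change is essentially a type widening; a sketch:

    /* before: returned __u16, enough only for IPv4's 16-bit ID field */
    static inline __u32 inet_getid(struct inet_peer *p, int more)
    {
            more++;
            return atomic_add_return(more, &p->ip_id_count) - more;
    }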

    Reported-by: Fernando Gont
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2011

1 commit

  • We currently can free inetpeer entries too early :

    [ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
    [ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
    [ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
    [ 782.636694] ^
    [ 782.636696]
    [ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
    [ 782.636702] EIP: 0060:[] EFLAGS: 00010286 CPU: 0
    [ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0
    [ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
    [ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
    [ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    [ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
    [ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 782.636717] DR6: ffff4ff0 DR7: 00000400
    [ 782.636718] [] rt_set_nexthop.clone.45+0x56/0x220
    [ 782.636722] [] __ip_route_output_key+0x309/0x860
    [ 782.636724] [] tcp_v4_connect+0x124/0x450
    [ 782.636728] [] inet_stream_connect+0xa3/0x270
    [ 782.636731] [] sys_connect+0xa1/0xb0
    [ 782.636733] [] sys_socketcall+0x25d/0x2a0
    [ 782.636736] [] sysenter_do_call+0x12/0x28
    [ 782.636738] [] 0xffffffff

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • Andi Kleen and Tim Chen reported huge contention on the inetpeer
    unused_peers.lock, on a memcached workload on a 40 core machine, with
    the route cache disabled.

    It appears we constantly flip peers refcnt between 0 and 1 values, and
    we must insert/remove peers from unused_peers.list, holding a contended
    spinlock.

    Remove this list completely and perform a garbage collection on-the-fly,
    at lookup time, using the expired nodes we met during the tree
    traversal.

    This removes a lot of code, makes locking more standard, and obsoletes
    two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
    removes two pointers in inet_peer structure.

    There is still a false sharing effect because refcnt is in the first
    cache line of the object [where the links and keys used by lookups are
    located]; we might move it to the end of the inet_peer structure to
    keep this first cache line mostly read by cpus.
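
    The eventual shape of this lookup-time collection is the loop quoted
    at the top of this log; in sketch form (modern refcount_t spelling,
    the 2011 code used atomic_t equivalents):

    /* nodes met during the tree walk were pushed onto gc_stack[] */
    for (i = 0; i < gc_cnt; i++) {
            struct inet_peer *p = gc_stack[i];
            __u32 delta = (__u32)jiffies - p->dtime;

            /* keep entries still referenced or used too recently */
            if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
                    gc_stack[i] = NULL;
    }
    /* whatever survives in gc_stack[] is then unlinked and freed */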

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 May, 2011

1 commit

  • Several crashes in cleanup_once() were reported in recent kernels.

    Commit d6cc1d642de9 (inetpeer: various changes) added a race in
    unlink_from_unused().

    One way to avoid taking unused_peers.lock before doing the list_empty()
    test is to catch 0->1 refcnt transitions, using full barrier atomic
    operations variants (atomic_cmpxchg() and atomic_inc_return()) instead
    of previous atomic_inc() and atomic_add_unless() variants.

    We then call unlink_from_unused() only for the owner of the 0->1
    transition.

    Add a new atomic_add_unless_return() static helper.
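
    The helper is a plain cmpxchg loop that returns the value it
    observed, so the caller can detect exactly the 0 -> 1 transition;
    sketched:

    static int atomic_add_unless_return(atomic_t *ptr, int a, int u)
    {
            int c, old;

            c = atomic_read(ptr);
            for (;;) {
                    if (unlikely(c == u))
                            break;          /* refuse the add at u */
                    old = atomic_cmpxchg(ptr, c, c + a);
                    if (likely(old == c))
                            break;          /* we own this transition */
                    c = old;
            }
            return c;       /* the value seen before our add */
    }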

    With help from Arun Sharma.

    Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772

    Reported-by: Arun Sharma
    Reported-by: Maximilian Engelhardt
    Reported-by: Yann Dupont
    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2011

1 commit

  • On 64bit arches, we use 752 bytes of stack when cleanup_once() is called
    from inet_getpeer().

    Let's share the avl stack to save ~376 bytes.
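
    The shape of the change, sketched (signatures assumed): instead of
    declaring a second PEER_MAXDEPTH-sized array inside the cleanup
    path, the caller's avl path stack is passed down:

    /* before: unlink_from_pool()/cleanup_once() each had their own
     *         struct inet_peer __rcu **stack[PEER_MAXDEPTH]; array */
    static void unlink_from_pool(struct inet_peer *p,
                                 struct inet_peer_base *base,
                                 struct inet_peer __rcu **stack[PEER_MAXDEPTH]);
    static int cleanup_once(unsigned long ttl,
                            struct inet_peer __rcu **stack[PEER_MAXDEPTH]);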

    Before patch :

    # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl

    0x000006c3 unlink_from_pool [inetpeer.o]: 376
    0x00000721 unlink_from_pool [inetpeer.o]: 376
    0x00000cb1 inet_getpeer [inetpeer.o]: 376
    0x00000e6d inet_getpeer [inetpeer.o]: 376
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5320 432 21 5773 168d net/ipv4/inetpeer.o

    After patch :

    objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl
    0x00000c11 inet_getpeer [inetpeer.o]: 376
    0x00000dcd inet_getpeer [inetpeer.o]: 376
    0x00000ab9 peer_check_expire [inetpeer.o]: 328
    0x00000b7f peer_check_expire [inetpeer.o]: 328
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5163 432 21 5616 15f0 net/ipv4/inetpeer.o

    Signed-off-by: Eric Dumazet
    Cc: Scot Doyle
    Cc: Stephen Hemminger
    Cc: Hiroaki SHIMODA
    Reviewed-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Mar, 2011

2 commits

  • After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial
    fast RCU lookup.), we should use call_rcu() to wait for a proper RCU
    grace period.
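
    The substance of the fix is a one-call-site change, sketched:
    lookups now run under rcu_read_lock() rather than
    rcu_read_lock_bh(), so frees must wait for the matching grace
    period:

    /* was: call_rcu_bh(&p->rcu, inetpeer_free_rcu); */
    call_rcu(&p->rcu, inetpeer_free_rcu);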

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4
    (Destination unreachable (Fragmentation needed)),

    icmp_unreach
       -> ip_rt_frag_needed
          (peer->pmtu_expires is set here)
       -> tcp_v4_err
          -> do_pmtu_discovery
             -> ip_rt_update_pmtu
                (peer->pmtu_expires is already set,
                 so check_peer_pmtu is skipped.)
             -> check_peer_pmtu

    check_peer_pmtu is skipped and MTU is not updated.

    To fix this, let check_peer_pmtu execute unconditionally.
    And some minor fixes:
    1) Avoid peer->pmtu_expires potentially being set to zero.
    2) In check_peer_pmtu, the arguments of time_before() were reversed.
    3) check_peer_pmtu expects peer->pmtu_orig to be initialized to zero,
    but it was not initialized.

    Signed-off-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Hiroaki SHIMODA
     

09 Mar, 2011

1 commit


05 Mar, 2011

1 commit

  • David noticed :

    ------------------
    Eric, I was profiling the non-routing-cache case and something that
    stuck out is the case of calling inet_getpeer() with create==0.

    If an entry is not found, we have to redo the lookup under a spinlock
    to make certain that a concurrent writer rebalancing the tree does
    not "hide" an existing entry from us.

    This makes the case of a create==0 lookup for a not-present entry
    really expensive. It is on the order of 600 cpu cycles on my
    Niagara2.

    I added a hack to not do the relookup under the lock when create==0
    and it now costs less than 300 cycles.

    This is now a pretty common operation with the way we handle COW'd
    metrics, so I think it's definitely worth optimizing.
    -----------------

    One solution is to use a seqlock instead of a spinlock to protect struct
    inet_peer_base.

    After a failed avl tree lookup, we can easily detect whether a writer
    made changes during our lookup. Taking the lock and redoing the lookup
    is only necessary in this case.

    Note: Add one private rcu_deref_locked() macro to keep in one spot
    the access to the spinlock embedded in the seqlock.
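
    A sketch of the result: the base lock becomes a seqlock, a create==0
    miss is trusted unless a writer raced with us, and the macro keeps
    the spinlock-inside-seqlock access in one spot (lookup_rcu() is the
    lockless walk, name assumed):

    #define rcu_deref_locked(X, BASE) \
            rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))

    struct inet_peer *p;
    unsigned int seq;
    int invalidated;

    rcu_read_lock();
    seq = read_seqbegin(&base->lock);
    p = lookup_rcu(daddr, base);
    invalidated = read_seqretry(&base->lock, seq);
    rcu_read_unlock();

    if (p)
            return p;

    /* a miss is only trustworthy if no writer changed the tree */
    if (!create && !invalidated)
            return NULL;

    /* otherwise fall back to the lookup under the lock */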

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet