14 Nov, 2008

1 commit

  • During tbench/oprofile sessions, I found that dst_release() was in third position.

    CPU: Core 2, speed 2999.68 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples % symbol name
    483726 9.0185 __copy_user_zeroing_intel
    191466 3.5697 __copy_user_intel
    185475 3.4580 dst_release
    175114 3.2648 ip_queue_xmit
    153447 2.8608 tcp_sendmsg
    108775 2.0280 tcp_recvmsg
    102659 1.9140 sysenter_past_esp
    101450 1.8914 tcp_current_mss
    95067 1.7724 __copy_from_user_ll
    86531 1.6133 tcp_transmit_skb

    Of course, all CPUs fight over the dst_entry associated with 127.0.0.1.

    Instead of first checking the refcount value and then decrementing it,
    we use atomic_dec_return() to help the CPU make the right memory
    transaction (i.e. getting the cache line in exclusive mode).

    dst_release() is now in fifth position, and tbench is a little bit faster ;)

    CPU: Core 2, speed 3000.1 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples % symbol name
    647107 8.8072 __copy_user_zeroing_intel
    258840 3.5229 ip_queue_xmit
    258302 3.5155 __copy_user_intel
    209629 2.8531 tcp_sendmsg
    165632 2.2543 dst_release
    149232 2.0311 tcp_current_mss
    147821 2.0119 tcp_recvmsg
    137893 1.8767 sysenter_past_esp
    127473 1.7349 __copy_from_user_ll
    121308 1.6510 ip_finish_output
    118510 1.6129 tcp_transmit_skb
    109295 1.4875 tcp_v4_rcv

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
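
The difference between the two release patterns described in this commit can be sketched in user space with C11 atomics. This is only an illustration: `struct refcounted` and both helper names are hypothetical, and `atomic_fetch_sub()` stands in for the kernel's `atomic_dec_return()`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference count held in a dst_entry. */
struct refcounted {
    atomic_int refcnt;
};

/* Check-then-decrement: the load may fetch the cache line in shared
 * state, and the decrement then has to upgrade it to exclusive. */
static bool release_check_then_dec(struct refcounted *r)
{
    bool underflow = atomic_load(&r->refcnt) < 1; /* first transaction */
    atomic_fetch_sub(&r->refcnt, 1);              /* second transaction */
    return underflow;
}

/* Decrement-and-return: a single read-modify-write that asks for the
 * line in exclusive mode up front; the sanity check uses the result. */
static bool release_dec_return(struct refcounted *r)
{
    int newval = atomic_fetch_sub(&r->refcnt, 1) - 1;
    return newval < 0;
}
```

Both helpers report the same underflow condition; the second one simply touches the shared counter once instead of twice.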
     

13 Sep, 2008

1 commit

  • The dst garbage collector dst_gc_task() may not be scheduled as we
    expect it to be in __dst_free().

    Indeed, when the dst_gc_timer was replaced by the delayed_work
    dst_gc_work, the mod_timer() call used to schedule the garbage
    collector at an earlier date was replaced by a schedule_delayed_work()
    (see commit 86bba269d08f0c545ae76c90b56727f65d62d57f).

    However, mod_timer() and schedule_delayed_work() differ in the way
    they handle the delay.

    mod_timer() stops the timer and re-arms it with the new delay,
    whereas schedule_delayed_work() only checks whether the work is already
    queued in the workqueue (and queues it, with the delay, if it is not),
    BUT it does NOT take the new delay into account (even if the new delay
    is earlier in time).
    schedule_delayed_work() returns 0 if it didn't queue the work,
    but we don't check the return code in __dst_free().

    If I understand the code in __dst_free() correctly, we want dst_gc_task
    to be queued after DST_GC_INC jiffies if we pass the test (and not in
    some undetermined time in the future), so I think we should add a call
    to cancel_delayed_work() before schedule_delayed_work(). Patch below.

    Or we should at least test the return code of schedule_delayed_work(),
    and reset the values of dst_garbage.timer_inc and dst_garbage.timer_expires
    back to their former values if schedule_delayed_work() failed.
    Otherwise subsequent calls to __dst_free() will test the wrong values
    and assume the wrong thing about when the garbage collector is supposed
    to be scheduled.

    dst_gc_task() also calls schedule_delayed_work() without checking
    its return code (or calling cancel_delayed_work() first), but it
    should be fine there: dst_gc_task is the routine of the delayed_work, so
    no dst_gc_work should be pending in the queue when it's running.

    Signed-off-by: Benjamin Thery
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Benjamin Thery
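
The semantic difference this commit describes can be modelled with a small toy in user space. Nothing here is the kernel API: `struct toy_work` and the `toy_*` functions are hypothetical and only mimic the behaviour stated in the message (re-arm always, versus queue only if not already pending).

```c
#include <stdbool.h>

/* Toy model of a one-shot deferred job; not the kernel API. */
struct toy_work {
    bool pending;
    unsigned long expires; /* absolute time, in ticks */
};

/* Like mod_timer(): always re-arm with the new expiry. */
static void toy_mod_timer(struct toy_work *w, unsigned long expires)
{
    w->pending = true;
    w->expires = expires;
}

/* Like schedule_delayed_work(): queue only if not already pending;
 * otherwise return false and keep the old expiry. */
static bool toy_schedule(struct toy_work *w, unsigned long now,
                         unsigned long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Like cancel_delayed_work(): drop a pending job, report if there was one. */
static bool toy_cancel(struct toy_work *w)
{
    bool was_pending = w->pending;
    w->pending = false;
    return was_pending;
}

/* The fix proposed in the message: cancel first, then schedule. */
static void toy_reschedule(struct toy_work *w, unsigned long now,
                           unsigned long delay)
{
    toy_cancel(w);
    toy_schedule(w, now, delay);
}
```

In this model, a second `toy_schedule()` call with a shorter delay leaves the original expiry untouched, which is the surprise the commit addresses; `toy_reschedule()` restores the behaviour that `toy_mod_timer()` used to give.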
     

28 Mar, 2008

1 commit

  • Codiff stats (allyesconfig, v2.6.24-mm1):
    -16420 187 funcs, 103 +, 16523 -, diff: -16420 --- dst_release

    Without a number of debug-related CONFIGs (v2.6.25-rc2-mm1):
    -7257 186 funcs, 70 +, 7327 -, diff: -7257 --- dst_release
    dst_release | +40

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

26 Mar, 2008

1 commit


29 Feb, 2008

1 commit


29 Jan, 2008

4 commits

  • The garbage collection function receives the dst_ops structure as
    a parameter. This is useful for the next incoming patchset because it
    will need the dst_ops (there will be several instances) and the
    network namespace pointer (contained in the dst_ops).

    The protocols which do not take care of the namespaces will not be
    impacted by this change (except for the function signature); they
    just ignore the parameter.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • This cleanup shrinks the size of net/core/dst.o on i386 from 1299 to 1289 bytes.
    (This is because dev_hold()/dev_put() do atomic_inc()/atomic_dec() and
    force the compiler to re-evaluate memory contents.)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Move dst entries to a namespace loopback to catch refcounting leaks.

    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • We have a number of copies of dst_discard scattered around the place
    which all do the same thing, namely free a packet on the input or
    output paths.

    This patch deletes all of them except dst_discard and points all the
    users to it.

    The only non-trivial bit is decnet where it returns an error.
    However, conceptually this is identical to the blackhole functions
    used in IPv4 and IPv6 which do not return errors. So they should
    either all return errors or all return zero. For now I've stuck with
    the majority and picked zero as the return value.

    It doesn't really matter in practice since few if any drivers would
    react differently depending on a zero return value or NET_RX_DROP.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

07 Nov, 2007

1 commit


11 Oct, 2007

4 commits

  • This patch makes loopback_dev per network namespace, adding
    code to create a different loopback device for each network
    namespace and code to free a loopback device
    when a network namespace exits.

    This patch modifies all users of loopback_dev so they
    access it as init_net.loopback_dev, keeping all of the
    code compiling and working. A later pass will be needed to
    update the users to use something other than the initial network
    namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch replaces all occurrences of the static variable
    loopback_dev with a pointer loopback_dev. That provides the
    mindless, trivial, uninteresting part of the change for the dynamic
    allocation of the loopback device.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Acked-By: Kirill Korotaev
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • When the periodic IP route cache flush is done (every 600 seconds in
    the default configuration), some hosts suffer a lot and eventually
    trigger the "soft lockup" message.

    dst_run_gc() is doing a scan of a possibly huge list of dst_entries,
    eventually freeing some (less than 1%) of them, while holding the
    dst_lock spinlock for the whole scan.

    Then it rearms a timer to redo the full thing 1/10 s later...
    The slowdown can last one minute or so, depending on how active
    the TCP sessions are.

    This second version of the patch converts the processing from a softirq
    based one to a workqueue.

    Even if the list of entries in garbage_list is huge, host is still
    responsive to softirqs and can make progress.

    Instead of resetting gc timer to 0.1 second if one entry was freed in a
    gc run, we do this if more than 10% of entries were freed.

    Before patch :

    Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0!
    Aug 16 06:21:37 SRV1 kernel:
    Aug 16 06:21:37 SRV1 kernel: Call Trace:
    Aug 16 06:21:37 SRV1 kernel: [] wake_up_process+0x10/0x20
    Aug 16 06:21:37 SRV1 kernel: [] softlockup_tick+0xe9/0x110
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x0/0x140
    Aug 16 06:21:37 SRV1 kernel: [] run_local_timers+0x13/0x20
    Aug 16 06:21:37 SRV1 kernel: [] update_process_times+0x57/0x90
    Aug 16 06:21:37 SRV1 kernel: [] smp_local_timer_interrupt+0x34/0x60
    Aug 16 06:21:37 SRV1 kernel: [] smp_apic_timer_interrupt+0x5c/0x80
    Aug 16 06:21:37 SRV1 kernel: [] apic_timer_interrupt+0x66/0x70
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x53/0x140
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x46/0x140
    Aug 16 06:21:37 SRV1 kernel: [] run_timer_softirq+0x148/0x1c0
    Aug 16 06:21:37 SRV1 kernel: [] __do_softirq+0x6c/0xe0
    Aug 16 06:21:37 SRV1 kernel: [] call_softirq+0x1c/0x30
    Aug 16 06:21:37 SRV1 kernel: [] do_softirq+0x34/0x90
    Aug 16 06:21:37 SRV1 kernel: [] local_bh_enable_ip+0x3f/0x60
    Aug 16 06:21:37 SRV1 kernel: [] _spin_unlock_bh+0x13/0x20
    Aug 16 06:21:37 SRV1 kernel: [] rt_garbage_collect+0x1d8/0x320
    Aug 16 06:21:37 SRV1 kernel: [] dst_alloc+0x1d/0xa0
    Aug 16 06:21:37 SRV1 kernel: [] __ip_route_output_key+0x573/0x800
    Aug 16 06:21:37 SRV1 kernel: [] sock_common_recvmsg+0x32/0x50
    Aug 16 06:21:37 SRV1 kernel: [] ip_route_output_flow+0x1c/0x60
    Aug 16 06:21:37 SRV1 kernel: [] tcp_v4_connect+0x150/0x610
    Aug 16 06:21:37 SRV1 kernel: [] inet_bind_bucket_create+0x17/0x60
    Aug 16 06:21:37 SRV1 kernel: [] inet_stream_connect+0xa6/0x2c0
    Aug 16 06:21:37 SRV1 kernel: [] _spin_lock_bh+0x11/0x30
    Aug 16 06:21:37 SRV1 kernel: [] lock_sock_nested+0xcf/0xe0
    Aug 16 06:21:37 SRV1 kernel: [] _spin_lock_bh+0x11/0x30
    Aug 16 06:21:37 SRV1 kernel: [] sys_connect+0x71/0xa0
    Aug 16 06:21:37 SRV1 kernel: [] tcp_setsockopt+0x1f/0x30
    Aug 16 06:21:37 SRV1 kernel: [] sock_common_setsockopt+0xf/0x20
    Aug 16 06:21:37 SRV1 kernel: [] sys_setsockopt+0x9d/0xc0
    Aug 16 06:21:37 SRV1 kernel: [] sys_ioctl+0x5e/0x80
    Aug 16 06:21:37 SRV1 kernel: [] system_call+0x7e/0x83

    After patch : (RT_CACHE_DEBUG set to 2 to get following traces)

    dst_total: 75469 delayed: 74109 work_perf: 141 expires: 150 elapsed: 8092 us
    dst_total: 78725 delayed: 73366 work_perf: 743 expires: 400 elapsed: 8542 us
    dst_total: 86126 delayed: 71844 work_perf: 1522 expires: 775 elapsed: 8849 us
    dst_total: 100173 delayed: 68791 work_perf: 3053 expires: 1256 elapsed: 9748 us
    dst_total: 121798 delayed: 64711 work_perf: 4080 expires: 1997 elapsed: 10146 us
    dst_total: 154522 delayed: 58316 work_perf: 6395 expires: 25 elapsed: 11402 us
    dst_total: 154957 delayed: 58252 work_perf: 64 expires: 150 elapsed: 6148 us
    dst_total: 157377 delayed: 57843 work_perf: 409 expires: 400 elapsed: 6350 us
    dst_total: 163745 delayed: 56679 work_perf: 1164 expires: 775 elapsed: 7051 us
    dst_total: 176577 delayed: 53965 work_perf: 2714 expires: 1389 elapsed: 8120 us
    dst_total: 198993 delayed: 49627 work_perf: 4338 expires: 1997 elapsed: 8909 us
    dst_total: 226638 delayed: 46865 work_perf: 2762 expires: 2748 elapsed: 7351 us

    I successfully reduced the IP route cache of many hosts by a factor of
    four thanks to this patch. Previously, I had to disable "ip route flush
    cache" to avoid crashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
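
The back-off rule described in this commit (reset the GC delay only when more than 10% of the entries were freed) can be sketched as follows. This is a minimal sketch under stated assumptions: the constants `GC_MIN`, `GC_INC` and `GC_MAX`, the `struct gc_state` layout and the function name are hypothetical, and "10% of entries" is taken to mean 10% of the entries still delayed after the run.

```c
/* Hypothetical constants, in ticks; not the kernel's actual values. */
enum { GC_MIN = 10, GC_INC = 50, GC_MAX = 12000 };

struct gc_state {
    unsigned long timer_inc;     /* current back-off step */
    unsigned long timer_expires; /* delay before the next run */
};

/* Decide the delay before the next GC run.
 * delayed:        entries that could not be freed this run
 * work_performed: entries freed this run */
static unsigned long gc_next_delay(struct gc_state *s,
                                   unsigned long delayed,
                                   unsigned long work_performed)
{
    if (work_performed <= delayed / 10) {
        /* Little progress: back off, growing the step each time. */
        s->timer_expires += s->timer_inc;
        if (s->timer_expires > GC_MAX)
            s->timer_expires = GC_MAX;
        s->timer_inc += GC_INC;
    } else {
        /* More than 10% freed: restart from the shortest delay. */
        s->timer_inc = GC_INC;
        s->timer_expires = GC_MIN;
    }
    return s->timer_expires;
}
```

With this rule a single freed entry no longer resets the timer, so a long garbage list with few releasable entries is scanned less and less often.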
     
  • Every user of the network device notifiers is either a protocol
    stack or a pseudo device. If a protocol stack that does not have
    support for multiple network namespaces receives an event for a
    device that is not in the initial network namespace it quite possibly
    can get confused and do the wrong thing.

    To avoid problems until all of the protocol stacks are converted
    this patch modifies all netdev event handlers to ignore events on
    devices that are not in the initial network namespace.

    As the rest of the code is made network namespace aware these
    checks can be removed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
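
The guard this commit adds to each event handler can be pictured with a toy model. All names below (`toy_net`, `toy_netdev`, `toy_init_net`, the return codes and the handler) are hypothetical stand-ins, not the kernel's notifier API; the sketch only shows the shape of the early return.

```c
/* Toy stand-ins for a network namespace and a device that belongs to one. */
struct toy_net { int unused; };
struct toy_netdev { struct toy_net *net; };

static struct toy_net toy_init_net; /* models the initial namespace */

enum { TOY_NOTIFY_DONE = 0, TOY_NOTIFY_OK = 1 };

static int toy_events_handled; /* counts events the handler acted on */

/* A handler that is not namespace aware yet: ignore foreign devices. */
static int toy_event_handler(struct toy_netdev *dev, unsigned long event)
{
    if (dev->net != &toy_init_net)
        return TOY_NOTIFY_DONE; /* not ours: do nothing */

    (void)event; /* a real handler would dispatch on the event type */
    toy_events_handled++;
    return TOY_NOTIFY_OK;
}
```

Once a handler becomes namespace aware, the early return is the only line that needs to go.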
     

08 Jun, 2007

1 commit


15 Feb, 2007

1 commit

  • After Al Viro (finally) succeeded in removing the sched.h #include in module.h
    recently, it makes sense again to remove other superfluous sched.h includes.
    There are quite a lot of files which include it but don't actually need
    anything defined in there. Presumably these includes were once needed for
    macros that used to live in sched.h, but moved to other header files in the
    course of cleaning it up.

    To ease the pain, this time I did not fiddle with any header files and only
    removed #includes from .c-files, which tend to cause less trouble.

    Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
    arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
    allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
    configs in arch/arm/configs on arm. I also checked that no new warnings were
    introduced by the patch (actually, some warnings are removed that were emitted
    by unnecessarily included header files).

    Signed-off-by: Tim Schmielau
    Acked-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Schmielau
     

12 Feb, 2007

2 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (45 commits)
    [IPV4]: Restore multipath routing after rt_next changes.
    [XFRM] IPV6: Fix outbound RO transformation which is broken by IPsec tunnel patch.
    [NET]: Reorder fields of struct dst_entry
    [DECNET]: Convert decnet route to use the new dst_entry 'next' pointer
    [IPV6]: Convert ipv6 route to use the new dst_entry 'next' pointer
    [IPV4]: Convert ipv4 route to use the new dst_entry 'next' pointer
    [NET]: Introduce union in struct dst_entry to hold 'next' pointer
    [DECNET]: fix misannotation of linkinfo_dn
    [DECNET]: FRA_{DST,SRC} are le16 for decnet
    [UDP]: UDP can use sk_hash to speedup lookups
    [NET]: Fix whitespace errors.
    [NET] XFRM: Fix whitespace errors.
    [NET] X25: Fix whitespace errors.
    [NET] WANROUTER: Fix whitespace errors.
    [NET] UNIX: Fix whitespace errors.
    [NET] TIPC: Fix whitespace errors.
    [NET] SUNRPC: Fix whitespace errors.
    [NET] SCTP: Fix whitespace errors.
    [NET] SCHED: Fix whitespace errors.
    [NET] RXRPC: Fix whitespace errors.
    ...

    Linus Torvalds
     
  • Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
    corresponding "kmem_cache_zalloc()" call.

    Signed-off-by: Robert P. J. Day
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: Roland McGrath
    Cc: James Bottomley
    Cc: Greg KH
    Acked-by: Joel Becker
    Cc: Steven Whitehouse
    Cc: Jan Kara
    Cc: Michael Halcrow
    Cc: "David S. Miller"
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
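
A user-space analogy of this replacement, using `malloc()` + `memset()` versus `calloc()` rather than the slab allocator; `struct entry` and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct entry {
    int refcnt;
    void *next;
};

/* Two-step pattern: allocate, then clear by hand. */
static struct entry *entry_alloc_two_step(void)
{
    struct entry *e = malloc(sizeof(*e));
    if (e)
        memset(e, 0, sizeof(*e));
    return e;
}

/* One-step pattern: the allocator returns zeroed memory. */
static struct entry *entry_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct entry));
}
```

Both return an all-zero object or NULL; the one-step form cannot forget the clear or get the size wrong in the second call.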
     

11 Feb, 2007

1 commit


09 Feb, 2007

1 commit

  • This patch introduces users of the round_jiffies() function in the
    networking code.

    These timers all were of the "about once a second" or "about once
    every X seconds" variety and several showed up in the "what wakes the
    cpu up" profiles that the tickless patches provide. Some timers are
    highly dynamic based on network load; but even on low activity systems
    they still show up so the rounding is done only in cases of low
    activity, allowing higher frequency timers in the high activity case.

    The various hardware watchdogs are an obvious case; they run every 2
    seconds but are otherwise not particular about exactly when they need
    to run.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Arjan van de Ven
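
The idea behind rounding timer expiries can be sketched as below. This is a deliberately simplified illustration, not the kernel's round_jiffies() algorithm: `TOY_HZ` and the function name are hypothetical, and the sketch always rounds up to the next whole second.

```c
/* Hypothetical tick rate: 1000 ticks per second. */
enum { TOY_HZ = 1000 };

/* Round an expiry up to the next whole-second boundary so that
 * unrelated "about once a second" timers fire on the same tick. */
static unsigned long toy_round_up_to_second(unsigned long ticks)
{
    unsigned long rem = ticks % TOY_HZ;

    return rem ? ticks + (TOY_HZ - rem) : ticks;
}
```

Timers that land on the same tick wake the CPU once instead of several times, which is what matters for the tickless profiles mentioned above.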
     

08 Dec, 2006

1 commit


09 Aug, 2006

1 commit

  • Patch from Dmitry Mishin:

    Replace add_timer() by mod_timer() in dst_run_gc
    in order to avoid BUG message.

    CPU1                            CPU2
    dst_run_gc() entered            dst_run_gc() entered
    spin_lock(&dst_lock)            .....
    del_timer(&dst_gc_timer)        fail to get lock
    ....                            mod_timer()

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Dmitry Mishin
     

10 Sep, 2005

1 commit


31 Jul, 2005

1 commit

  • The bug is evident once you have seen it. The dst gc timer was backed
    off whenever the gc queue was not empty. But this means that the timer
    quickly backs off if at least one destination remains in use. Normally
    the bug is invisible, because adding a new dst entry to the queue
    cancels the backoff. But it bites hard on destination cache overflow,
    when new destinations are not released for a long time, e.g. after an
    interface goes down.

    The fix is to cancel backoff when something was released.

    Signed-off-by: Denis Lunev
    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis Lunev
     

17 Apr, 2005

2 commits

  • When we are not the real parent of the dst (e.g., when we're xfrm_dst and
    the child is an rtentry), it may already be on the GC list.

    In fact the current code is buggy too: we need to check dst->flags before
    the dec, as dst may no longer be valid afterwards.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Herbert Xu
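
The ordering rule from this commit (read what you need from the object before dropping your reference) can be sketched with C11 atomics. `struct toy_dst`, `TOY_NOHASH` and `toy_put()` are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { TOY_NOHASH = 1 }; /* hypothetical flag bit */

struct toy_dst {
    atomic_int refcnt;
    int flags;
};

/* Drop one reference. The flag is sampled first, while our own
 * reference still guarantees that d is valid. Returns true if this
 * call released the last reference. */
static bool toy_put(struct toy_dst *d, bool *nohash)
{
    *nohash = (d->flags & TOY_NOHASH) != 0;      /* read before the dec */
    return atomic_fetch_sub(&d->refcnt, 1) == 1; /* old value 1: last ref */
    /* Unless we were the last holder, d may already be freed by
     * someone else here, so it must not be dereferenced again. */
}
```

The caller then acts on the saved `nohash` value instead of reading `d->flags` after the decrement.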
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds