10 Dec, 2014

1 commit

  • Since commit f8864972126899 ("ipv4: fix dst race in sk_dst_get()")
    DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
    reference on them when we are in rcu protected sections.

    Cc: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: Hannes Frederic Sowa
    Reviewed-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

26 Jun, 2014

1 commit

  • When IP route cache had been removed in linux-3.6, we broke assumption
    that dst entries were all freed after rcu grace period. DST_NOCACHE
    dst were supposed to be freed from dst_release(). But it appears
    we want to keep such dst around, either in UDP sockets or tunnels.

    In sk_dst_get() we need to make sure dst refcount is not 0
    before incrementing it, or else we might end up freeing a dst
    twice.

    DST_NOCACHE set on a dst does not mean this dst can not be attached
    to a socket or a tunnel.

    Then, before actual freeing, we need to observe a rcu grace period
    to make sure all other cpus can catch the fact the dst is no longer
    usable.
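
    The "do not increment a zero refcount" rule can be sketched in
    userspace C11 as below. This is an illustration only, not the kernel
    code: struct dst_sketch and dst_sketch_hold_safe() are hypothetical
    names, and the kernel itself offers atomic_inc_not_zero() for this
    pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dst entry; only the refcount matters here. */
struct dst_sketch {
    atomic_int refcnt;
};

/*
 * Take a reference only if the object is still live (refcnt > 0).
 * Returns false when the count has already dropped to zero, meaning
 * another CPU is tearing the object down and it must not be revived.
 */
static bool dst_sketch_hold_safe(struct dst_sketch *dst)
{
    assert(dst != NULL);

    int old = atomic_load(&dst->refcnt);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

    A caller that gets false back has to treat the cached pointer as
    stale and perform a fresh lookup instead of using it.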

    Signed-off-by: Eric Dumazet
    Reported-by: Dormando
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Apr, 2014

1 commit

  • In the dst->output() path for ipv4, the code assumes the skb it has to
    transmit is attached to an inet socket, specifically via
    ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
    provider of the packet is an AF_PACKET socket.

    The dst->output() method gets an additional 'struct sock *sk'
    parameter. This needs a cascade of changes so that this parameter can
    be propagated from vxlan to final consumer.

    Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
    Reported-by: lucien xin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 May, 2013

1 commit

  • So far, only net_device * could be passed along with netdevice notifier
    event. This patch provides a possibility to pass custom structure
    able to provide info that event listener needs to know.

    Signed-off-by: Jiri Pirko

    v2->v3: fix typo on simeth
    shortened dev_getter
    shortened notifier_info struct name
    v1->v2: fix notifier_call parameter in call_netdevice_notifier()
    Signed-off-by: David S. Miller

    Jiri Pirko
     

02 Apr, 2013

1 commit

  • Rename skb_dst_set_noref to __skb_dst_set_noref and
    add force flag as suggested by David Miller. The new wrapper
    skb_dst_set_noref_force will force dst entries that are not
    cached to be attached as skb dst without taking reference
    as long as provided dst is reclaimed after RCU grace period.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Acked-by: David S. Miller
    Signed-off-by: Simon Horman

    Julian Anastasov
     

21 Feb, 2013

1 commit

  • Eric Dumazet wrote:
    | Some strange crashes happen in rt6_check_expired(), with access
    | to random addresses.
    |
    | At first glance, it looks like the RTF_EXPIRES and
    | stuff added in commit 1716a96101c49186b
    | (ipv6: fix problem with expired dst cache)
    | are racy : same dst could be manipulated at the same time
    | on different cpus.
    |
    | At some point, our stack believes rt->dst.from contains a dst pointer,
    | while it's really a jiffies value (as rt->dst.expires shares the same area
    | of memory)
    |
    | rt6_update_expires() should be fixed, or am I missing something ?
    |
    | CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060

    Because we do not have any locks for dst_entry, we cannot change
    essential structure in the entry; e.g., we cannot change reference
    to other entity.

    To fix this issue, split the 'from' and 'expires' fields in dst_entry
    out of the union. Once 'from' is assigned in the constructor,
    keep the reference until the very last stage of the lifetime of
    the object.

    Of course, it is unsafe to change 'from', so keep rt6_set_from simple
    and use it only for fresh entries.
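
    A minimal userspace sketch of the layout change, assuming a
    simplified entry with only the two fields in question. The names
    route_sketch, route_sketch_init() and route_sketch_set_expires()
    are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * With a union, 'from' and 'expires' would share storage, so a
 * concurrent expiry update could turn the pointer into a jiffies value.
 * Separate members remove that aliasing.
 */
struct route_sketch {
    struct route_sketch *from;   /* set once, at construction time */
    unsigned long expires;       /* may be updated later */
};

/* Constructor: the only place where 'from' is assigned. */
static void route_sketch_init(struct route_sketch *rt,
                              struct route_sketch *from)
{
    assert(rt != NULL);
    rt->from = from;
    rt->expires = 0;
}

/* Updating the expiry can no longer clobber 'from'. */
static void route_sketch_set_expires(struct route_sketch *rt,
                                     unsigned long when)
{
    rt->expires = when;
}
```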

    Reported-by: Eric Dumazet
    Reported-by: Neil Horman
    CC: Gao Feng
    Signed-off-by: YOSHIFUJI Hideaki
    Reviewed-by: Eric Dumazet
    Reported-by: Steinar H. Gunderson
    Reviewed-by: Neil Horman
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki / 吉藤英明
     

03 Oct, 2012

2 commits

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

23 Aug, 2012

1 commit

  • I noticed extra one second delay in device dismantle, tracked down to
    a call to dst_dev_event() while some call_rcu() are still in RCU queues.

    These call_rcu() were posted by rt_free(struct rtable *rt) calls.

    We then wait a little (but one second) in netdev_wait_allrefs() before
    kicking again NETDEV_UNREGISTER.

    As the call_rcu() are now completed, dst_dev_event() can do the needed
    device swap on busy dst.

    To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
    after a rcu_barrier(), but outside of RTNL lock.

    Use NETDEV_UNREGISTER_FINAL with care !

    Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

    Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after
    IP cache removal.

    With help from Gao feng

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straight-forward. Ones worth mentioning are,

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    convert the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that round_jiffies() calls in this function are
    meaningless. round_jiffies() work on absolute jiffies not delta
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
     

09 Aug, 2012

1 commit

  • While investigating on network performance problems, I found this little
    gem :

    $ nm -v vmlinux | grep -1 dst_default_metrics
    ffffffff82736540 b busy.46605
    ffffffff82736560 B dst_default_metrics
    ffffffff82736598 b dst_busy_list

    Apparently, declaring a const array without initializer put it in
    (writeable) bss section, in middle of possibly often dirtied cache
    lines.

    Since we really want dst_default_metrics be const to avoid any possible
    false sharing and catch any buggy writes, I force a null initializer.

    ffffffff818a4c20 R dst_default_metrics

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2012

1 commit


05 Jul, 2012

2 commits


06 Dec, 2011

1 commit


10 Aug, 2011

1 commit


18 Jul, 2011

1 commit


14 Jul, 2011

1 commit

  • Now that there is a one-to-one correspondence between neighbour
    and hh_cache entries, we no longer need:

    1) dynamic allocation
    2) attachment to dst->hh
    3) refcounting

    Initialization of the hh_cache entry is indicated by hh_len
    being non-zero, and such initialization is always done with
    the neighbour's lock held as a writer.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Jul, 2011

1 commit

  • IPV6, unlike IPV4, doesn't have a routing cache.

    Routing table entries, as well as clones made in response
    to route lookup requests, all live in the same table. And
    all of these things are together collected in the destination
    cache table for ipv6.

    This means that routing table entries count against the garbage
    collection limits, even though such entries cannot ever be reclaimed
    and are added explicitly by the administrator (rather than being
    created in response to lookups).

    Therefore it makes no sense to count ipv6 routing table entries
    against the GC limits.

    Add a DST_NOCOUNT destination cache entry flag, and skip the counting
    if it is set. Use this flag bit in ipv6 when adding routing table
    entries.
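
    The accounting rule can be sketched as follows. This is a simplified
    userspace model: the flag value, the struct layouts and the helper
    names are assumptions made for illustration, not the kernel's
    definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DST_NOCOUNT 0x1u   /* hypothetical flag value */

struct dst_ops_sketch {
    long entries;                 /* what the GC limits are compared against */
};

struct dst_entry_sketch {
    unsigned int flags;
    struct dst_ops_sketch *ops;
};

/* Account a new entry unless the caller asked for it not to be counted. */
static void dst_sketch_account_alloc(struct dst_ops_sketch *ops,
                                     struct dst_entry_sketch *dst,
                                     unsigned int flags)
{
    assert(ops != NULL && dst != NULL);
    dst->ops = ops;
    dst->flags = flags;
    if (!(flags & SKETCH_DST_NOCOUNT))
        ops->entries++;
}

/* The release side must mirror the same test to keep the count balanced. */
static void dst_sketch_account_free(struct dst_entry_sketch *dst)
{
    if (!(dst->flags & SKETCH_DST_NOCOUNT))
        dst->ops->entries--;
}
```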

    Signed-off-by: David S. Miller

    David S. Miller
     

25 May, 2011

1 commit


21 May, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
    macvlan: fix panic if lowerdev in a bond
    tg3: Add braces around 5906 workaround.
    tg3: Fix NETIF_F_LOOPBACK error
    macvlan: remove one synchronize_rcu() call
    networking: NET_CLS_ROUTE4 depends on INET
    irda: Fix error propagation in ircomm_lmp_connect_response()
    irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
    irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
    be2net: Kill set but unused variable 'req' in lancer_fw_download()
    irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
    atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
    rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
    pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
    isdn: capi: Use pr_debug() instead of ifdefs.
    tg3: Update version to 3.119
    tg3: Apply rx_discards fix to 5719/5720
    ...

    Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
    as per Davem.

    Linus Torvalds
     
  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 May, 2011

1 commit

  • It's way past its usefulness. And this gets rid of a bunch
    of stray ->rt_{dst,src} references.

    Even the comment documenting the macro was inaccurate (stated
    default was 1 when it's 0).

    If reintroduced, it should be done properly, with dynamic debug
    facilities.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Apr, 2011

2 commits


18 Feb, 2011

1 commit


29 Jan, 2011

1 commit


27 Jan, 2011

1 commit

  • Routing metrics are now copy-on-write.

    Initially a route entry points its metrics at a read-only location.
    If a routing table entry exists, it will point there. Else it will
    point at the all zero metric place-holder called 'dst_default_metrics'.

    The writeability state of the metrics is stored in the low bits of the
    metrics pointer, we have two bits left to spare if we want to store
    more states.

    For the initial implementation, COW is implemented simply via kmalloc.
    However future enhancements will change this to place the writable
    metrics somewhere else, in order to increase sharing. Very likely
    this "somewhere else" will be the inetpeer cache.

    Note also that this means that metrics updates may transiently fail
    if we cannot COW the metrics successfully.
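
    The pointer-tagging and copy-on-write scheme described above can be
    sketched in userspace C as below. METRICS_MAX, the flag names and
    all helper functions are hypothetical; the sketch only assumes that
    metrics arrays are at least 4-byte aligned, so the two low pointer
    bits are free to carry state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define METRICS_MAX        4                  /* stand-in for RTAX_MAX */
#define METRICS_READ_ONLY  ((uintptr_t)0x1)   /* state kept in low bit */
#define METRICS_FLAGS      ((uintptr_t)0x3)   /* two low bits reserved */

/* Shared all-zero placeholder that every fresh entry points at. */
static const uint32_t default_metrics[METRICS_MAX] = { 0 };

struct dst_sketch {
    uintptr_t metrics;            /* pointer value plus flag bits */
};

static const uint32_t *metrics_ptr(const struct dst_sketch *dst)
{
    return (const uint32_t *)(dst->metrics & ~METRICS_FLAGS);
}

static void metrics_init_ro(struct dst_sketch *dst, const uint32_t *shared)
{
    assert(((uintptr_t)shared & METRICS_FLAGS) == 0);
    dst->metrics = (uintptr_t)shared | METRICS_READ_ONLY;
}

/* Returns 0 on success, -1 if the private copy could not be allocated. */
static int metric_set(struct dst_sketch *dst, int idx, uint32_t val)
{
    assert(idx >= 0 && idx < METRICS_MAX);
    if (dst->metrics & METRICS_READ_ONLY) {
        uint32_t *copy = malloc(sizeof(default_metrics));

        if (copy == NULL)
            return -1;            /* the update fails transiently */
        memcpy(copy, metrics_ptr(dst), sizeof(default_metrics));
        dst->metrics = (uintptr_t)copy;   /* read-only bit now clear */
    }
    ((uint32_t *)(dst->metrics & ~METRICS_FLAGS))[idx] = val;
    return 0;
}

static void metrics_destroy(struct dst_sketch *dst)
{
    if (!(dst->metrics & METRICS_READ_ONLY))
        free((void *)(dst->metrics & ~METRICS_FLAGS));
}
```

    Entries that are never written keep sharing the read-only array,
    which is where the memory and cache-locality gain comes from.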

    But even by itself, this patch should decrease memory usage and
    increase cache locality especially for routing workloads. In those
    cases the read-only metric copies stay in place and never get written
    to.

    TCP workloads where metrics get updated, and those rare cases where
    PMTU triggers occur, will take a very slight performance hit. But
    that hit will be alleviated when the long-term writable metrics
    move to a more sharable location.

    Since the metrics storage went from a u32 array of RTAX_MAX entries to
    what is essentially a pointer, some retooling of the dst_entry layout
    was necessary.

    Most importantly, we need to preserve the alignment of the reference
    count so that it doesn't share cache lines with the read-mostly state,
    as per Eric Dumazet's alignment assertion checks.

    The only non-trivial bit here is the move of the 'flags' member into
    the writeable cacheline. This is OK since we are always accessing the
    flags around the same moment when we made a modification to the
    reference count.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Nov, 2010

1 commit

  • Followup of commit ef885afbf8a37689 (net: use rcu_barrier() in
    rollback_registered_many)

    dst_dev_event() scans a garbage dst list that might be fed by various
    network notifiers at device dismantle time.

    It's important to call dst_dev_event() after other notifiers, or we might
    enter the infamous msleep(250) in netdev_wait_allrefs(), and wait one
    second before calling again call_netdevice_notifiers(NETDEV_UNREGISTER,
    dev) to properly remove last device references.

    Use priority -10 to let dst_dev_notifier be called after other network
    notifiers (they have the default 0 priority)
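
    How a lower priority makes a handler run later can be sketched with
    a priority-sorted list. This is a userspace model with hypothetical
    names (notifier_sketch and friends); it only mirrors the general
    ordering rule that higher priority values are called first.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_sketch {
    int priority;                 /* higher values are called first */
    void (*call)(int event);
    struct notifier_sketch *next;
};

/* Insert so that the list stays sorted by descending priority. */
static void notifier_sketch_register(struct notifier_sketch **head,
                                     struct notifier_sketch *n)
{
    assert(n != NULL && n->call != NULL);
    while (*head != NULL && (*head)->priority >= n->priority)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

static void notifier_sketch_call_chain(struct notifier_sketch *head,
                                       int event)
{
    for (; head != NULL; head = head->next)
        head->call(event);
}

/* Example handlers that log the order in which they ran. */
static int call_log[8];
static int call_log_len;

static void other_handler(int event)
{
    (void)event;
    call_log[call_log_len++] = 0;
}

static void dst_handler(int event)
{
    (void)event;
    call_log[call_log_len++] = -10;
}
```

    Even when the priority -10 handler is registered first, a handler
    with the default priority 0 is placed ahead of it in the chain.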

    Reported-by: Ben Greear
    Reported-by: Nicolas Dichtel
    Reported-by: Octavian Purdila
    Reported-by: Benjamin LaHaise
    Tested-by: Ben Greear
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Oct, 2010

1 commit

  • There is no point using RCU for dst we allocate for a very short time
    (used once).

    Change dst_release() to take DST_NOCACHE into account, but also change
    skb_dst_set_noref() to force a refcount increment for such dst.

    This is a _huge_ gain, because we don't waste memory to store xx thousand
    of dsts. Instead of queueing them to RCU, we can free them instantly.

    CPU caches can stay hot, re-using same memory blocks to hold temporary
    dsts.

    Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
    since atomic_dec_return() implies a full memory barrier.
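
    The release rule can be sketched in userspace C11 as below. It is a
    simplified model with hypothetical names: entries flagged as
    non-cached are freed as soon as the last reference goes away, while
    other entries are left for a deferred collector that is not modeled
    here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SKETCH_DST_NOCACHE 0x1u   /* hypothetical flag value */

struct dst_sketch {
    atomic_int refcnt;
    unsigned int flags;
};

static int freed_now;             /* counts immediate frees, for illustration */

static struct dst_sketch *dst_sketch_alloc(unsigned int flags)
{
    struct dst_sketch *dst = malloc(sizeof(*dst));

    if (dst != NULL) {
        atomic_init(&dst->refcnt, 1);
        dst->flags = flags;
    }
    return dst;
}

static void dst_sketch_release(struct dst_sketch *dst)
{
    /* atomic_fetch_sub() returns the old value, so subtract one more. */
    int newref = atomic_fetch_sub(&dst->refcnt, 1) - 1;

    assert(newref >= 0);
    if (newref == 0 && (dst->flags & SKETCH_DST_NOCACHE)) {
        free(dst);                /* short-lived entry: free right away */
        freed_now++;
    }
}
```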

    Stress test, 160.000.000 udp frames sent, IP route cache disabled
    (DDOS).

    Before:

    real 0m38.091s
    user 0m13.189s
    sys 7m53.018s

    After:

    real 0m29.946s
    user 0m12.157s
    sys 7m40.605s

    For reference, if IP route cache was enabled :

    real 0m32.030s
    user 0m10.521s
    sys 8m15.243s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Oct, 2010

2 commits

  • struct dst_ops tracks number of allocated dst in an atomic_t field,
    subject to high cache line contention in stress workload.

    Switch to a percpu_counter, to reduce number of time we need to dirty a
    central location. Place it on a separate cache line to avoid dirtying
    read only fields.
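
    The idea behind a per-CPU counter can be sketched as below. This is
    a single-threaded userspace model with hypothetical names and an
    arbitrary batch size; the locking a real implementation needs when
    folding into the shared total, and the cache-line placement
    mentioned above, are left out.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4
#define SKETCH_BATCH   32         /* arbitrary value chosen for the sketch */

struct percpu_counter_sketch {
    long count;                   /* shared, approximate total */
    long local[SKETCH_NR_CPUS];   /* per-CPU deltas not yet folded in */
};

/* Most updates touch only the caller's own slot. */
static void pcs_add(struct percpu_counter_sketch *c, int cpu, long amount)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);

    long v = c->local[cpu] + amount;

    if (v >= SKETCH_BATCH || v <= -SKETCH_BATCH) {
        c->count += v;            /* rare write to the shared location */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Cheap read: may lag by up to SKETCH_BATCH - 1 per CPU. */
static long pcs_read_fast(const struct percpu_counter_sketch *c)
{
    return c->count;
}

/* Exact read: walks every per-CPU slot. */
static long pcs_sum(const struct percpu_counter_sketch *c)
{
    long sum = c->count;

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += c->local[cpu];
    return sum;
}
```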

    Stress test :

    (Sending 160.000.000 UDP frames,
    IP route cache disabled, dual E5540 @2.53GHz,
    32bit kernel, FIB_TRIE, SLUB/NUMA)

    Before:

    real 0m51.179s
    user 0m15.329s
    sys 10m15.942s

    After:

    real 0m45.570s
    user 0m15.525s
    sys 9m56.669s

    With a small reordering of struct neighbour fields, subject of a
    following patch, (to separate refcnt from other read mostly fields)

    real 0m41.841s
    user 0m15.261s
    sys 8m45.949s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When a new dst is used to send a frame, neigh_resolve_output() tries to
    associate a struct hh_cache to this dst, calling neigh_hh_init() with
    the neigh rwlock write locked.

    Most of the time, hh_cache is already known and linked into neighbour,
    so we find it and increment its refcount.

    This patch changes the logic so that we call neigh_hh_init() with
    neighbour lock read locked only, so that fast path can be run in
    parallel by concurrent cpus.

    This brings part of the speedup we got with commit c7d4426a98a5f
    (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
    removing one of the contention point that routers hit on multiqueue
    enabled machines.

    Further improvements would need to use a seqlock instead of an rwlock to
    protect neigh->ha[], to not dirty neigh too often and remove two atomic
    ops.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2010

1 commit


13 Apr, 2010

1 commit


12 Apr, 2010

1 commit


31 Mar, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

09 Feb, 2010

1 commit

  • Kernel bugzilla #15239

    On some workloads, it is quite possible to get a huge dst list to
    process in dst_gc_task(), and trigger soft lockup detection.

    Fix is to call cond_resched(), as we run in process context.

    Reported-by: Pawel Staszewski
    Tested-by: Pawel Staszewski
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2008

1 commit

  • During tbench/oprofile sessions, I found that dst_release() was in third position.

    CPU: Core 2, speed 2999.68 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples  %        symbol name
    483726   9.0185   __copy_user_zeroing_intel
    191466   3.5697   __copy_user_intel
    185475   3.4580   dst_release
    175114   3.2648   ip_queue_xmit
    153447   2.8608   tcp_sendmsg
    108775   2.0280   tcp_recvmsg
    102659   1.9140   sysenter_past_esp
    101450   1.8914   tcp_current_mss
    95067    1.7724   __copy_from_user_ll
    86531    1.6133   tcp_transmit_skb

    Of course, all CPUS fight on the dst_entry associated with 127.0.0.1

    Instead of first checking the refcount value, then decrement it,
    we use atomic_dec_return() to help CPU to make the right memory transaction
    (ie getting the cache line in exclusive mode)
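
    The difference between the two shapes can be sketched with C11
    atomics. This is a userspace illustration with hypothetical names;
    it shows only the access pattern and makes no claim about what a
    given CPU does with the cache line.

```c
#include <assert.h>
#include <stdatomic.h>

struct obj_sketch {
    atomic_int refcnt;
};

/* Old shape: a read for the sanity check, then a separate decrement. */
static void release_two_step(struct obj_sketch *o)
{
    assert(atomic_load(&o->refcnt) >= 1);   /* first access: plain read */
    atomic_fetch_sub(&o->refcnt, 1);        /* second access: read-modify-write */
}

/* New shape: a single read-modify-write, checked on its result. */
static int release_one_step(struct obj_sketch *o)
{
    /* atomic_fetch_sub() returns the old value, so subtract one more. */
    int newref = atomic_fetch_sub(&o->refcnt, 1) - 1;

    assert(newref >= 0);
    return newref;
}
```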

    dst_release() is now at the fifth position, and tbench a little bit faster ;)

    CPU: Core 2, speed 3000.1 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples  %        symbol name
    647107   8.8072   __copy_user_zeroing_intel
    258840   3.5229   ip_queue_xmit
    258302   3.5155   __copy_user_intel
    209629   2.8531   tcp_sendmsg
    165632   2.2543   dst_release
    149232   2.0311   tcp_current_mss
    147821   2.0119   tcp_recvmsg
    137893   1.8767   sysenter_past_esp
    127473   1.7349   __copy_from_user_ll
    121308   1.6510   ip_finish_output
    118510   1.6129   tcp_transmit_skb
    109295   1.4875   tcp_v4_rcv

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Sep, 2008

1 commit

  • The dst garbage collector dst_gc_task() may not be scheduled as we
    expect it to be in __dst_free().

    Indeed, when the dst_gc_timer was replaced by the delayed_work
    dst_gc_work, the mod_timer() call used to schedule the garbage
    collector at an earlier date was replaced by a schedule_delayed_work()
    (see commit 86bba269d08f0c545ae76c90b56727f65d62d57f).

    But, the behaviour of mod_timer() and schedule_delayed_work() is
    different in the way they handle the delay.

    mod_timer() stops the timer and re-arms it with the new given delay,
    whereas schedule_delayed_work() only checks if the work is already
    queued in the workqueue (and queues it (with delay) if it is not)
    BUT it does NOT take into account the new delay (even if the new delay
    is earlier in time).
    schedule_delayed_work() returns 0 if it didn't queue the work,
    but we don't check the return code in __dst_free().

    If I understand the code in __dst_free() correctly, we want dst_gc_task
    to be queued after DST_GC_INC jiffies if we pass the test (and not in
    some undetermined time in the future), so I think we should add a call
    to cancel_delayed_work() before schedule_delayed_work(). Patch below.

    Or we should at least test the return code of schedule_delayed_work(),
    and reset the values of dst_garbage.timer_inc and dst_garbage.timer_expires
    back to their former values if schedule_delayed_work() failed.
    Otherwise the subsequent calls to __dst_free will test the wrong values
    and assume wrong thing about when the garbage collector is supposed to
    be scheduled.

    dst_gc_task() also calls schedule_delayed_work() without checking
    its return code (or calling cancel_scheduled_work() first), but it
    should be fine there: dst_gc_task is the routine of the delayed_work, so
    no dst_gc_work should be pending in the queue when it's running.
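
    The behavioural difference described above can be modeled in a few
    lines of userspace C. The struct and function names are
    hypothetical; they only mimic the semantics discussed in this
    message (queue-if-not-pending versus always re-arm) and are not the
    kernel implementations.

```c
#include <assert.h>
#include <stdbool.h>

struct delayed_work_sketch {
    bool pending;
    unsigned long expires;        /* absolute time, in ticks */
};

/* Queue only if not already pending; a new, earlier delay is ignored. */
static bool sketch_schedule(struct delayed_work_sketch *w,
                            unsigned long now, unsigned long delay)
{
    if (w->pending)
        return false;             /* nothing queued, old deadline kept */
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Returns true if a pending item was removed. */
static bool sketch_cancel(struct delayed_work_sketch *w)
{
    bool was_pending = w->pending;

    w->pending = false;
    return was_pending;
}

/* Timer-like behaviour: always re-arm with the new deadline. */
static void sketch_mod(struct delayed_work_sketch *w,
                       unsigned long now, unsigned long delay)
{
    w->pending = true;
    w->expires = now + delay;
}
```

    Cancelling before scheduling, as proposed in the patch, gives the
    same resulting deadline as the timer-like variant in this model.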

    Signed-off-by: Benjamin Thery
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Benjamin Thery