11 Oct, 2007

2 commits

  • For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
    to the same ethernet device, the device's queue lock is contended by all
    N CPUs constantly. The TX lock is only contended by a maximum of 2 CPUs.
    In the current mode of operation, after all the work of entering the
    dequeue region, we may end up aborting the path if we are unable to get
    the TX lock, and go back to contend for the queue lock. As N goes up,
    this gets worse.

    The changes in this patch result in a small increase in performance
    on a 4-CPU (2x dual-core) system with no IRQ binding. Both e1000 and
    tg3 showed similar behavior.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Several devices have multiple independent RX queues per net
    device, and some have a single interrupt doorbell for several
    queues.

    In either case, it's easier to support layouts like that if the
    structure representing the poll is independent from the net
    device itself.

    The signature of the ->poll() callback goes from:

    int foo_poll(struct net_device *dev, int *budget)

    to

    int foo_poll(struct napi_struct *napi, int budget)

    The caller is returned the number of RX packets processed (or
    the number of "NAPI credits" consumed, if you want to get
    abstract). The callee no longer has to mess around bumping
    dev->quota, *budget, etc., because that is all handled in the
    caller upon return.

    The napi_struct is to be embedded in the device driver private data
    structures.

    Furthermore, it is the driver's responsibility to disable all NAPI
    instances in its ->stop() device close handler. Since the
    napi_struct is privatized into the driver's private data structures,
    only the driver knows how to get at all of the napi_struct instances
    it may have per-device.

    With lots of help and suggestions from Rusty Russell, Roland Dreier,
    Michael Chan, Jeff Garzik, and Jamal Hadi Salim.

    Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
    Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.

    [ Ported to current tree and all drivers converted. Integrated
    Stephen's follow-on kerneldoc additions, and restored poll_list
    handling to the old style to fix mutual exclusion issues. -DaveM ]

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

11 Jul, 2007

5 commits

  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • The generic estimator is always built in anyway, and all the config
    option does is prevent including a minimal amount of code for setting
    it up. Additionally, the option is already automatically selected in
    most cases.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Changes:

    - netif_queue_stopped need not be called inside qdisc_restart, as
    it has already been called in qdisc_run() before the first skb
    is sent, and in __qdisc_run() after each intermediate skb is
    sent (note: we are the only sender, so the queue cannot get
    stopped while the TX lock is held in the ~LLTX case).

    - BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
    meant more packets are available, and __qdisc_run used to loop
    when qdisc_restart() returned -1. In those days it was necessary
    to make sure that qlen never went below zero, since __qdisc_run
    would get into an infinite loop if no packets were on the queue
    while this qdisc bug was present (and worse, no more skbs could
    ever get queued, as we hold the queue lock too). With Herbert's
    recent change to the return values, this check is not required.
    Hopefully Herbert can validate this change. If it is required at
    all, it should be added to skb_dequeue (in the failure case), and
    not to qdisc_qlen.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     
  • New changes:

    - Incorporated Peter Waskiewicz's comments.
    - Re-added one warning message (on the driver returning a wrong value).

    Previous changes:

    - Converted to switch/case code, which looks neater.

    - "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
    check should be removed, since driver will return NETDEV_TX_LOCKED only
    if lockless is true and driver has to do the locking. In the original
    code as well as the latest code, this code can result in a bug where
    if LLTX is not set for a driver (lockless == 0) but the driver is written
    wrongly to do a trylock (despite LLTX being set), the driver returns
    LOCKED. But since lockless is zero, the packet is requeue'd instead of
    calling collision code which will issue warning and free up the skb.
    Instead this skb will be retried with this driver next time, and the same
    result will ensue. Removing this check will catch these driver bugs instead
    of hiding the problem. I am keeping this change to readability section
    since :
    a. it is confusing to check two things as it is; and
    b. it is difficult to keep this check in the changed 'switch' code.

    - Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
    the work being done and is easier to understand), and do_dev_requeue to
    dev_requeue_skb; merged handle_dev_cpu_collision and tx_islocked into
    dev_handle_collision (handle_dev_cpu_collision is a small routine with
    only one caller, so there is no need for two separate routines; this
    also gets rid of two macros, etc.).

    - Removed an XXX comment as it should never fail (I suspect this was
    related to the batch skb WIP, Jamal?). Converted some functions to the
    original coding style of having the return values and the function name
    on the same line, e.g. prio2list.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     
  • Over the years this code has gotten hairier, resulting in many long
    discussions over long summer days and patches that get it wrong.
    This patch helps tame that code so normal people will understand it.

    Thanks to Thomas Graf, Peter J. Waskiewicz Jr, and Patrick McHardy
    for their valuable reviews.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

04 Jun, 2007

1 commit


25 May, 2007

1 commit


11 May, 2007

4 commits

  • When we relinquish queue_lock in qdisc_restart and then retake it for
    requeueing, we might race against dev_deactivate and end up requeueing
    onto noop_qdisc. This causes a warning to be printed.

    This patch fixes this by checking this before we requeue. As an added
    bonus, we can remove the same check in __qdisc_run which was added to
    prevent dev->gso_skb from being requeued when we're shutting down.

    Even though we've had to add a new conditional in its place, it's better
    because it only happens on requeues rather than every single time
    qdisc_run is called.

    For this to work we also need to move the clearing of gso_skb up in
    dev_deactivate, as qdisc_restart can now occur even after we wait for
    __LINK_STATE_QDISC_RUNNING to clear (but it won't do anything as long
    as the queue and gso_skb are already clear).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Now that we return the queue length after NETDEV_TX_OK, we had better
    make sure that we have the right queue. Otherwise we can cause a
    stall after a really quick dev_deactivate/dev_activate.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The current return value scheme and associated comment was invented
    back in the 20th century when we still had that tbusy flag. Things
    have changed quite a bit since then (even Tony Blair is moving on
    now, not to mention the new French president).

    All we need to indicate now is whether the caller should continue
    processing the queue. Therefore it's sufficient if we return 0 if
    we want to stop and non-zero otherwise.

    This is based on a patch by Krishna Kumar.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • When transmit fails with NETDEV_TX_LOCKED, the skb is requeued
    to dev->qdisc again. The dev->qdisc pointer is protected by
    the queue lock, which needs to be dropped when attempting to
    transmit and acquired again before requeueing. The problem is
    that qdisc_restart() fetches the dev->qdisc pointer once and
    stores it in the `q' variable, which is invalidated when
    dropping the queue_lock; therefore the variable needs to be
    refreshed before requeueing.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

26 Apr, 2007

2 commits

  • Switch ingress queueing back to using ingress_lock. qdisc_lock_tree now
    locks both the ingress and egress qdiscs on the device. All changes to
    data that might be used on both ingress and egress need to be protected
    by using qdisc_lock_tree instead of manually taking dev->queue_lock.
    Additionally, the qdisc stats_lock needs to be initialized to
    ingress_lock for ingress qdiscs.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Since we're now holding the rtnl during the entire dump operation, we
    can remove qdisc_tree_lock, whose only purpose is to protect dump
    callbacks from concurrent changes to the qdisc tree.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

11 Feb, 2007

1 commit


09 Feb, 2007

1 commit

  • This patch introduces users of the round_jiffies() function in the
    networking code.

    These timers were all of the "about once a second" or "about once
    every X seconds" variety, and several showed up in the "what wakes the
    CPU up" profiles that the tickless patches provide. Some timers are
    highly dynamic based on network load; but even on low-activity systems
    they still show up, so the rounding is done only in cases of low
    activity, allowing higher-frequency timers in the high-activity case.

    The various hardware watchdogs are an obvious case; they run every 2
    seconds but aren't otherwise particular about exactly when they need
    to run.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Arjan van de Ven
     

03 Dec, 2006

2 commits


29 Sep, 2006

1 commit

  • The move of qdisc destruction to a rcu callback broke locking in the
    entire qdisc layer by invalidating previously valid assumptions about
    the context in which changes to the qdisc tree occur.

    The two assumptions were:

    - since changes only happen in process context, read_lock doesn't need
    bottom-half protection. Now invalid, since destruction of inner qdiscs,
    classifiers, actions and estimators happens in the RCU callback unless
    they're manually deleted, resulting in deadlocks when a read_lock in
    process context is interrupted by a write_lock_bh in bottom-half context.

    - since changes only happen under the RTNL, no additional locking is
    necessary for data not used during packet processing (e.g. u32_list).
    Again, since destruction now happens in the RCU callback, this assumption
    is no longer valid, causing races while using this data, which can
    result in corruption or use-after-free.

    Instead of "fixing" this by disabling bottem halfs everywhere and adding
    new locks/refcounting, this patch makes these assumptions valid again by
    moving destruction back to process context. Since only the dev->qdisc
    pointer is protected by RCU, but ->enqueue and the qdisc tree are still
    protected by dev->qdisc_lock, destruction of the tree can be performed
    immediately and only the final free needs to happen in the rcu callback
    to make sure dev_queue_xmit doesn't access already freed memory.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

18 Sep, 2006

1 commit

  • Fix lockdep warning with GRE, iptables and Speedtouch ADSL, PPP over ATM.

    On Sat, Sep 02, 2006 at 08:39:28PM +0000, Krzysztof Halasa wrote:
    >
    > =======================================================
    > [ INFO: possible circular locking dependency detected ]
    > -------------------------------------------------------
    > swapper/0 is trying to acquire lock:
    > (&dev->queue_lock){-+..}, at: [] dev_queue_xmit+0x56/0x290
    >
    > but task is already holding lock:
    > (&dev->_xmit_lock){-+..}, at: [] dev_queue_xmit+0x224/0x290
    >
    > which lock already depends on the new lock.

    This turns out to be a genuine bug. The queue lock and xmit lock are
    intentionally taken out of order. Two things are supposed to prevent
    deadlocks from occurring:

    1) When we hold the queue_lock we're supposed to only do try_lock on the
    tx_lock.

    2) We always drop the queue_lock after taking the tx_lock and before doing
    anything else.

    >
    > the existing dependency chain (in reverse order) is:
    >
    > -> #1 (&dev->_xmit_lock){-+..}:
    > [] lock_acquire+0x76/0xa0
    > [] _spin_lock_bh+0x31/0x40
    > [] dev_activate+0x69/0x120

    This path obviously breaks assumption 1) and therefore can lead to ABBA
    deadlocks.

    I've looked at the history and there seems to be no reason for the lock
    to be held at all in dev_watchdog_up. The lock appeared in day one and
    even there it was unnecessary. In fact, people added __dev_watchdog_up
    precisely in order to get around the tx lock there.

    The function dev_watchdog_up is already serialised by rtnl_lock since
    its only caller dev_activate is always called under it.

    So here is a simple patch to remove the tx lock from dev_watchdog_up.
    In 2.6.19 we can eliminate the unnecessary __dev_watchdog_up and
    replace it with dev_watchdog_up.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

22 Jul, 2006

1 commit


01 Jul, 2006

1 commit


23 Jun, 2006

2 commits

  • This patch adds the infrastructure for generic segmentation offload.
    The idea is to tap into the potential savings of TSO without hardware
    support by postponing the allocation of segmented skb's until just
    before the entry point into the NIC driver.

    The same structure can be used to support software IPv6 TSO, as well as
    UFO and segmentation offload for other relevant protocols, e.g., DCCP.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The dev_deactivate function has bit-rotted since the introduction of
    lockless drivers. In particular, the spin_unlock_wait call at the end
    has no effect on the xmit routine of lockless drivers.

    With a little bit of work, we can make it much more useful by providing
    the guarantee that when it returns, no more calls to the xmit routine
    of the underlying driver will be made.

    The idea is simple. There are two entry points in to the xmit routine.
    The first comes from dev_queue_xmit. That one is easily stopped by
    using synchronize_rcu. This works because we set the qdisc to noop_qdisc
    before the synchronize_rcu call. That in turn causes all subsequent
    packets sent to dev_queue_xmit to be dropped. The synchronize_rcu call
    also ensures all outstanding calls leave their critical section.

    The other entry point is from qdisc_run. Since we now have a bit that
    indicates whether it's running, all we have to do is to wait until the
    bit is off.

    I've removed the loop to wait for __LINK_STATE_SCHED to clear. This is
    useless because netif_wake_queue can cause it to be set again. It is
    also harmless because we've disarmed qdisc_run.

    I've also removed the spin_unlock_wait on xmit_lock, because its only
    purpose, making sure that all outstanding xmit_lock holders have
    exited, is also served by dev_watchdog_down.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

20 Jun, 2006

1 commit

  • Having two or more instances of qdisc_run contend against each other
    is bad because it can induce packet reordering if the packets have to
    be requeued. It appears that this is an unintended consequence of
    relinquishing the queue lock while transmitting, which in turn is
    needed for devices that spend a lot of time in their transmit routine.

    There are no advantages to be had, as devices with queues are inherently
    single-threaded (the loopback device is not, but then it doesn't have a
    queue).

    Even if you were to add a queue to a parallel virtual device (e.g., bolt
    a tbf filter in front of an ipip tunnel device), you would still want to
    process the queue in sequence to ensure that the packets are ordered
    correctly.

    The solution here is to steal a bit from net_device to prevent this.

    BTW, as qdisc_restart is no longer used by anyone as a module inside the
    kernel (IIRC it used to with netif_wake_queue), I have not exported the
    new __qdisc_run function.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

18 Jun, 2006

1 commit

  • Various drivers use xmit_lock internally to synchronise with their
    transmission routines. They do so without setting xmit_lock_owner.
    This is fine as long as netpoll is not in use.

    With netpoll it is possible for deadlocks to occur if xmit_lock_owner
    isn't set, because a printk that occurs while xmit_lock is held with
    xmit_lock_owner unset can cause netpoll to attempt to take xmit_lock
    recursively.

    While it is possible to resolve this by getting netpoll to use
    trylock, it is suboptimal because netpoll's sole objective is to
    maximise the chance of getting the printk out on the wire. So
    delaying or dropping the message is to be avoided as much as possible.

    So the only alternative is to always set xmit_lock_owner. The
    following patch does this by introducing the netif_tx_lock family of
    functions that take care of setting/unsetting xmit_lock_owner.

    I renamed xmit_lock to _xmit_lock to indicate that it should not be
    used directly. I didn't provide irq versions of the netif_tx_lock
    functions since xmit_lock is meant to be a BH-disabling lock.

    This is pretty much a straight text substitution except for a small
    bug fix in winbond. It currently uses
    netif_stop_queue/spin_unlock_wait to stop transmission. This is
    unsafe as an IRQ can potentially wake up the queue. So it is safer to
    use netif_tx_disable.

    The hamradio bits used spin_lock_irq, but that is unnecessary as
    xmit_lock must never be taken in an IRQ handler.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

17 May, 2006

1 commit


21 Mar, 2006

1 commit


30 Aug, 2005

1 commit

  • # grep -r 'netif_carrier_o[nf]' linux-2.6.12 | wc -l
    246

    # size vmlinux.org vmlinux.carrier
    text data bss dec hex filename
    4339634 1054414 259296 5653344 564360 vmlinux.org
    4337710 1054414 259296 5651420 563bdc vmlinux.carrier

    And this ain't an allyesconfig kernel!

    Signed-off-by: David S. Miller

    Denis Vlasenko
     

24 Aug, 2005

1 commit

  • qdisc_create_dflt() fails to destroy the newly allocated
    default qdisc if the initialization fails, resulting in leaks
    of all kinds. The only caller in mainline which may trigger
    this bug is sch_tbf.c in tbf_create_dflt_qdisc().

    Note: qdisc_create_dflt() doesn't fulfill the official locking
    requirements of qdisc_destroy() but since the qdisc could
    never be seen by the outside world this doesn't matter
    and it can stay as-is until the locking of pkt_sched
    is cleaned up.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

19 Jul, 2005

1 commit

  • The current call to __qdisc_dequeue_head leads to a branch
    misprediction on every loop iteration; the fact that the
    most common priority is 2 makes this even worse. This issue
    was brought up by Eric Dumazet,
    but unlike his solution, which was to manually unroll the loop,
    this approach preserves the possibility of increasing the number
    of bands at compile time.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

06 Jul, 2005

1 commit


19 Jun, 2005

4 commits


04 May, 2005

2 commits

  • Due to bugs in netem (fixed by later patches), it is possible to get qdisc
    qlen to go negative. If this happens the CPU ends up spinning forever
    in qdisc_run(). So add a BUG_ON() to trap it.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Some network drivers call netif_stop_queue() when detecting loss of
    carrier. This leads to packets being queued up at the qdisc level for
    an unbounded period of time. In order to prevent this effect, the core
    networking stack will now cease to queue packets for any device that
    is operationally down (i.e. the queue is flushed and disabled).

    Signed-off-by: Tommy S. Christensen
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Tommy S. Christensen
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds