16 Oct, 2007

1 commit


11 Oct, 2007

11 commits

  • The fourth parameter of /proc/net/psched is supposed to show the timer
    resolution and is used by HTB userspace to calculate the necessary
    burst rate. Currently we show the clock resolution, which results in a
    too low burst rate when the two differ.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Fix a bunch of sparse warnings. Mostly about 0 used as
    NULL pointer, and shadowed variable declarations.
    One notable case was that hash size should have been unsigned.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Stateless NAT is useful in controlled environments where restrictions are
    placed on through traffic such that we don't need connection tracking to
    correctly NAT protocol-specific data.

    In particular, this is of interest when the number of flows or the number
    of addresses being NATed is large, or if connection tracking information
    has to be replicated and where it is not practical to do so.

    Previously we had stateless NAT functionality which was integrated into
    the IPv4 routing subsystem. This was a great solution as long as the NAT
    worked on a subnet to subnet basis such that the number of NAT rules was
    relatively small. The reason is that for SNAT the routing based system
    had to perform a linear scan through the rules.

    If the number of rules is large then major renovations would have to take
    place in the routing subsystem to make this practical.

    For the time being, the least intrusive way of achieving this is to use
    the u32 classifier written by Alexey Kuznetsov along with the actions
    infrastructure implemented by Jamal Hadi Salim.

    The following patch is an attempt at this problem by creating a new nat
    action that can be invoked from u32 hash tables, which would allow a large
    number of stateless NAT rules to be used/updated in constant time.

    The actual NAT code is mostly based on the previous stateless NAT code
    written by Alexey. In future we might be able to utilise the protocol
    NAT code from netfilter to improve support for other protocols.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
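
A minimal sketch of the kind of stateless rewrite this commit describes: an address inside a matched prefix gets its network part replaced, and the checksum is patched incrementally (RFC 1624) rather than recomputed. The names `nat_rule`, `nat_apply`, `csum_replace32` and `csum_full` are hypothetical and are not taken from the patch; addresses are kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rule: map old_addr/mask onto new_addr/mask. */
struct nat_rule {
    uint32_t old_addr;  /* network part to match */
    uint32_t new_addr;  /* replacement network part */
    uint32_t mask;      /* prefix mask */
};

/* Fold a 32-bit sum into 16 bits, one's-complement style. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full one's-complement checksum over n 16-bit words (for comparison). */
uint16_t csum_full(const uint16_t *w, int n)
{
    uint32_t sum = 0;

    for (int i = 0; i < n; i++)
        sum += w[i];
    return (uint16_t)~csum_fold(sum);
}

/* RFC 1624 incremental update: HC' = ~(~HC + ~m + m'), per 16-bit word. */
uint16_t csum_replace32(uint16_t check, uint32_t from, uint32_t to)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~(from >> 16);
    sum += (uint16_t)~(from & 0xffff);
    sum += to >> 16;
    sum += to & 0xffff;
    return (uint16_t)~csum_fold(sum);
}

/* Returns 1 and rewrites *addr and *check if the rule matches, else 0. */
int nat_apply(const struct nat_rule *r, uint32_t *addr, uint16_t *check)
{
    uint32_t old = *addr;

    /* Both network parts must have no host bits set. */
    assert((r->old_addr & ~r->mask) == 0 && (r->new_addr & ~r->mask) == 0);
    if ((old & r->mask) != r->old_addr)
        return 0;
    *addr = r->new_addr | (old & ~r->mask);  /* keep the host part */
    *check = csum_replace32(*check, old, *addr);
    return 1;
}
```

No per-flow state is consulted or stored, which is what makes the rewrite stateless; finding the matching rule in constant time is left to the classifier (the u32 hash tables in the commit).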
     
  • Since hardware header operations are part of the protocol class
    not the device instance, make them into a separate object and
    save memory.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Add inline for common usage of hardware header creation, and
    fix a bug in IPv6 mcast, which assumed that a negative return is
    an errno. A negative return from hard_header means not enough space
    was available (i.e. -N bytes).

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • For N cpus, with full throttle traffic on all N CPUs, funneling traffic
    to the same ethernet device, the device's queue lock is contended by all
    N CPUs constantly. The TX lock is only contended by a max of 2 CPUs.
    In the current mode of operation, after all the work of entering the
    dequeue region, we may end up aborting the path if we are unable to get
    the tx lock and go back to contend for the queue lock. As N goes up,
    this gets worse.

    The changes in this patch result in a small increase in performance
    on a 4-CPU (2x dual-core) machine with no IRQ binding. Both e1000 and tg3
    showed similar behavior.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • It's been a useless no-op for long enough in 2.6 so I figured it's time to
    remove it. The number of people that could object because they're
    maintaining unified 2.4 and 2.6 drivers is probably rather small.

    [ Handled drivers added by netdev tree and some missed IRDA cases... -DaveM ]

    Signed-off-by: Ralf Baechle
    Signed-off-by: Jeff Garzik
    Signed-off-by: David S. Miller

    Ralf Baechle
     
  • Change L2T (length to time) macros, in all rate based schedulers, to
    call a common function qdisc_l2t() that does the rate table lookup.
    This function handles the case where the packet size lookup falls beyond
    the end of the rate table, which often occurs with TSO enabled.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
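
A rough sketch of a bounds-safe length-to-time lookup in the spirit of the qdisc_l2t() helper described above. The struct layout, the names `rate_table`, `l2t` and `rate_table_init`, and the extrapolation used for oversized packets are assumptions for illustration and are not copied from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define RTAB_SLOTS 256

/* Hypothetical, simplified rate table: data[i] is the transmit time
 * (in scheduler ticks) for a packet whose size falls in slot i. */
struct rate_table {
    uint32_t data[RTAB_SLOTS];
    unsigned int cell_log;   /* slot = pktlen >> cell_log */
};

/* Length-to-time lookup that stays in bounds for oversized packets. */
uint32_t l2t(const struct rate_table *rt, unsigned int pktlen)
{
    unsigned int slot;

    assert(rt->cell_log < 32);
    slot = pktlen >> rt->cell_log;
    if (slot >= RTAB_SLOTS)
        /* Assumption: extrapolate from whole-table multiples plus the
         * remainder slot instead of reading past the end of data[]. */
        return rt->data[RTAB_SLOTS - 1] * (slot / RTAB_SLOTS)
             + rt->data[slot % RTAB_SLOTS];
    return rt->data[slot];
}

/* Fill the table for a linear rate of ticks_per_cell ticks per slot. */
void rate_table_init(struct rate_table *rt, unsigned int cell_log,
                     uint32_t ticks_per_cell)
{
    rt->cell_log = cell_log;
    for (unsigned int i = 0; i < RTAB_SLOTS; i++)
        rt->data[i] = (i + 1) * ticks_per_cell;
}
```

With `cell_log = 3` the table covers packets up to 2047 bytes, so a 64 KB TSO super-packet lands far past slot 255; a single shared helper is the natural place to handle that once for every rate-based scheduler.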
     
  • This patch makes most of the generic device layer network
    namespace safe. This patch makes dev_base_head a
    network namespace variable, and then it picks up
    a few associated variables. The functions:
    dev_getbyhwaddr
    dev_getfirsthwbytype
    dev_get_by_flags
    dev_get_by_name
    __dev_get_by_name
    dev_get_by_index
    __dev_get_by_index
    dev_ioctl
    dev_ethtool
    dev_load
    wireless_process_ioctl

    were modified to take a network namespace argument, and
    deal with it.

    vlan_ioctl_set and brioctl_set were modified so their
    hooks will receive a network namespace argument.

    So basically anything in the core of the network stack that was
    affected by the change of dev_base was modified to handle
    multiple network namespaces. The rest of the network stack was
    simply modified to explicitly use &init_net, the initial network
    namespace. This can be fixed when those components of the network
    stack are modified to handle multiple network namespaces.

    For now the ifindex generator is left global.

    Fundamentally ifindex numbers are per namespace, or else
    we will have corner case problems with migration when
    we get that far.

    At the same time there are assumptions in the network stack
    that the ifindex of a network device won't change. Making
    the ifindex number global seems a good compromise until
    the network stack can cope with ifindex changes when
    you change namespaces, and the like.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
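
To illustrate the shape of the change, here is a small user-space sketch in which the device list hangs off a namespace object and the lookup takes that namespace as an explicit argument. `struct netns`, `struct device`, `register_device` and `get_by_name` are hypothetical simplifications, not the kernel's `struct net`, `struct net_device` or `dev_get_by_name`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a network device. */
struct device {
    const char *name;
    int ifindex;
    struct device *next;
};

/* Hypothetical namespace object owning its own device list. */
struct netns {
    struct device *dev_base_head;
};

/* The ifindex generator stays global, as the commit message describes. */
static int ifindex_counter;

/* Add a device to one namespace and give it a globally unique ifindex. */
void register_device(struct netns *net, struct device *dev)
{
    dev->ifindex = ++ifindex_counter;
    dev->next = net->dev_base_head;
    net->dev_base_head = dev;
}

/* The lookup searches only the namespace it is handed. */
struct device *get_by_name(const struct netns *net, const char *name)
{
    assert(net != NULL && name != NULL);
    for (struct device *d = net->dev_base_head; d != NULL; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

Code that has not been converted yet would simply pass the initial namespace everywhere, which corresponds to the `&init_net` arguments mentioned in the commit.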
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Several devices have multiple independent RX queues per net
    device, and some have a single interrupt doorbell for several
    queues.

    In either case, it's easier to support layouts like that if the
    structure representing the poll is independent from the net
    device itself.

    The signature of the ->poll() callback goes from:

    int foo_poll(struct net_device *dev, int *budget)

    to

    int foo_poll(struct napi_struct *napi, int budget)

    The caller is returned the number of RX packets processed (or
    the number of "NAPI credits" consumed if you want to get
    abstract). The callee no longer messes around bumping
    dev->quota, *budget, etc. because that is all handled in the
    caller upon return.

    The napi_struct is to be embedded in the device driver private data
    structures.

    Furthermore, it is the driver's responsibility to disable all NAPI
    instances in its ->stop() device close handler. Since the
    napi_struct is privatized into the driver's private data structures,
    only the driver knows how to get at all of the napi_struct instances
    it may have per-device.

    With lots of help and suggestions from Rusty Russell, Roland Dreier,
    Michael Chan, Jeff Garzik, and Jamal Hadi Salim.

    Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
    Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.

    [ Ported to current tree and all drivers converted. Integrated
    Stephen's follow-on kerneldoc additions, and restored poll_list
    handling to the old style to fix mutual exclusion issues. -DaveM ]

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
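
A user-space sketch of the new calling convention, assuming a hypothetical driver `foo` with one poll context per RX queue. `struct napi_ctx` and `struct foo_rx_queue` are illustrative stand-ins (the real `napi_struct` has more fields), and `container_of` is redefined locally so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the poll context. */
struct napi_ctx {
    int (*poll)(struct napi_ctx *napi, int budget);
};

/* Driver private data embeds one poll context per RX queue. */
struct foo_rx_queue {
    int pending;             /* packets waiting in this queue */
    struct napi_ctx napi;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* New-style poll: process at most `budget` packets and return the count. */
int foo_poll(struct napi_ctx *napi, int budget)
{
    /* Recover the enclosing queue from the embedded context. */
    struct foo_rx_queue *q = container_of(napi, struct foo_rx_queue, napi);
    int done = 0;

    assert(budget >= 0);
    while (done < budget && q->pending > 0) {
        q->pending--;        /* stand-in for receiving one packet */
        done++;
    }
    return done;             /* quota accounting is left to the caller */
}
```

Because the context is embedded in the per-queue structure, a device with several RX queues can register several contexts, and only the driver knows where they all live; that is why disabling them on close is the driver's job.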
     

08 Oct, 2007

1 commit


02 Oct, 2007

1 commit

  • This is a follow-up to Patrick's patch. A small optimization to the enqueue
    routine allows removing the artificial limitation on queue length.

    Plus, testing showed that the hash function used by SFQ is very poor:
    it does not even sweep the whole range of hash values.
    Switched to Jenkins' hash.

    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Alexey Kuznetsov
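
For illustration only, the sketch below hashes a flow's two addresses into a bucket with Bob Jenkins' one-at-a-time hash. This is a different member of the Jenkins family than the kernel's jhash helpers and is not the code from the patch; `jenkins_oaat` and `flow_bucket` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bob Jenkins' one-at-a-time hash over a byte string. */
uint32_t jenkins_oaat(const unsigned char *key, size_t len)
{
    uint32_t h = 0;

    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    /* Final avalanche so every input bit can affect every output bit. */
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Map a flow to one of `divisor` buckets; divisor must be a power of two. */
unsigned int flow_bucket(uint32_t saddr, uint32_t daddr, unsigned int divisor)
{
    unsigned char key[8];

    assert(divisor != 0 && (divisor & (divisor - 1)) == 0);
    for (int i = 0; i < 4; i++) {
        key[i] = (unsigned char)(saddr >> (8 * i));
        key[4 + i] = (unsigned char)(daddr >> (8 * i));
    }
    return jenkins_oaat(key, sizeof(key)) & (divisor - 1);
}
```

The final mixing steps are what let nearby inputs, such as consecutive addresses, spread over the whole bucket range instead of clustering in a few values.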
     

21 Sep, 2007

1 commit


17 Sep, 2007

1 commit


15 Sep, 2007

1 commit

  • (with no apologies to C Heston)

    On Mon, 2007-10-09 at 21:00 +0800, Herbert Xu wrote:
    On Sun, Sep 02, 2007 at 01:11:29PM +0000, Christian Kujau wrote:
    > >
    > > after upgrading to 2.6.23-rc5 (and applying davem's fix [0]), lockdep
    > > was quite noisy when I tried to shape my external (wireless) interface:
    > >
    > > [ 6400.534545] FahCore_78.exe/3552 just changed the state of lock:
    > > [ 6400.534713] (&dev->ingress_lock){-+..}, at: []
    > > netif_receive_skb+0x2d5/0x3c0
    > > [ 6400.534941] but this lock took another, soft-read-irq-unsafe lock in the
    > > past:
    > > [ 6400.535145] (police_lock){-.--}
    >
    > This is a genuine dead-lock. The police lock can be taken
    > for reading with softirqs on. If a second CPU tries to take
    > the police lock for writing, while holding the ingress lock,
    > then a softirq on the first CPU can dead-lock when it tries
    > to get the ingress lock.

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

31 Aug, 2007

1 commit

  • When CONFIG_NET_CLS_ACT is enabled, tc_classify() is called twice in
    prio_classify(). This causes "interesting" behaviour: with the setup
    below, packets are duplicated, sent twice to ifb0, and then loop in and
    out of ifb0.

    The patch uses the previously calculated return value in the switch,
    which is probably what Patrick had in mind in commit
    bdba91ec70fb5ccbdeb1c7068319adc6ea9e1a7d -- maybe Patrick can
    double-check this?

    -- example setup --
    ifconfig ifb0 up
    tc qdisc add dev ifb0 root netem delay 2s
    tc qdisc add dev $ETH root handle 1: prio
    tc filter add dev $ETH parent 1: protocol ip prio 10 u32 \
    match ip dst 172.24.110.6/32 flowid 1:1 \
    action mirred egress redirect dev ifb0
    ping -c1 172.24.110.6

    Signed-off-by: Lucas Nussbaum
    Signed-off-by: David S. Miller

    Lucas Nussbaum
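
The pattern behind the fix can be sketched as follows: a classification step with side effects is run once and the saved result drives the switch. Everything here (`enum verdict`, `classify`, `prio_classify_once`, `classify_call_count`) is a hypothetical stand-in for tc_classify() and prio_classify(), not kernel code.

```c
#include <assert.h>

/* Hypothetical verdicts, loosely modelled on the TC_ACT_* codes. */
enum verdict { V_OK, V_SHOT, V_STOLEN };

static int classify_calls;   /* counts side-effecting classifications */

/* Stand-in for tc_classify(): running it has side effects (here a
 * counter; in the commit, a mirred redirect of the packet). */
enum verdict classify(int pkt)
{
    classify_calls++;
    if (pkt < 0)
        return V_SHOT;
    return pkt == 0 ? V_OK : V_STOLEN;
}

/* Classify exactly once, then switch on the saved result. Returns 1 if
 * the caller should still enqueue the packet, 0 otherwise. */
int prio_classify_once(int pkt)
{
    enum verdict result = classify(pkt);

    switch (result) {
    case V_STOLEN:
    case V_SHOT:
        return 0;
    default:
        assert(result == V_OK);
        return 1;
    }
}

/* Accessor so callers can check how often classification ran. */
int classify_call_count(void)
{
    return classify_calls;
}
```

Calling `classify()` a second time inside the switch expression would run the action again, which is the duplicated redirect the commit message reports.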
     

14 Aug, 2007

1 commit


31 Jul, 2007

3 commits


18 Jul, 2007

2 commits


15 Jul, 2007

6 commits

  • The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE,
    remove the old code. The config option will be kept around to select
    the equivalent NET_CLS_ACT options for a short time to allow easier
    upgrades.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return
    it to the qdisc, which could handle it internally or ignore it. With
    NET_CLS_ACT however, tc_classify starts over at the first classifier
    and never returns it to the qdisc. This makes it impossible to support
    qdisc-internal reclassification, which in turn makes it impossible to
    remove the old NET_CLS_POLICE code without breaking compatibility since
    we have two qdiscs (CBQ and ATM) that support this.

    This patch adds a tc_classify_compat function that handles
    reclassification the old way and changes CBQ and ATM to use it.

    This again is of course not fully backwards compatible with the previous
    NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain
    compatibility *and* support qdisc internal reclassification with
    NET_CLS_ACT, but this seems like the better choice over keeping the two
    incompatible options around forever.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Handle act_api classification results.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Handle act_api classification results.

    The ATM scheduler behaves slightly differently from other schedulers
    in that it only handles policer results for successful classifications;
    this behaviour is retained for the act_api case.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • As noticed by Ranko Zivojnovic, calling qdisc_run
    from the timer handler can result in deadlock:

    > CPU#0
    >
    > qdisc_watchdog() fires and gets dev->queue_lock
    > qdisc_run()...qdisc_restart()...
    > -> releases dev->queue_lock and enters dev_hard_start_xmit()
    >
    > CPU#1
    >
    > tc del qdisc dev ...
    > qdisc_graft()...dev_graft_qdisc()...dev_deactivate()...
    > -> grabs dev->queue_lock ...
    >
    > qdisc_reset()...{cbq,hfsc,htb,netem,tbf}_reset()...qdisc_watchdog_cancel()...
    > -> hrtimer_cancel() - waiting for the qdisc_watchdog() to exit, while still
    > holding dev->queue_lock
    >
    > CPU#0
    >
    > dev_hard_start_xmit() returns ...
    > -> wants to get dev->queue_lock(!)
    >
    > DEADLOCK!

    The entire optimization is a bit questionable IMO, it moves potentially
    large parts of NET_TX_SOFTIRQ work to TIMER_SOFTIRQ/HRTIMER_SOFTIRQ,
    which kind of defeats the separation of them.

    Signed-off-by: Patrick McHardy
    Acked-by: Ranko Zivojnovic
    Signed-off-by: David S. Miller

    Patrick McHardy
     

12 Jul, 2007

1 commit


11 Jul, 2007

9 commits

  • Currently the HTB scheduler does not correctly account for TSO packets
    which causes large inaccuracies in the bandwidth control when using TSO.
    This patch allows the HTB scheduler to work with TSO enabled devices.

    Signed-off-by: Ranjit Manomohan
    Signed-off-by: David S. Miller

    Ranjit Manomohan
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Use the generic estimator instead of reimplementing (parts of) it.
    For compatibility always create a default estimator for new classes.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Remove stats_lock pointers from qdisc-internal structures; in all cases
    it points to dev->queue_lock. The only case where it is necessary is for
    top-level qdiscs, where it might also point to dev->ingress_lock in case
    of the ingress qdisc. Also remove it from actions completely; it always
    points to the action's internal lock.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • The generic estimator is always built in anyway and all the config option
    does is prevent including a minimal amount of code for setting it up.
    Additionally the option is already automatically selected for most cases.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Add the new sch_rr qdisc for multiqueue network device support. Allow
    sch_prio and sch_rr to be compiled with or without multiqueue hardware
    support.

    sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This
    was done since sch_prio and sch_rr only differ in their dequeue
    routine.

    Signed-off-by: Peter P Waskiewicz Jr
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Peter P Waskiewicz Jr
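
Since the commit says the two qdiscs differ only in their dequeue routine, the contrast can be sketched with per-band counters standing in for real packet queues. `struct banded_q`, `prio_dequeue` and `rr_dequeue` are hypothetical names, and the round-robin bookkeeping is an assumption about how such a routine could look, not the code from the patch.

```c
#include <assert.h>

#define BANDS 3

/* Hypothetical qdisc state: per-band packet counts stand in for queues. */
struct banded_q {
    int qlen[BANDS];
    int curband;     /* next band to try; used only by round robin */
};

/* prio-style dequeue: always serve the lowest-numbered non-empty band.
 * Returns the band served, or -1 if every band is empty. */
int prio_dequeue(struct banded_q *q)
{
    for (int b = 0; b < BANDS; b++) {
        if (q->qlen[b] > 0) {
            q->qlen[b]--;
            return b;
        }
    }
    return -1;
}

/* rr-style dequeue: resume with the band after the one served last. */
int rr_dequeue(struct banded_q *q)
{
    assert(q->curband >= 0 && q->curband < BANDS);
    for (int i = 0; i < BANDS; i++) {
        int b = (q->curband + i) % BANDS;

        if (q->qlen[b] > 0) {
            q->qlen[b]--;
            q->curband = (b + 1) % BANDS;
            return b;
        }
    }
    return -1;
}
```

With two packets in band 0 and two in band 1, the prio variant drains band 0 first, while the round-robin variant alternates between the two bands.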
     
  • Add the multiqueue hardware device support API to the core network
    stack. Allow drivers to allocate multiple queues and manage them at
    the netdev level if they choose to do so.

    Added a new field to sk_buff, namely queue_mapping, for drivers to
    know which tx_ring to select based on OS classification of the flow.

    Signed-off-by: Peter P Waskiewicz Jr
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Peter P Waskiewicz Jr
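
A small sketch of how a driver might consume the new field: the stack records the chosen queue in the packet, and the driver uses it to pick a TX ring. `struct packet`, `struct foo_driver` and `foo_xmit` are hypothetical; only the field name `queue_mapping` comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TX_RINGS 4

/* Hypothetical packet carrying the queue chosen by the stack. */
struct packet {
    uint16_t queue_mapping;
};

/* Hypothetical driver state: one counter per hardware TX ring. */
struct foo_driver {
    unsigned int tx_count[NUM_TX_RINGS];
};

/* Transmit on the ring named by queue_mapping; returns the ring used. */
unsigned int foo_xmit(struct foo_driver *drv, const struct packet *pkt)
{
    unsigned int ring = pkt->queue_mapping;

    assert(ring < NUM_TX_RINGS);   /* the stack must pick a valid queue */
    drv->tx_count[ring]++;         /* stand-in for posting to the ring */
    return ring;
}
```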
     
  • Changes :

    - netif_queue_stopped need not be called inside qdisc_restart as
    it has been called already in qdisc_run() before the first skb
    is sent, and in __qdisc_run() after each intermediate skb is
    sent (note : we are the only sender, so the queue cannot get
    stopped once the tx lock has been taken in the ~LLTX case).

    - BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
    meant more packets are available, and __qdisc_run used to loop
    when qdisc_restart() returned -1. During those days, it was
    necessary to make sure that qlen is never less than zero, since
    __qdisc_run would get into an infinite loop if no packets are on
    the queue and this bug in qdisc was there (and worse - no more
    skbs could ever get queue'd as we hold the queue lock too). With
    Herbert's recent change to return values, this check is not
    required. Hopefully Herbert can validate this change. If this check
    is required at all, it should be added to skb_dequeue (in the failure
    case), and not to qdisc_qlen.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     
  • New changes :

    - Incorporated Peter Waskiewicz's comments.
    - Re-added one warning message (on driver returning wrong value).

    Previous changes :

    - Converted to use switch/case code which looks neater.

    - "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
    check should be removed, since driver will return NETDEV_TX_LOCKED only
    if lockless is true and driver has to do the locking. In the original
    code as well as the latest code, this code can result in a bug where
    if LLTX is not set for a driver (lockless == 0) but the driver is written
    wrongly to do a trylock (despite LLTX not being set), the driver returns
    LOCKED. But since lockless is zero, the packet is requeue'd instead of
    calling collision code which will issue warning and free up the skb.
    Instead this skb will be retried with this driver next time, and the same
    result will ensue. Removing this check will catch these driver bugs instead
    of hiding the problem. I am keeping this change to readability section
    since :
    a. it is confusing to check two things as it is; and
    b. it is difficult to keep this check in the changed 'switch' code.

    - Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
    the work being done and easier to understand) and do_dev_requeue to
    dev_requeue_skb, merged handle_dev_cpu_collision and tx_islocked to
    dev_handle_collision (handle_dev_cpu_collision is a small routine with only
    one caller, so there is no need to have two separate routines, which also
    results in getting rid of two macros), etc.

    - Removed an XXX comment as it should never fail (I suspect this was related
    to batch skb WIP, Jamal ?). Converted some functions to original coding
    style of having the return values and the function name on same line, eg
    prio2list.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
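
The switch/case structure described above can be sketched like this. The `TX_*` constants mirror the values of NETDEV_TX_OK, NETDEV_TX_BUSY and NETDEV_TX_LOCKED, while `enum tx_action`, `handle_tx_result` and the `warned` counter are hypothetical and only model the decision, not the actual requeue or collision handling.

```c
#include <assert.h>
#include <stddef.h>

/* Driver transmit return codes (same values as the NETDEV_TX_* codes). */
enum { TX_OK = 0, TX_BUSY = 1, TX_LOCKED = -1 };

/* Hypothetical outcome of the dispatch. */
enum tx_action { ACT_DONE, ACT_REQUEUE, ACT_COLLISION };

/* Decide what to do with a packet after the driver's transmit attempt.
 * No `lockless` flag is consulted: TX_LOCKED always goes to the
 * collision path, so a misbehaving driver is caught, not hidden. */
enum tx_action handle_tx_result(int ret, int *warned)
{
    assert(warned != NULL);
    switch (ret) {
    case TX_OK:
        return ACT_DONE;
    case TX_LOCKED:
        return ACT_COLLISION;    /* dev_handle_collision() in the commit */
    default:
        if (ret != TX_BUSY)
            (*warned)++;         /* driver returned an unexpected value */
        return ACT_REQUEUE;      /* dev_requeue_skb() in the commit */
    }
}
```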