06 Jan, 2016

1 commit

  • When a qdisc is using per cpu stats (currently just the ingress
    qdisc), only the bstats are being freed. This also frees the qstats.
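
    A minimal sketch of the idea (hedged: the field names follow the per
    cpu stats work, but this is not the exact upstream diff):

    static void qdisc_free_percpu_stats(struct Qdisc *qdisc)
    {
            /* release both per-cpu areas, not just the byte/packet stats */
            free_percpu(qdisc->cpu_bstats);
            free_percpu(qdisc->cpu_qstats);
    }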

    Fixes: b0ab6f92752b9f9d8 ("net: sched: enable per cpu qstats")
    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Dec, 2015

1 commit

  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, while it should not: Daniele
    reported underflow errors:
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking: concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal as
    soon as it meets an mq/mqprio qdisc child.

    The second problem can be fixed with RCU protection: qdiscs are
    already freed after an RCU grace period, so qdisc_list_add() and
    qdisc_list_del() simply have to use the appropriate RCU list
    variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.
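
    A simplified sketch of both ideas (illustrative only; the traversal
    logic is condensed from the real code):

    /* qdisc_tree_decrease_qlen(): stop before touching an mq/mqprio root */
    while ((parentid = sch->parent)) {
            if (sch->flags & TCQ_F_NOPARENT)
                    break;          /* parent is mq/mqprio, leave it alone */
            sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
            if (sch == NULL)
                    break;
            sch->q.qlen -= n;
    }

    /* qdisc_list_add()/qdisc_list_del(): RCU-safe list handling so that
     * lockless readers can walk the list during concurrent updates */
    list_add_tail_rcu(&q->list, &root->list);
    list_del_rcu(&q->list);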

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2014

1 commit

  • Restore the quota fairness between qdiscs that we broke with commit
    5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").

    Before that commit, the quota in __qdisc_run() was in packets, as
    dequeue_skb() would only dequeue a single packet; that assumption
    broke with bulk dequeue.

    We choose not to account for the number of packets inside TSO/GSO
    packets (accessible via "skb_gso_segs"), as the previous fairness
    also had this "defect". Thus, a GSO/TSO packet counts as a single
    packet.

    Furthermore, we choose to slack on accuracy by allowing a bulk
    dequeue in try_bulk_dequeue_skb() to exceed the "packets" limit,
    limited only by the BQL byte limit. This is done because BQL prefers
    to get its full budget for appropriate feedback from TX completion.

    In the future, we might consider reworking this further and, if it
    allows, switch to a time-based model, as suggested by Eric. Right
    now, we only restore the old semantics.
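
    A condensed sketch of the restored accounting (close to, but not
    necessarily identical to, the merged code):

    void __qdisc_run(struct Qdisc *q)
    {
            int quota = weight_p;           /* quota counted in packets */
            int packets;

            while (qdisc_restart(q, &packets)) {
                    quota -= packets;       /* bulk dequeue may return > 1 */
                    if (quota <= 0 || need_resched()) {
                            __netif_schedule(q);
                            break;
                    }
            }
            qdisc_run_end(q);
    }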

    Joint work with Eric, Hannes, Daniel and Jesper. Hannes wrote the
    first patch in cooperation with Daniel and Jesper. Eric rewrote the
    patch.

    Fixes: 5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

08 Oct, 2014

1 commit

  • Testing xmit_more support with netperf and connected UDP sockets,
    I found strange dst refcount false sharing.

    Current handling of IFF_XMIT_DST_RELEASE is not optimal.

    Dropping the dst in validate_xmit_skb() is certainly too late in case
    the packet was queued by CPU X but dequeued by CPU Y.

    The logical point to take care of drop/force is in __dev_queue_xmit(),
    before even taking the qdisc lock.

    As Julian Anastasov pointed out, the need for skb_dst() might come from
    some packet schedulers or classifiers.

    This patch adds a new helper to cleanly express the needs of various
    drivers or qdiscs/classifiers.

    Drivers that need skb_dst() in their ndo_start_xmit() should call the
    following helper in their setup instead of the prior:

    dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
    ->
    netif_keep_dst(dev);

    Instead of using a single bit, we use two bits: one that can
    eventually be rebuilt in bonding/team drivers.

    The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
    rebuilt in bonding/team. Eventually, we could add something
    smarter later.
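
    The helper is essentially a one-liner (a sketch, assuming the two-bit
    scheme described above with the IFF_XMIT_DST_RELEASE_PERM companion
    flag):

    static inline void netif_keep_dst(struct net_device *dev)
    {
            /* clear both the rebuildable and the permanent release bits */
            dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
    }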

    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2014

3 commits

  • Validation of an skb can be pretty expensive:

    GSO segmentation and/or checksum computations.

    We can do this without holding the qdisc lock, so that other CPUs
    can queue additional packets.

    The trick is that requeued packets were already validated, so we carry
    a boolean so that sch_direct_xmit() can either validate a fresh skb
    list or directly use an old one.

    Tested on a 40Gb NIC (8 TX queues) with 200 concurrent flows on a
    48-thread host.

    Turning TSO on or off had no effect on throughput, only a few more CPU
    cycles. Lock contention on the qdisc lock disappeared.

    The same holds when disabling TX checksum offload.
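
    Where the boolean ends up (an abridged sketch of sch_direct_xmit(),
    whose signature gains the extra argument):

    /* int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
     *                     struct net_device *dev, struct netdev_queue *txq,
     *                     spinlock_t *root_lock, bool validate) */

    spin_unlock(root_lock);         /* drop the qdisc lock early */
    if (validate)                   /* requeued lists were already validated */
            skb = validate_xmit_skb_list(skb, dev);
    /* ... dev_hard_start_xmit() is then called under the txq lock only,
     * and root_lock is re-taken afterwards */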

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TSO and GSO segmented packets already benefit from bulking
    on their own.

    TSO packets have always taken advantage of updating the tail pointer
    only once for a large packet.

    GSO segmented packets have recently taken advantage of the bulking
    xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
    branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
    Move main gso loop out of dev_hard_start_xmit() into helper."),
    which allows qdisc requeue of the remaining list, and via commit
    ce93718fb7cd ("net: Don't keep around original SKB when we
    software segment GSO frames.").

    This patch allows further bulking of TSO/GSO packets together
    when dequeueing from the qdisc.

    Testing:
    Measuring HoL (Head-of-Line) blocking for TSO and GSO with
    netperf-wrapper. Bulking several TSO packets shows no performance
    regressions (requeues were in the area of 32 requeues/sec).

    Bulking several GSOs shows a small regression or a very small
    improvement (requeues were in the area of 8000 requeues/sec).

    Using ixgbe at 10Gbit/s with GSO bulking, we can measure some
    additional latency. The base case, which is "normal" GSO bulking, sees
    varying high-prio queue delay between 0.38ms and 0.47ms. Bulking
    several GSOs together results in a stable high-prio queue delay of
    0.50ms.

    Using igb at 100Mbit/s with GSO bulking shows an improvement.
    The base case sees varying high-prio queue delay between 2.23ms and 2.35ms

    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Based on DaveM's recent API work on dev_hard_start_xmit(), which allows
    sending/processing an entire skb list.

    This patch implements qdisc bulk dequeue, by allowing multiple packets
    to be dequeued in dequeue_skb().

    The optimization principle for this is twofold: (1) amortize the
    locking cost and (2) avoid the expensive tail pointer update for
    notifying HW.
    (1) Several packets are dequeued while holding the qdisc root_lock,
    amortizing the locking cost over several packets. The dequeued SKB
    list is processed under the TXQ lock in dev_hard_start_xmit(), thus
    also amortizing the cost of the TXQ lock.
    (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
    API to delay the HW tail pointer update, which also reduces the cost
    per packet.

    One restriction of the new API is that every SKB must belong to the
    same TXQ. This patch takes the easy way out by restricting bulk
    dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
    the qdisc has only a single TXQ attached.

    Some detail about the flow: dev_hard_start_xmit() processes the skb
    list and transmits packets individually towards the driver (see
    xmit_one()). In case the driver stops midway through the list, the
    remaining skb list is returned by dev_hard_start_xmit(). In
    sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

    To avoid overshooting the HW limits, which results in requeuing, the
    patch limits the number of bytes dequeued based on the driver's BQL
    limits. In effect, bulking will only happen for BQL-enabled drivers.

    Small amounts of extra HoL blocking (2x MTU/0.24ms) were
    measured at 100Mbit/s when bulking 8 packets, but the
    oscillating nature of the measurement indicates that something
    like scheduler latency might be causing this effect. More comparisons
    show that this oscillation goes away occasionally. Thus, we
    disregard this artifact completely and remove any "magic" bulking
    limit.

    For now, as a conservative approach, stop bulking when seeing TSO and
    segmented GSO packets. They already benefit from bulking on their own.
    A followup patch adds this, to allow easier bisect-ability for finding
    regressions.
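
    A condensed sketch of the bulk dequeue loop (simplified from the
    dequeue_skb()/try_bulk_dequeue_skb() pair this patch introduces):

    static void try_bulk_dequeue_skb(struct Qdisc *q, struct sk_buff *skb,
                                     const struct netdev_queue *txq)
    {
            /* budget comes from BQL; without BQL there is no bulking */
            int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;

            while (bytelimit > 0) {
                    struct sk_buff *nskb = q->dequeue(q);

                    if (!nskb)
                            break;
                    bytelimit -= nskb->len;         /* covers GSO len too */
                    skb->next = nskb;               /* chain onto the list */
                    skb = nskb;
            }
            skb->next = NULL;
    }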

    Joint work with Hannes, Daniel and Florian.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

30 Sep, 2014

1 commit

  • In order to run qdiscs without locking, statistics and estimators
    need to be handled correctly.

    To resolve bstats, make the statistics per-CPU. And because this is
    only needed for qdiscs that run without locks, which will not be
    the case for most qdiscs in the near future, only create per-CPU
    stats when a qdisc sets the TCQ_F_CPUSTATS flag.

    Next, because estimators use the bstats to calculate packets per
    second and bytes per second, the estimator code paths are updated
    to use the per-CPU statistics.
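
    A small sketch of the opt-in allocation (hedged: the flag is the one
    named above, but the allocation site shown here is condensed):

    if (sch->flags & TCQ_F_CPUSTATS) {
            /* per-cpu byte/packet counters, folded when stats are dumped */
            sch->cpu_bstats = alloc_percpu(struct gnet_stats_basic_cpu);
            if (!sch->cpu_bstats)
                    return -ENOMEM;
    }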

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

14 Sep, 2014

1 commit

  • Add __rcu notation to qdisc handling; by doing this we can make
    smatch output more legible. And anyway, some of the cases should
    be using rcu_dereference(); see qdisc_all_tx_empty(),
    qdisc_tx_changing(), and so on.

    Also, the *wake_queue() API is commonly called from driver timer
    routines without the RCU lock or rtnl lock, so I added rcu_read_lock()
    blocks around netif_wake_subqueue and netif_tx_wake_queue.
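
    The wake-path change looks roughly like this (a sketch; the real
    function differs slightly):

    void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
    {
            struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_index);

            if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &txq->state)) {
                    struct Qdisc *q;

                    rcu_read_lock();        /* callers may hold no RCU/rtnl lock */
                    q = rcu_dereference(txq->qdisc);
                    __netif_schedule(q);
                    rcu_read_unlock();
            }
    }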

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Sep, 2014

1 commit

  • More minor fixes to merge commit 53fda7f7f9e ("Merge branch 'xmit_list'"),
    which allows us to work with a list of SKBs.

    Fix the exit cases in qdisc_reset() and qdisc_destroy(), where a
    leftover requeued SKB (qdisc->gso_skb) has the potential of
    being an skb list; thus use kfree_skb_list().

    This is a followup to commit 10770bc2d1 ("qdisc: adjustments for
    API allowing skb list xmits").
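
    Sketch of the reset-path change (simplified; qdisc_destroy() gets the
    same kfree_skb_list() treatment):

    void qdisc_reset(struct Qdisc *qdisc)
    {
            const struct Qdisc_ops *ops = qdisc->ops;

            if (ops->reset)
                    ops->reset(qdisc);

            if (qdisc->gso_skb) {
                    kfree_skb_list(qdisc->gso_skb);   /* gso_skb may be a list */
                    qdisc->gso_skb = NULL;
                    qdisc->q.qlen = 0;
            }
    }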

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Jul, 2014

1 commit

  • In commit 371121057607e3127e19b3fa094330181b5b031e ("net:
    QDISC_STATE_RUNNING dont need atomic bit ops"),
    __QDISC_STATE_RUNNING was renamed to __QDISC___STATE_RUNNING,
    but the old name still appears in some comments and was not
    replaced with the new name completely.

    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

01 Apr, 2014

1 commit

  • This allows monitoring carrier on/off transitions and detecting link
    flapping issues:
    - new /sys/class/net/X/carrier_changes
    - new rtnetlink IFLA_CARRIER_CHANGES (getlink)
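
    The counter itself is a simple atomic bump in the carrier helpers
    (sketch of the netif_carrier_on() side; netif_carrier_off() mirrors it):

    void netif_carrier_on(struct net_device *dev)
    {
            if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
                    if (dev->reg_state == NETREG_UNINITIALIZED)
                            return;
                    atomic_inc(&dev->carrier_changes);  /* exposed via sysfs/netlink */
                    linkwatch_fire_event(dev);
                    if (netif_running(dev))
                            __netdev_watchdog_up(dev);
            }
    }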

    Tested:
    - grep . /sys/class/net/*/carrier_changes
    + ip link set dev X down/up
    + plug/unplug cable
    - updated iproute2: prints IFLA_CARRIER_CHANGES
    - iproute2 20121211-2 (debian): unchanged behavior

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    david decotigny
     

11 Jan, 2014

1 commit

  • Currently, the tx queue is selected implicitly in ndo_dfwd_start_xmit(). This
    causes several issues:

    - NETIF_F_LLTX was removed for macvlan, so txq locking is done for macvlan
    instead of the lower device, which misses the necessary txq synchronization
    for the lower device, such as txq stopping or freezing required by the dev
    watchdog or the control path.
    - dev_hard_start_xmit() was called with a NULL txq, which bypasses the net
    device watchdog.
    - dev_hard_start_xmit() does not check txq everywhere, which will lead to a
    crash when TSO is disabled for the lower device.

    Fix this by explicitly introducing a new parameter to .ndo_select_queue() just
    for selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was
    also extended to accept this parameter, and dev_queue_xmit_accel() was used to
    do the l2 forwarding transmission.

    With these fixes, NETIF_F_LLTX can be preserved for macvlan and there's no need
    to check txq against NULL in dev_hard_start_xmit(). Also, there's no need to
    keep a dedicated ndo_dfwd_start_xmit(); we can just reuse the code of
    dev_queue_xmit() to do the transmission.

    In the future, this is also required for macvtap l2 forwarding support, since
    it provides a necessary synchronization method.
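
    For reference, the shape of the changed hooks (a sketch; later kernels
    add a fallback argument on top of this):

    /* .ndo_select_queue() gains an accel_priv argument so l2 forwarding
     * offload can pick a queue on the lower device explicitly */
    u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
                            void *accel_priv);

    /* the macvlan-style transmit then goes through the regular xmit path */
    int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);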

    Cc: John Fastabend
    Cc: Neil Horman
    Cc: e1000-devel@lists.sourceforge.net
    Signed-off-by: Jason Wang
    Acked-by: Neil Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     

14 Dec, 2013

1 commit

  • After commit 95dc19299f74 ("pkt_sched: give visibility to mq slave
    qdiscs") we call qdisc_list_add() while the device qdisc might be
    the noop_qdisc one.

    This shows up as duplicates in "tc qdisc show", as all inactive devices
    point to noop_qdisc.

    Fix this by setting dev->qdisc to the new qdisc before calling
    ops->change() in attach_default_qdiscs().

    Add a WARN_ON_ONCE() to catch any future similar problem.
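
    The guard is roughly (a sketch of qdisc_list_add() with the new check;
    details condensed):

    void qdisc_list_add(struct Qdisc *q)
    {
            if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
                    struct Qdisc *root = qdisc_dev(q)->qdisc;

                    /* must never register against the noop placeholder */
                    WARN_ON_ONCE(root == &noop_qdisc);
                    list_add_tail(&q->list, &root->list);
            }
    }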

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Nov, 2013

1 commit

  • Add an operations structure that allows a network interface to export
    the fact that it supports packet forwarding in hardware between
    physical interfaces and other MAC layer devices assigned to it (such
    as macvlans). This operations structure can be used by virtual MAC
    devices to bypass software switching so that forwarding can be done
    in hardware more efficiently.
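
    The hooks look like this (a sketch of the two entry points; drivers
    hand back an opaque per-station cookie):

    /* attach a macvlan-style upper device to the physical lower device */
    void *(*ndo_dfwd_add_station)(struct net_device *pdev,
                                  struct net_device *dev);

    /* detach it again, using the cookie returned by add_station */
    void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv);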

    Signed-off-by: John Fastabend
    Signed-off-by: Neil Horman
    CC: Andy Gospodarek
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    John Fastabend
     

08 Oct, 2013

1 commit

  • Separate the unreg_list and the close_list in dev_close_many, preventing
    dev_close_many from permuting the unreg_list. The permutations of the
    unreg_list have resulted in cases where the loopback device is accessed
    after it has been freed in code such as dst_ifdown, resulting in subtle
    memory corruption.

    This is the second bug from sharing the storage between the close_list
    and the unreg_list. The issues that crop up with sharing are
    apparently too subtle to show up in normal testing or usage, so let's
    forget about being clever and use two separate lists.

    v2: Make all callers pass in a close_list to dev_close_many
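
    The resulting calling convention, roughly (a sketch of the
    single-device case in dev_close(), using the dedicated close_list
    member):

    if (dev->flags & IFF_UP) {
            LIST_HEAD(single);

            list_add(&dev->close_list, &single);
            dev_close_many(&single);
            list_del(&single);
    }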

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Sep, 2013

1 commit

  • Add an extra u64 rate parameter to psched_ratecfg_precompute()
    so that some qdiscs can opt in to 64-bit rates in the future,
    to overcome the ~34 Gbit limit.

    psched_ratecfg_getrate() reports a legacy structure to the
    tc utility, so if the actual rate is above the 32-bit rate field,
    cap it to the 34 Gbit limit.
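
    Sketch of the two pieces (close to the helpers in
    include/net/sch_generic.h, but condensed here):

    void psched_ratecfg_precompute(struct psched_ratecfg *r,
                                   const struct tc_ratespec *conf,
                                   u64 rate64);        /* new 64-bit rate */

    static inline void psched_ratecfg_getrate(struct tc_ratespec *res,
                                              const struct psched_ratecfg *r)
    {
            memset(res, 0, sizeof(*res));
            /* legacy u32 field: cap anything above 2^32 bytes/s (~34 Gbit) */
            res->rate = min_t(u64, r->rate_bytes_ps, ~0U);
            res->overhead = r->overhead;
    }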

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2013

1 commit

  • The pfifo_fast queue discipline has been used by default
    for all devices, but we have better choices now.

    This patch allows setting the default queueing discipline with a
    sysctl. This allows easy use of better queueing disciplines on all
    devices without having to use tc qdisc scripts. It is intended to
    provide an easy path for distributions to make fq_codel or sfq the
    default qdisc.

    This patch also makes pfifo_fast more of a first-class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems that do not use the sysctl
    is unchanged; they still get pfifo_fast.

    Also removes a leftover random # in sysctl net core.
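
    Minimal user-space illustration: the new knob is a plain sysctl
    (net.core.default_qdisc), so switching the default, e.g. to fq_codel,
    is just a write to /proc/sys/net/core/default_qdisc:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/net/core/default_qdisc", "w");

            if (!f) {
                    perror("default_qdisc");
                    return 1;
            }
            fputs("fq_codel\n", f);   /* qdiscs attached later use fq_codel */
            return fclose(f) ? 1 : 0;
    }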

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

15 Aug, 2013

1 commit

  • commit 56b765b79 ("htb: improved accuracy at high rates")
    broke the "linklayer atm" handling.

    tc class add ... htb rate X ceil Y linklayer atm

    The linklayer setting is implemented by modifying the rate table
    which is sent to the kernel. No direct parameter was
    transferred to the kernel indicating the linklayer setting.

    Commit 56b765b79 ("htb: improved accuracy at high rates")
    removed the use of the rate table system.

    To stay compatible with older iproute2 utilities, this patch detects
    the linklayer by parsing the rate table. It also supports future
    versions of iproute2 sending this linklayer parameter to the
    kernel directly. This is done by using the __reserved field in
    struct tc_ratespec to convey the chosen linklayer option, using
    only the lower 4 bits of this field.

    Linklayer detection is limited to speeds below 100Mbit/s, because
    at high rates the rtab gets too inaccurate, so bad that
    several fields contain the same values, resembling the ATM
    detection pattern. Fields even start to contain a "0" time to send,
    e.g. at 1000Mbit/s sending a 96-byte packet costs "0"; thus the rtab
    has been more broken than we first realized.
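
    The uapi side of this is small (a sketch of the pkt_sched.h additions;
    the in-kernel rate-table detection heuristic itself is omitted):

    /* lower 4 bits of tc_ratespec.__reserved carry the linklayer type */
    enum {
            TC_LINKLAYER_UNAWARE,   /* old iproute2: no linklayer info sent */
            TC_LINKLAYER_ETHERNET,
            TC_LINKLAYER_ATM,
    };
    #define TC_LINKLAYER_MASK 0x0F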

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

06 Aug, 2013

1 commit

  • VLAN devices are LLTX and don't update their own trans_start, so if
    dev_trans_start has to be called with a VLAN device, then 0 or a stale
    value will be returned. Currently bonding is the only such user, and
    it's needed for proper ARP monitoring when the slaves are VLANs.
    Fix this by extracting the VLAN's real device trans_start.
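
    Sketch of the change at the top of dev_trans_start() (the scan over
    the tx queues below it is the pre-existing code, shown for context):

    unsigned long dev_trans_start(struct net_device *dev)
    {
            unsigned long val, res;
            unsigned int i;

            if (is_vlan_dev(dev))
                    dev = vlan_dev_real_dev(dev);  /* use the underlying device */

            res = dev->trans_start;
            for (i = 0; i < dev->num_tx_queues; i++) {
                    val = netdev_get_tx_queue(dev, i)->trans_start;
                    if (val && time_after(val, res))
                            res = val;
            }
            return res;
    }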

    Suggested-by: David Miller
    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    nikolay@redhat.com
     

12 Jun, 2013

1 commit


03 Jun, 2013

1 commit

  • commit 56b765b79 ("htb: improved accuracy at high rates")
    broke the "overhead xxx" handling, as well as the "linklayer atm"
    attribute.

    tc class add ... htb rate X ceil Y linklayer atm overhead 10

    This patch restores the "overhead xxx" handling for htb, tbf
    and act_police.

    The "linklayer atm" issue needs a separate fix.

    Reported-by: Jesper Dangaard Brouer
    Signed-off-by: Eric Dumazet
    Cc: Vimalkumar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2013

1 commit

  • It seems that commit

    commit 292f1c7ff6cc10516076ceeea45ed11833bb71c7
    Author: Jiri Pirko
    Date: Tue Feb 12 00:12:03 2013 +0000

    sch: make htb_rate_cfg and functions around that generic

    introduces a small regression.

    Before:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst
    625b
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    rate 0bit 0pps backlog 0b 0p requeues 0
    lended: 0 borrowed: 0 giants: 0
    tokens: 31 ctokens: 31

    After:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst
    625b
    Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0)
    rate 1976bit 2pps backlog 0b 0p requeues 0
    lended: 41 borrowed: 0 giants: 0
    tokens: 1802 ctokens: 1802

    This is probably due to a lost u64 cast of the rate parameter in
    psched_ratecfg_precompute() (net/sched/sch_generic.c).
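
    A self-contained illustration of why the cast matters (user-space
    code, not the kernel fix itself; it just reproduces the 32-bit wrap
    that truncates rates above 2^32 bit/s, roughly 4.3 Gbit/s):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t rate_Bps = 625000000;                /* 5 Gbit/s in bytes/s */
            uint64_t wrapped  = rate_Bps << 3;            /* 32-bit shift, wraps */
            uint64_t correct  = (uint64_t)rate_Bps << 3;  /* widen first */

            printf("wrapped: %llu bit/s\ncorrect: %llu bit/s\n",
                   (unsigned long long)wrapped, (unsigned long long)correct);
            return 0;
    }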

    Signed-off-by: Sergey Popovich
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Sergey Popovich
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we are more likely to see the following behavior:

    We dequeue a packet from the qdisc in dequeue_skb(), then realize the
    target tx queue is in XOFF state in sch_direct_xmit(), and we have to
    hold the skb in gso_skb for later.

    This shows up in stats (tc -s qdisc dev eth0) as requeues.

    The problem with these requeues is that high-priority packets cannot be
    dequeued as long as this (possibly low-prio and big TSO) packet is not
    removed from gso_skb.

    At 1Gbps, a full-size TSO packet adds 500 us of extra latency.

    In some cases, we know that all packets dequeued from a qdisc are
    for a particular and known txq:

    - if the device is non multiqueue
    - for all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.

    This indeed reduces latencies for high-prio packets (or improves
    fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.
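
    Sketch of the check in dequeue_skb() (simplified; the real function
    also deals with an already-requeued gso_skb):

    static struct sk_buff *dequeue_skb(struct Qdisc *q)
    {
            const struct netdev_queue *txq = q->dev_queue;

            /* one qdisc <-> one txq: never dequeue into a stopped queue */
            if ((q->flags & TCQ_F_ONETXQUEUE) &&
                netif_xmit_frozen_or_stopped(txq))
                    return NULL;

            return q->dequeue(q);
    }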

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Sep, 2012

1 commit

  • It seems we need to provide the ability for stacked devices
    to use a specific lock_class_key for sch->busylock.

    We could instead default the l2tpeth tx_queue_len to 0 (no qdisc), but
    a user might use a qdisc anyway.

    (So the same fix is probably needed in other non-LLTX stacked drivers.)
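
    Mechanically this is a per-device lockdep class override (a sketch,
    assuming the qdisc_tx_busylock pointer this change adds to
    struct net_device):

    /* l2tp_eth setup: give its busylock its own lockdep class */
    static struct lock_class_key l2tp_eth_tx_busylock;
    dev->qdisc_tx_busylock = &l2tp_eth_tx_busylock;

    /* qdisc_alloc(): honour the override when initialising sch->busylock */
    spin_lock_init(&sch->busylock);
    lockdep_set_class(&sch->busylock,
                      dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);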

    Noticed while stressing an L2TPV3 setup:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc3+ #788 Not tainted
    -------------------------------------------------------
    netperf/4660 is trying to acquire lock:
    (l2tpsock){+.-...}, at: [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

    but task is already holding lock:
    (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&sch->busylock)->rlock){+.-...}:
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock_irqsave+0x4c/0x60
    [] __wake_up+0x32/0x70
    [] tty_wakeup+0x3e/0x80
    [] pty_write+0x73/0x80
    [] tty_put_char+0x3c/0x40
    [] process_echoes+0x142/0x330
    [] n_tty_receive_buf+0x8fb/0x1230
    [] flush_to_ldisc+0x142/0x1c0
    [] process_one_work+0x198/0x760
    [] worker_thread+0x186/0x4b0
    [] kthread+0x93/0xa0
    [] kernel_thread_helper+0x4/0x10

    -> #0 (l2tpsock){+.-...}:
    [] __lock_acquire+0x1628/0x1b10
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock+0x41/0x50
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ip_finish_output+0x3d0/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] tcp_transmit_skb+0x402/0xa60
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] sock_sendmsg+0xdc/0xf0
    [] sys_sendto+0xfe/0x130
    [] system_call_fastpath+0x16/0x1b
    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);

    *** DEADLOCK ***

    5 locks held by netperf/4660:
    #0: (sk_lock-AF_INET){+.+.+.}, at: [] tcp_sendmsg+0x2c/0x1040
    #1: (rcu_read_lock){.+.+..}, at: [] ip_queue_xmit+0x0/0x680
    #2: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x135/0x890
    #3: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0xe00
    #4: (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    stack backtrace:
    Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
    Call Trace:
    [] print_circular_bug+0x1fb/0x20c
    [] __lock_acquire+0x1628/0x1b10
    [] ? check_usage+0x9b/0x4d0
    [] ? __lock_acquire+0x2e4/0x1b10
    [] lock_acquire+0x90/0x200
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] _raw_spin_lock+0x41/0x50
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] ? dev_hard_start_xmit+0x5e/0xa70
    [] ? dev_queue_xmit+0x141/0xe00
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ? dev_hard_start_xmit+0xa70/0xa70
    [] ip_finish_output+0x3d0/0x890
    [] ? ip_finish_output+0x135/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] ? ip_local_out+0xa0/0xa0
    [] tcp_transmit_skb+0x402/0xa60
    [] ? tcp_md5_do_lookup+0x18e/0x1a0
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] ? inet_create+0x6b0/0x6b0
    [] ? sock_update_classid+0xc2/0x3b0
    [] ? sock_update_classid+0x130/0x3b0
    [] sock_sendmsg+0xdc/0xf0
    [] ? fget_light+0x3f9/0x4f0
    [] sys_sendto+0xfe/0x130
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x30/0x50
    [] ? finish_task_switch+0x83/0xf0
    [] ? finish_task_switch+0x46/0xf0
    [] ? sysret_check+0x1b/0x56
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Aug, 2012

1 commit

  • I believe net/core/dev.c is a better place for netif_notify_peers(),
    because other net event notify functions also live in this file.

    Also rename it to netdev_notify_peers().

    Cc: David S. Miller
    Cc: Ian Campbell
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang