17 Dec, 2014

1 commit

  • [ Upstream commit 69204cf7eb9c5a72067ce6922d4699378251d053 ]

    commit 46e5da40ae (net: qdisc: use rcu prefix and silence
    sparse warnings) triggers a spurious warning:

    net/sched/sch_fq_codel.c:97 suspicious rcu_dereference_check() usage!

    The code should be using the _bh variant of rcu_dereference.

    Signed-off-by: Valdis Kletnieks
    Acked-by: Eric Dumazet
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Valdis.Kletnieks@vt.edu
     

30 Oct, 2014

1 commit


22 Oct, 2014

1 commit


10 Oct, 2014

1 commit

  • Restore the quota fairness between qdisc's, that we broke with commit
    5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").

    Before that commit, the quota in __qdisc_run() were in packets as
    dequeue_skb() would only dequeue a single packet, that assumption
    broke with bulk dequeue.

    We choose not to account for the number of packets inside the TSO/GSO
    packets (accessable via "skb_gso_segs"). As the previous fairness
    also had this "defect". Thus, GSO/TSO packets counts as a single
    packet.

    Further more, we choose to slack on accuracy, by allowing a bulk
    dequeue try_bulk_dequeue_skb() to exceed the "packets" limit, only
    limited by the BQL bytelimit. This is done because BQL prefers to get
    its full budget for appropriate feedback from TX completion.

    In future, we might consider reworking this further and, if it allows,
    switch to a time-based model, as suggested by Eric. Right now, we only
    restore old semantics.

    Joint work with Eric, Hannes, Daniel and Jesper. Hannes wrote the
    first patch in cooperation with Daniel and Jesper. Eric rewrote the
    patch.

    Fixes: 5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

09 Oct, 2014

2 commits


08 Oct, 2014

1 commit

  • Testing xmit_more support with netperf and connected UDP sockets,
    I found strange dst refcount false sharing.

    Current handling of IFF_XMIT_DST_RELEASE is not optimal.

    Dropping dst in validate_xmit_skb() is certainly too late in case
    packet was queued by cpu X but dequeued by cpu Y

    The logical point to take care of drop/force is in __dev_queue_xmit()
    before even taking qdisc lock.

    As Julian Anastasov pointed out, need for skb_dst() might come from some
    packet schedulers or classifiers.

    This patch adds new helper to cleanly express needs of various drivers
    or qdiscs/classifiers.

    Drivers that need skb_dst() in their ndo_start_xmit() should call
    following helper in their setup instead of the prior :

    dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
    ->
    netif_keep_dst(dev);

    Instead of using a single bit, we use two bits, one being
    eventually rebuilt in bonding/team drivers.

    The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
    rebuilt in bonding/team. Eventually, we could add something
    smarter later.

    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2014

3 commits

  • Using the tcf_proto pointer 'tp' from inside the classifiers callback
    is not valid because it may have been cleaned up by another call_rcu
    occuring on another CPU.

    'tp' is currently being used by tcf_unbind_filter() in this patch we
    move instances of tcf_unbind_filter outside of the call_rcu() context.
    This is safe to do because any running schedulers will either read the
    valid class field or it will be zeroed.

    And all schedulers today when the class is 0 do a lookup using the
    same call used by the tcf_exts_bind(). So even if we have a running
    classifier hit the null class pointer it will do a lookup and get
    to the same result. This is particularly fragile at the moment because
    the only way to verify this is to audit the schedulers call sites.

    Reported-by: Cong Wang
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • It is not RCU safe to destroy the action chain while there
    is a possibility of readers accessing it. Move this code
    into the rcu callback using the same rcu callback used in the
    code patch to make a change to head.

    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This removes the tcf_proto argument from the ematch code paths that
    only need it to reference the net namespace. This allows simplifying
    qdisc code paths especially when we need to tear down the ematch
    from an RCU callback. In this case we can not guarentee that the
    tcf_proto structure is still valid.

    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     

06 Oct, 2014

1 commit

  • Standard qdisc API to setup a timer implies an atomic operation on every
    packet dequeue : qdisc_unthrottled()

    It turns out this is not really needed for FQ, as FQ has no concept of
    global qdisc throttling, being a qdisc handling many different flows,
    some of them can be throttled, while others are not.

    Fix is straightforward : add a 'bool throttle' to
    qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled()
    in sch_fq.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Oct, 2014

2 commits

  • The result of a negated container has to be inverted before checking for
    early ending.

    This fixes my previous attempt (17c9c8232663a47f074b7452b9b034efda868ca7) to
    make inverted containers work correctly.

    Signed-off-by: Ignacy Gawędzki
    Signed-off-by: David S. Miller

    Ignacy Gawędzki
     
  • Suspicious RCU usage in qdisc_watchdog call needs to be done inside
    rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
    need to ensure timer is cancelled before removing qdisc structure.

    [ 3992.191339] ===============================
    [ 3992.191340] [ INFO: suspicious RCU usage. ]
    [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
    [ 3992.191345] -------------------------------
    [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
    [ 3992.191348]
    [ 3992.191348] other info that might help us debug this:
    [ 3992.191348]
    [ 3992.191351]
    [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
    [ 3992.191353] no locks held by swapper/1/0.
    [ 3992.191355]
    [ 3992.191355] stack backtrace:
    [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
    [ 3992.191360] Hardware name: /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
    [ 3992.191362] 0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
    [ 3992.191366] ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
    [ 3992.191370] ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
    [ 3992.191374] Call Trace:
    [ 3992.191376] [] dump_stack+0x4e/0x68
    [ 3992.191387] [] lockdep_rcu_suspicious+0xe6/0x130
    [ 3992.191392] [] qdisc_watchdog+0x8a/0xb0
    [ 3992.191396] [] __run_hrtimer+0x72/0x420
    [ 3992.191399] [] ? hrtimer_interrupt+0x7d/0x240
    [ 3992.191403] [] ? tc_classify+0xc0/0xc0
    [ 3992.191406] [] hrtimer_interrupt+0xff/0x240
    [ 3992.191410] [] ? __atomic_notifier_call_chain+0x5/0x140
    [ 3992.191415] [] local_apic_timer_interrupt+0x3b/0x60
    [ 3992.191419] [] smp_apic_timer_interrupt+0x45/0x60
    [ 3992.191422] [] apic_timer_interrupt+0x6f/0x80
    [ 3992.191424] [] ? cpuidle_enter_state+0x73/0x2e0
    [ 3992.191432] [] ? cpuidle_enter_state+0x6e/0x2e0
    [ 3992.191437] [] cpuidle_enter+0x17/0x20
    [ 3992.191441] [] cpu_startup_entry+0x3d1/0x4a0
    [ 3992.191445] [] ? clockevents_config_and_register+0x26/0x30
    [ 3992.191448] [] start_secondary+0x1b6/0x260

    Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings")
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Oct, 2014

3 commits

  • Validation of skb can be pretty expensive :

    GSO segmentation and/or checksum computations.

    We can do this without holding qdisc lock, so that other cpus
    can queue additional packets.

    Trick is that requeued packets were already validated, so we carry
    a boolean so that sch_direct_xmit() can validate a fresh skb list,
    or directly use an old one.

    Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
    host.

    Turning TSO on or off had no effect on throughput, only few more cpu
    cycles. Lock contention on qdisc lock disappeared.

    Same if disabling TX checksum offload.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The TSO and GSO segmented packets already benefit from bulking
    on their own.

    The TSO packets have always taken advantage of the only updating
    the tailptr once for a large packet.

    The GSO segmented packets have recently taken advantage of
    bulking xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
    branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
    Move main gso loop out of dev_hard_start_xmit() into helper.")
    allowing qdisc requeue of remaining list. And via commit
    ce93718fb7cd ("net: Don't keep around original SKB when we
    software segment GSO frames.").

    This patch allow further bulking of TSO/GSO packets together,
    when dequeueing from the qdisc.

    Testing:
    Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
    netperf-wrapper. Bulking several TSO show no performance regressions
    (requeues were in the area 32 requeues/sec).

    Bulking several GSOs does show small regression or very small
    improvement (requeues were in the area 8000 requeues/sec).

    Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
    latency. Base-case, which is "normal" GSO bulking, sees varying
    high-prio queue delay between 0.38ms to 0.47ms. Bulking several GSOs
    together, result in a stable high-prio queue delay of 0.50ms.

    Using igb at 100Mbit/s with GSO bulking, shows an improvement.
    Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms

    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
    sending/processing an entire skb list.

    This patch implements qdisc bulk dequeue, by allowing multiple packets
    to be dequeued in dequeue_skb().

    The optimization principle for this is two fold, (1) to amortize
    locking cost and (2) avoid expensive tailptr update for notifying HW.
    (1) Several packets are dequeued while holding the qdisc root_lock,
    amortizing locking cost over several packet. The dequeued SKB list is
    processed under the TXQ lock in dev_hard_start_xmit(), thus also
    amortizing the cost of the TXQ lock.
    (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
    API to delay HW tailptr update, which also reduces the cost per
    packet.

    One restriction of the new API is that every SKB must belong to the
    same TXQ. This patch takes the easy way out, by restricting bulk
    dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
    qdisc only have attached a single TXQ.

    Some detail about the flow; dev_hard_start_xmit() will process the skb
    list, and transmit packets individually towards the driver (see
    xmit_one()). In case the driver stops midway in the list, the
    remaining skb list is returned by dev_hard_start_xmit(). In
    sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

    To avoid overshooting the HW limits, which results in requeuing, the
    patch limits the amount of bytes dequeued, based on the drivers BQL
    limits. In-effect bulking will only happen for BQL enabled drivers.

    Small amounts for extra HoL blocking (2x MTU/0.24ms) were
    measured at 100Mbit/s, with bulking 8 packets, but the
    oscillating nature of the measurement indicate something, like
    sched latency might be causing this effect. More comparisons
    show, that this oscillation goes away occationally. Thus, we
    disregard this artifact completely and remove any "magic" bulking
    limit.

    For now, as a conservative approach, stop bulking when seeing TSO and
    segmented GSO packets. They already benefit from bulking on their own.
    A followup patch add this, to allow easier bisect-ability for finding
    regressions.

    Jointed work with Hannes, Daniel and Florian.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

03 Oct, 2014

1 commit


02 Oct, 2014

2 commits

  • This fixes the following crash:

    [ 63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
    [ 63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
    [ 63.980094] RIP: 0010:[] [] u32_destroy_key+0x27/0x6d
    [ 63.980094] RSP: 0018:ffff880117dffcc0 EFLAGS: 00010202
    [ 63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
    [ 63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
    [ 63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
    [ 63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
    [ 63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
    [ 63.980094] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 63.980094] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
    [ 63.980094] Stack:
    [ 63.980094] ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
    [ 63.980094] ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
    [ 63.980094] ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
    [ 63.980094] Call Trace:
    [ 63.980094] [] u32_delete_key_freepf_rcu+0x1b/0x1d
    [ 63.980094] [] rcu_process_callbacks+0x3bb/0x691
    [ 63.980094] [] ? rcu_process_callbacks+0x2c1/0x691
    [ 63.980094] [] ? u32_destroy_key+0x6d/0x6d
    [ 63.980094] [] __do_softirq+0x142/0x323
    [ 63.980094] [] run_ksoftirqd+0x23/0x53
    [ 63.980094] [] smpboot_thread_fn+0x203/0x221
    [ 63.980094] [] ? smpboot_unpark_thread+0x33/0x33
    [ 63.980094] [] kthread+0xc9/0xd1
    [ 63.980094] [] ? do_wait_for_common+0xf8/0x125
    [ 63.980094] [] ? __kthread_parkme+0x61/0x61
    [ 63.980094] [] ret_from_fork+0x7c/0xb0
    [ 63.980094] [] ? __kthread_parkme+0x61/0x61

    tp could be freed in call_rcu callback too, the order is not guaranteed.

    John Fastabend says:

    ====================
    Its worth noting why this is safe. Any running schedulers will either
    read the valid class field or it will be zeroed.

    All schedulers today when the class is 0 do a lookup using the
    same call used by the tcf_exts_bind(). So even if we have a running
    classifier hit the null class pointer it will do a lookup and get
    to the same result. This is particularly fragile at the moment because
    the only way to verify this is to audit the schedulers call sites.
    ====================

    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    WANG Cong
     
  • This patch fixes the following crash:

    [ 166.670795] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 166.674230] IP: [] __list_del_entry+0x5c/0x98
    [ 166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
    [ 166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
    [ 166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
    [ 166.674230] RIP: 0010:[] [] __list_del_entry+0x5c/0x98
    [ 166.674230] RSP: 0018:ffff8800cae7f7d0 EFLAGS: 00010207
    [ 166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
    [ 166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
    [ 166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
    [ 166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
    [ 166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
    [ 166.674230] FS: 00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 166.674230] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
    [ 166.674230] Stack:
    [ 166.674230] ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
    [ 166.674230] ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
    [ 166.674230] 0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
    [ 166.674230] Call Trace:
    [ 166.674230] [] list_del+0xd/0x2b
    [ 166.674230] [] tcf_action_destroy+0x4c/0x71
    [ 166.674230] [] tcf_exts_destroy+0x20/0x2d
    [ 166.674230] [] tcindex_delete+0x196/0x1b7

    struct list_head can not be simply copied and we should always init it.

    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    WANG Cong
     

30 Sep, 2014

5 commits

  • After previous patches to simplify qstats the qstats can be
    made per cpu with a packed union in Qdisc struct.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This removes the use of qstats->qlen variable from the classifiers
    and makes it an explicit argument to gnet_stats_copy_queue().

    The qlen represents the qdisc queue length and is packed into
    the qstats at the last moment before passnig to user space. By
    handling it explicitely we avoid, in the percpu stats case, having
    to figure out which per_cpu variable to put it in.

    It would probably be best to remove it from qstats completely
    but qstats is a user space ABI and can't be broken. A future
    patch could make an internal only qstats structure that would
    avoid having to allocate an additional u32 variable on the
    Qdisc struct. This would make the qstats struct 128bits instead
    of 128+32.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This adds helpers to manipulate qstats logic and replaces locations
    that touch the counters directly. This simplifies future patches
    to push qstats onto per cpu counters.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • In order to run qdisc's without locking statistics and estimators
    need to be handled correctly.

    To resolve bstats make the statistics per cpu. And because this is
    only needed for qdiscs that are running without locks which is not
    the case for most qdiscs in the near future only create percpu
    stats when qdiscs set the TCQ_F_CPUSTATS flag.

    Next because estimators use the bstats to calculate packets per
    second and bytes per second the estimator code paths are updated
    to use the per cpu statistics.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Negated expressions and sub-expressions need to have their flags checked for
    TCF_EM_INVERT and their result negated accordingly.

    Signed-off-by: Ignacy Gawędzki
    Signed-off-by: David S. Miller

    Ignacy Gawędzki
     

29 Sep, 2014

4 commits


26 Sep, 2014

1 commit

  • While using a MQ + NETEM setup, I had confirmation that the default
    timer migration ( /proc/sys/kernel/timer_migration ) is killing us.

    Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
    queues) :

    EST="est 1sec 4sec"
    for ETH in eth1
    do
    tc qd del dev $ETH root 2>/dev/null
    tc qd add dev $ETH root handle 1: mq
    tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
    tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
    tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
    tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
    tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
    tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
    tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
    tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
    done

    We can see that timers get migrated into a single cpu, presumably idle
    at the time timers are set up.
    Then all qdisc dequeues run from this cpu and huge lock contention
    happens. This single cpu is stuck in softirq mode and cannot dequeue
    fast enough.

    39.24% [kernel] [k] _raw_spin_lock
    2.65% [kernel] [k] netem_enqueue
    1.80% [kernel] [k] netem_dequeue
    1.63% [kernel] [k] copy_user_enhanced_fast_string
    1.45% [kernel] [k] _raw_spin_lock_bh

    By pinning qdisc timers on the cpu running the qdisc, we respect proper
    XPS setting and remove this lock contention.

    5.84% [kernel] [k] netem_enqueue
    4.83% [kernel] [k] _raw_spin_lock
    2.92% [kernel] [k] copy_user_enhanced_fast_string

    Current Qdiscs that benefit from this change are :

    netem, cbq, fq, hfsc, tbf, htb.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Sep, 2014

1 commit


23 Sep, 2014

4 commits

  • $ grep CONFIG_CLS_U32_MARK .config
    # CONFIG_CLS_U32_MARK is not set

    net/sched/cls_u32.c: In function 'u32_change':
    net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used
    [-Wunused-label]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Changes to the cls_u32 classifier must appear atomic to the
    readers. Before this patch if a change is requested for both
    the exts and ifindex, first the ifindex is updated then the
    exts with tcf_exts_change(). This opens a small window where
    a reader can have a exts chain with an incorrect ifindex. This
    violates the the RCU semantics.

    Here we resolve this by always passing u32_set_parms() a copy
    of the tc_u_knode to work on and then inserting it into the hash
    table after the updates have been successfully applied.

    Tested with the following short script:

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
    u32 divisor 256

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
    u32 link 1: hashkey mask ffffff00 at 12 \
    match ip src 192.168.8.0/2

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 102 \
    handle 1::10 u32 classid 1:2 ht 1: \
    match ip src 192.168.8.0/8 match ip tos 0x0a 1e

    #tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
    handle 1::10 u32 classid 1:2 ht 1: \
    match ip src 1.1.0.0/8 match ip tos 0x0b 1e

    CC: Eric Dumazet
    CC: Jamal Hadi Salim
    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This fixes a missed free_percpu in the unwind code path and when
    keys are destroyed.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     
  • We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
    or increasing skb->cb[] size.

    Commit e0f31d849867 ("flow_keys: Record IP layer protocol in
    skb_flow_dissect()") broke IPoIB.

    Only current offender is sch_choke, and this one do not need an
    absolutely precise flow key.

    If we store 17 bytes of flow key, its more than enough. (Its the actual
    size of flow_keys if it was a packed structure, but we might add new
    fields at the end of it later)

    Signed-off-by: Eric Dumazet
    Fixes: e0f31d849867 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Sep, 2014

2 commits


17 Sep, 2014

4 commits

  • This ensures the tcf_exts_init() is called for all cases.

    Fixes: 952313bd62589cae216a57 ("net: sched: cls_cgroup use RCU")
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • When allocating a new structure we also need to call tcf_exts_init
    to initialize exts.

    A follow up patch might be in order to remove some of this code
    and do tcf_exts_assign(). With this we could remove the
    tcf_exts_init/tcf_exts_change pattern for some of the classifiers.
    As part of the future tcf_actions RCU series this will need to be
    done. For now fix the call here.

    Fixes e35a8ee5993ba81fd6c0 ("net: sched: fw use RCU")
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
    head: 54996b529ab70ca1d6f40677cd2698c4f7127e87
    commit: c7953ef23042b7c4fc2be5ecdd216aacff6df5eb [625/646] net: sched: cls_cgroup use RCU

    net/sched/cls_cgroup.c:130 cls_cgroup_change() warn: possible memory leak of 'new'
    net/sched/cls_cgroup.c:135 cls_cgroup_change() warn: possible memory leak of 'new'
    net/sched/cls_cgroup.c:139 cls_cgroup_change() warn: possible memory leak of 'new'

    Fixes: c7953ef23042b7c4fc2be5ecdd216aac ("net: sched: cls_cgroup use RCU")
    Reported-by: Dan Carpenter
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Add missing rcu_assign_pointer and missing annotation for ht_up
    in cls_u32.c

    Caught by kbuild bot,

    >> net/sched/cls_u32.c:378:36: sparse: incorrect type in initializer (different address spaces)
    net/sched/cls_u32.c:378:36: expected struct tc_u_hnode *ht
    net/sched/cls_u32.c:378:36: got struct tc_u_hnode [noderef] *ht_up
    >> net/sched/cls_u32.c:610:54: sparse: incorrect type in argument 4 (different address spaces)
    net/sched/cls_u32.c:610:54: expected struct tc_u_hnode *ht
    net/sched/cls_u32.c:610:54: got struct tc_u_hnode [noderef] *ht_up
    >> net/sched/cls_u32.c:684:18: sparse: incorrect type in assignment (different address spaces)
    net/sched/cls_u32.c:684:18: expected struct tc_u_hnode [noderef] *ht_up
    net/sched/cls_u32.c:684:18: got struct tc_u_hnode *[assigned] ht
    >> net/sched/cls_u32.c:359:18: sparse: dereference of noderef expression

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend