07 Jun, 2015

1 commit

  • eBPF programs attached to ingress and egress qdiscs see inconsistent
    skb->data: for ingress the L2 header is already pulled, whereas for
    egress it is still present. This is known to program writers, who are
    currently forced to use the BPF_LL_OFF workaround.

    Since programs don't change skb internal pointers, it is safe to do the
    pull/push right around the invocation of the program; earlier taps and
    the later pt->func() will not be affected (see the sketch at the end of
    this entry). Multiple taps via packet_rcv()/tpacket_rcv() do the same
    trick around run_filter/BPF_PROG_RUN even if the skb is shared.

    This fix finally allows programs to use optimized LD_ABS/IND instructions
    without BPF_LL_OFF for higher performance.
    tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o:

                w/o JIT    w/ JIT
        before  20.5       23.6   Mpps
        after   21.8       26.6   Mpps

    Old programs with BPF_LL_OFF will still work as-is.

    We can now undo most of the earlier workaround commit:
    a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")

    Signed-off-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Alexei Starovoitov
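
    A minimal sketch of the pull/push described above, with names taken
    from the commit text (the exact call sites in cls_bpf/act_bpf may
    differ):

        /* On ingress the L2 header was already pulled by eth_type_trans(),
         * so push it back just for the duration of the program run; on
         * egress it is still present, so the program can run directly. */
        if (at_ingress) {
                __skb_push(skb, skb->mac_len);
                filter_res = BPF_PROG_RUN(prog, skb);
                __skb_pull(skb, skb->mac_len);
        } else {
                filter_res = BPF_PROG_RUN(prog, skb);
        }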
     

05 Jun, 2015

2 commits

  • This patch adds full IPv6 addresses into flow_keys and uses them as
    input to the flow hash function. The implementation supports either
    IPv4 or IPv6 addresses in a union, and a selector is used to determine
    how many words to input to jhash2 (see the sketch below).

    We also add flow_get_u32_dst and flow_get_u32_src functions which are
    used to get a u32 representation of the source and destination
    addresses. For IPv6, ipv6_addr_hash is called. These functions preserve
    the legacy u32 values of src and dst in flow_keys.

    With this patch, Ethertype and IP protocol are now included in the
    flow hash input.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
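
    A rough sketch of the union layout described above; field and selector
    names here are illustrative, not necessarily the kernel's:

        /* either address family in one union; addr_type says which */
        struct flow_keys_addrs {
                union {
                        struct {
                                __be32 src;
                                __be32 dst;
                        } v4;
                        struct {
                                struct in6_addr src;
                                struct in6_addr dst;
                        } v6;
                };
                u16 addr_type;  /* selects v4 vs. v6, and thereby how
                                 * many u32 words feed into jhash2() */
        };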
     
  • This patch changes flow hashing to use jhash2 over the flow_keys
    structure instead of just doing jhash_3words over src, dst, and ports.
    This method will allow us to take more input into the hashing function
    so that we can include full IPv6 addresses, VLAN, flow labels, etc.
    without needing to resort to xor'ing, which makes for a poor hash
    (see the sketch below).

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
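
    A hedged before/after sketch, assuming flow_keys is sized as a
    multiple of four bytes as jhash2() requires:

        /* before: only three fixed words went into the hash */
        hash = jhash_3words(keys.src, keys.dst, (__force u32)keys.ports,
                            hashrnd);

        /* after: hash the structure word-by-word, so new fields
         * (IPv6 addresses, VLAN, flow label) extend the input */
        hash = jhash2((const u32 *)&keys, sizeof(keys) / sizeof(u32),
                      hashrnd);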
     

02 Jun, 2015

1 commit

  • Conflicts:
    drivers/net/phy/amd-xgbe-phy.c
    drivers/net/wireless/iwlwifi/Kconfig
    include/net/mac80211.h

    iwlwifi/Kconfig and mac80211.h were both trivial overlapping
    changes.

    The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
    the bug fix that happened on the 'net' side is already integrated
    into the rest of the amd-xgbe driver.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 May, 2015

1 commit

  • For the mq qdisc, we add per-tx-queue qdiscs to the root qdisc
    for display purposes; however, that happens too early,
    before the new dev->qdisc is finally set. This causes
    q->list to point to an old root qdisc which is going to be
    freed right before a new one is assigned.

    Fix this by moving ->attach() after setting dev->qdisc (see the
    sketch at the end of this entry).

    For the record, this fixes the following crash:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
    list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
    CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
    ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
    ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x9c/0xb6
    [] ? __list_del_entry+0x5a/0x98
    [] warn_slowpath_fmt+0x46/0x48
    [] ? dev_graft_qdisc+0x5e/0x6a
    [] __list_del_entry+0x5a/0x98
    [] list_del+0xe/0x2d
    [] qdisc_list_del+0x1e/0x20
    [] qdisc_destroy+0x30/0xd6
    [] qdisc_graft+0x11d/0x243
    [] tc_get_qdisc+0x1a6/0x1d4
    [] ? mark_lock+0x2e/0x226
    [] rtnetlink_rcv_msg+0x181/0x194
    [] ? rtnl_lock+0x17/0x19
    [] ? rtnl_lock+0x17/0x19
    [] ? __rtnl_unlock+0x17/0x17
    [] netlink_rcv_skb+0x4d/0x93
    [] rtnetlink_rcv+0x26/0x2d
    [] netlink_unicast+0xcb/0x150
    [] ? might_fault+0x59/0xa9
    [] netlink_sendmsg+0x4fa/0x51c
    [] sock_sendmsg_nosec+0x12/0x1d
    [] sock_sendmsg+0x29/0x2e
    [] ___sys_sendmsg+0x1b4/0x23a
    [] ? native_sched_clock+0x35/0x37
    [] ? sched_clock_local+0x12/0x72
    [] ? sched_clock_cpu+0x9e/0xb7
    [] ? current_kernel_time+0xe/0x32
    [] ? lock_release_holdtime.part.29+0x71/0x7f
    [] ? read_seqcount_begin.constprop.27+0x5f/0x76
    [] ? trace_hardirqs_on_caller+0x17d/0x199
    [] ? __fget_light+0x50/0x78
    [] __sys_sendmsg+0x42/0x60
    [] SyS_sendmsg+0x12/0x1c
    [] system_call_fastpath+0x12/0x6f
    ---[ end trace ef29d3fb28e97ae7 ]---

    For the long term, we probably need to clean up the qdisc_graft() code
    in case it hides other bugs like this.

    Fixes: 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
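
    A minimal sketch of the reordering in the root-grafting path
    (simplified from qdisc_graft(); error handling elided):

        /* before (broken): ->attach() ran while dev->qdisc still
         * pointed at the old root, so q->list linked into a qdisc
         * that was about to be freed */

        /* after (fixed): publish the new root first, then attach
         * the per-tx-queue qdiscs for display */
        dev->qdisc = new ? new : &noop_qdisc;
        if (new && new->ops->attach)
                new->ops->attach(new);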
     

23 May, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

    Switchdev was a case of the RTNH_F_{EXTERNAL --> OFFLOAD}
    renaming overlapping with net-next changes of various
    sorts.

    phy.c was a case of two changes, one adding a local
    variable to a function whilst the second was removing
    one.

    tcp.c overlapped a deadlock fix with the addition of new tcp_info
    statistic values.

    macb.c involved the addition of two zynq device entries.

    skbuff.h involved adding back ipv4_daddr to nf_bridge_info
    whilst net-next changes put two other existing members of
    that struct into a union.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 May, 2015

1 commit

  • Vijay reported that a loop as simple as ...

    while true; do
            tc qdisc add dev foo root handle 1: prio
            tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
            tc qdisc del dev foo root
            rmmod cls_u32
    done

    ... will panic the kernel. Moreover, he bisected the change
    apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
    locking to netlink_lookup() and seq walker").

    The removal of synchronize_net() from the netlink socket
    triggering the qdisc to be removed seems to have uncovered
    an RCU vs. module reference count race in the tc API.
    Given that the RCU conversion was done after e341694e3eb5 ("netlink:
    Convert netlink_lookup() to use RCU protected hash table"),
    which added the synchronize_net() originally, the chance of
    hitting the bug was lower (though not impossible):

    When qdiscs that i) support attaching classifiers and
    ii) have at least one of them attached get deleted, they
    invoke tcf_destroy_chain(), and thus call into the ->destroy()
    handler from a classifier module.

    After the RCU conversion, all classifiers that have an internal
    prio list unlink them and initiate freeing via call_rcu()
    deferral.

    Meanwhile, tcf_destroy() already releases the reference to the
    tp->ops->owner module before the queued RCU callback handler
    has been invoked.

    A subsequent rmmod on the classifier module is then not prevented,
    since all module references are already dropped.

    By the time the kernel invokes the RCU callback handler from
    the module, that function address is then invalid.

    One way to fix it is to add an rcu_barrier() to
    unregister_tcf_proto_ops() to wait for all pending call_rcu()s
    to complete (see the sketch at the end of this entry).

    synchronize_rcu() is not appropriate, as under heavy RCU
    callback load registered call_rcu()s could be deferred
    longer than a grace period. In case we don't have any pending
    call_rcu()s, the barrier is allowed to return immediately.

    Since we came here via unregister_tcf_proto_ops(), there
    are no users of a given classifier anymore. Further nested
    call_rcu()s pointing into the module space are not being
    done anywhere.

    Only cls_bpf_delete_prog() may schedule a work item, to
    unlock pages eventually, but that is not in the range/context
    of cls_bpf anymore.

    Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Reported-by: Vijay Subramanian
    Signed-off-by: Daniel Borkmann
    Cc: John Fastabend
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Jamal Hadi Salim
    Cc: Alexei Starovoitov
    Tested-by: Vijay Subramanian
    Acked-by: Alexei Starovoitov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
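
    A minimal sketch of the barrier placement (list handling in
    unregister_tcf_proto_ops() elided):

        void unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
        {
                /* Wait for outstanding call_rcu() callbacks, which may
                 * still point into this module's text, to complete
                 * before rmmod can proceed; returns immediately when
                 * none are pending. */
                rcu_barrier();

                /* ... unlink ops from the classifier ops list ... */
        }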
     

15 May, 2015

1 commit

  • Fix compile error in net/sched/cls_flower.c

    net/sched/cls_flower.c: In function ‘fl_set_key’:
    net/sched/cls_flower.c:240:3: error: implicit declaration of
    function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
    err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);

    Introduced in 77b9900ef53ae

    Fixes: 77b9900ef53ae ("tc: introduce Flower classifier")
    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     

14 May, 2015

7 commits

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares for the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this; that should be no
    problem, since it also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • This patch introduces a flow-based filter. So far, only the most essential
    packet fields are supported.

    This patch is only the first step. There is a lot of potential for
    performance improvements, and a lot of features are still missing.
    They will be addressed in follow-up patches.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
    Add a couple of empty lines on the way.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • It seems all we want here is to avoid an endless 'goto reclassify' loop;
    tc_classify_compat even resets this counter when something other
    than TC_ACT_RECLASSIFY is returned, so this skb counter doesn't
    break hypothetical loops induced by something other than perpetual
    TC_ACT_RECLASSIFY return values (see the sketch at the end of this
    entry).

    skb_act_clone is now identical to skb_clone, so just use that.

    Tested with following (bogus) filter:
    tc filter add dev eth0 parent ffff: \
            protocol ip u32 match u32 0 0 police rate 10Kbit burst \
            64000 mtu 1500 action reclassify

    Acked-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
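
    A hedged sketch of the guard that remains (simplified; the bound
    and variable names are illustrative):

        /* inside tc_classify_compat(): a plain local counter now
         * breaks perpetual TC_ACT_RECLASSIFY, instead of per-skb
         * state that had to survive skb_act_clone() */
        if (err == TC_ACT_RECLASSIFY) {
                if (unlikely(limit++ >= 4))
                        return TC_ACT_SHOT;
                goto reclassify;
        }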
     
  • Four minor merge conflicts:

    1) qca_spi.c renamed the local variable used for the SPI device
    from spi_device to spi, meanwhile the spi_set_drvdata() call
    got moved further up in the probe function.

    2) Two changes were both adding new members to codel params
    structure, and thus we had overlapping changes to the
    initializer function.

    3) 'net' was making a fix to sk_release_kernel() which is
    completely removed in 'net-next'.

    4) In net_namespace.c, the rtnl_net_fill() call for GET operations
    had the command value fixed, meanwhile 'net-next' adjusted the
    argument signature a bit.

    This also matches example merge resolutions posted by Stephen
    Rothwell over the past two days.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 May, 2015

1 commit

  • In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
    parameters configured, then packets for the default VQ are not subjected
    to RED and are only dropped if the queue is larger than the net_device's
    tx_queue_len. This behavior is useful for WRED mode, since these packets
    will still influence the calculated average queue length and (therefore)
    the drop probability for all of the other VQs. However, for some drivers
    tx_queue_len is zero. In other cases the user may wish to make the limit
    the same for all VQs (including the default VQ with no drop parameters).

    This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
    in bytes, during qdisc setup. (This limit is in bytes to be consistent
    with the drop parameters.) The default limit is the same as for a bfifo
    queue (tx_queue_len * psched_mtu), as sketched below. If the drop
    parameters of any VQ are configured with a smaller limit than the GRED
    queue limit, that VQ will still observe the smaller limit instead.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
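
    A hedged sketch of the setup-time default described above (netlink
    attribute parsing elided):

        if (tb[TCA_GRED_LIMIT])
                sch->limit = nla_get_u32(tb[TCA_GRED_LIMIT]);
        else
                sch->limit = qdisc_dev(sch)->tx_queue_len
                             * psched_mtu(qdisc_dev(sch));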
     

12 May, 2015

2 commits

  • The only remaining enqueue_root() user is netem, and it looks
    unnecessary:

    qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In WRED mode, the backlog for a single virtual queue (VQ) should not be
    used to determine queue behavior; instead the backlog is summed across
    all VQs. This sum is currently used when calculating the average queue
    lengths. It also needs to be used when determining if the queue's hard
    limit has been reached, and when reporting each VQ's backlog via netlink
    (see the sketch below). q->backlog will only be used if the queue
    switches out of WRED mode.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
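
    One plausible shape for this, as a hedged sketch (the helper name
    is illustrative):

        /* in WRED mode all VQs share one accounting, so consult the
         * qdisc-wide backlog; otherwise use the per-VQ counter */
        static inline u32 gred_backlog(struct gred_sched *table,
                                       struct gred_sched_data *q,
                                       struct Qdisc *sch)
        {
                return gred_wred_mode(table) ? sch->qstats.backlog
                                             : q->backlog;
        }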
     

11 May, 2015

2 commits

  • Ingress qdisc has no other purpose than calling into tc_classify()
    that executes attached classifier(s) and action(s).

    It has a 1:1 relationship to dev->ingress_queue. After having commit
    087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed
    the central ingress lock, one major contention point is gone.

    The extra indirection layers, however, are not necessary for calling
    into ingress qdisc. pktgen calling locally into netif_receive_skb()
    with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
    E3-1240: before ~21.1 Mpps, after patch ~22.9 Mpps.

    We can redirect the private classifier list to the netdev directly,
    without changing any classifier API bits (!) and execute on that from
    the handle_ing() side (see the sketch at the end of this entry). The
    __QDISC_STATE_DEACTIVATED test can be removed; ingress qdisc doesn't
    have a queue and thus dev_deactivate_queue() is also not applicable;
    ingress_cl_list provides similar behaviour.
    In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.

    One next possible step is the removal of the dev's ingress (dummy)
    netdev_queue, and to only have the list member in the netdevice
    itself.

    Note, the filter chain is RCU protected and individual filter elements
    are being kfree'd by sched subsystem after RCU grace period. RCU read
    lock is being held by __netif_receive_skb_core().

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
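
    A hedged sketch of the direct execution path (heavily simplified;
    the real handle_ing() also deals with actions and statistics):

        static inline struct sk_buff *
        handle_ing(struct sk_buff *skb, struct packet_type **pt_prev,
                   int *ret, struct net_device *orig_dev)
        {
                /* the filter chain now hangs off the netdev itself,
                 * protected by the RCU read lock already held in
                 * __netif_receive_skb_core() */
                struct tcf_proto *cl =
                        rcu_dereference_bh(skb->dev->ingress_cl_list);

                if (!cl)
                        return skb;

                /* ... run tc_classify() on cl and act on the verdict ... */
                return skb;
        }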
     
  • For DCTCP or similar ECN based deployments on fabrics with shallow
    buffers, hosts are responsible for a good part of the buffering.

    This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
    so that DCTCP can have feedback from queuing in the host.

    A DCTCP enabled egress port simply has a queue occupancy threshold
    above which ECT packets get CE marked.

    In codel language this translates to a sojourn time, so that one doesn't
    have to worry about bytes or bandwidth, only delays (see the sketch at
    the end of this entry).

    This makes the host an active participant in the health of the whole
    network.

    This also helps with experimenting with DCTCP on a setup without a
    DCTCP compliant fabric.

    In the following example, ce_threshold is set to 1ms, and we can see from
    'ldelay xxx us' that TCP is not trying to go around the 5ms codel
    target.

    Queue has more capacity to absorb inelastic bursts (say from UDP
    traffic), as queues are maintained to an optimal level.

    lpaa23:~# ./tc -s -d qd sh dev eth1
    qdisc mq 1: dev eth1 root
    Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
    backlog 3108242b 364p requeues 42961
    qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
    rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
    count 0 lastcount 0 ldelay 1.0ms drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
    qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
    rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
    count 0 lastcount 0 ldelay 694us drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
    qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
    rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
    count 0 lastcount 0 ldelay 889us drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
    ...

    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Daniel Borkmann
    Cc: Glenn Judd
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
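
    A minimal sketch of the marking rule, assuming codel's usual
    sojourn-time bookkeeping (field names approximate):

        /* at dequeue: CE-mark any ECT packet that sat in the queue
         * longer than ce_threshold; the normal target/interval drop
         * law is unchanged and still applies on top of this */
        if (skb && codel_time_after(now - codel_get_enqueue_time(skb),
                                    params->ce_threshold))
                INET_ECN_set_ce(skb);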
     

10 May, 2015

1 commit

  • When tcf_destroy() returns true, tp could already be destroyed, so
    we should not use tp->next after that (see the sketch at the end of
    this entry).

    For the long term, we probably should move the tp list to a list_head.

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
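
    One hedged sketch of the pattern (surrounding filter-walk code
    simplified, variable names illustrative):

        /* read the next pointer before tcf_destroy() can free tp */
        struct tcf_proto *next = rtnl_dereference(tp->next);

        if (tcf_destroy(tp, false))
                RCU_INIT_POINTER(*back, next);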
     

04 May, 2015

7 commits

  • Call make_flow_keys_digest to get a digest from flow keys, and
    use that to pass in the skbuff cb and for comparing flows.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand (the same pattern as the next three commits; a sketch
    follows this entry).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
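
    The replacement pattern shared by these four qdisc conversions, as a
    hedged sketch (the perturbation field name varies per qdisc):

        /* before: open-coded dissect plus a hand-rolled jhash over
         * selected fields */
        struct flow_keys keys;

        skb_flow_dissect(skb, &keys);
        hash = jhash_3words((__force u32)keys.dst,
                            (__force u32)keys.src,
                            (__force u32)keys.ports, q->perturbation);

        /* after: one helper that dissects and hashes with a perturbed
         * key, so all callers share one canonical flow hash */
        hash = skb_get_hash_perturb(skb, q->perturbation);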
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • TC classifiers/actions were converted to RCU by John in the series:
    http://thread.gmane.org/gmane.linux.network/329739/focus=329739
    and many follow on patches.
    This is the last patch from that series that finally drops
    ingress spin_lock.

    Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

    In the two cpu case, when both cores are receiving traffic on the same
    device and go into the same ingress+u32, the performance jumps
    from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Jamal Hadi Salim
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • In the presence of TSO/GSO/GRO packets, codel at low rates can be quite
    useless. In the following example, not a single packet was ever dropped,
    while the average delay in the codel queue is ~100 ms!

    qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
    Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 13626b 3p requeues 0
    count 0 lastcount 0 ldelay 96.9ms drop_next 0us
    maxpacket 9084 ecn_mark 0 drop_overlimit 0

    This comes from a confusion about what the minimal backlog should be.
    It is pretty clear it is not 64KB or whatever max GSO packet ever
    reached the qdisc.

    codel's intent was to use the MTU of the device (see the sketch at the
    end of this entry).

    After the fix, we finally drop some packets, and rtt/cwnd of my single
    TCP flow are meeting our expectations.

    qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
    Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
    backlog 6056b 3p requeues 0
    count 1 lastcount 1 ldelay 36.3ms drop_next 0us
    maxpacket 10598 ecn_mark 0 drop_overlimit 0

    Signed-off-by: Eric Dumazet
    Cc: Kathleen Nichols
    Cc: Dave Taht
    Cc: Van Jacobson
    Signed-off-by: David S. Miller

    Eric Dumazet
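
    A hedged sketch of the adjusted emptiness test (simplified from
    codel_should_drop()):

        /* remember the device MTU at setup time */
        params->mtu = psched_mtu(qdisc_dev(sch));

        /* treat the queue as uncongested once the backlog is below
         * one MTU, instead of below one max-sized (GSO) packet */
        if (codel_time_before(vars->ldelay, params->target) ||
            sch->qstats.backlog <= params->mtu) {
                /* went below - stay below for at least interval */
                vars->first_above_time = 0;
                return false;
        }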
     

03 May, 2015

1 commit

  • Not used.

    pedit sets TC_MUNGED when packet content was altered, but all the core
    does is unset MUNGED again and then set OK2MUNGE.

    And the latter isn't tested anywhere. So let's remove both
    TC_MUNGED and TC_OK2MUNGE.

    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
     

18 Apr, 2015

1 commit

  • When you redirect a VLAN device to any device, you end up with
    crap in af_packet on the xmit path, because hard_header_len is
    not equal to skb->mac_len. So the redirected packet contains
    four extra bytes at the start, which then get interpreted as
    part of the MAC address.

    This patch fixes this by only pushing skb->mac_len (see the sketch
    at the end of this entry). We also need to fix ifb because it tries
    to undo the pushing done by act_mirred.

    Signed-off-by: Herbert Xu
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Herbert Xu
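
    A minimal before/after sketch of the push in act_mirred:

        /* before: pushes hard_header_len, which for a VLAN device
         * counts 4 bytes that the skb's MAC header doesn't have */
        skb_push(skb2, skb2->dev->hard_header_len);

        /* after: push exactly the MAC header that is really there */
        skb_push(skb2, skb->mac_len);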
     

17 Apr, 2015

1 commit

  • For the short-term solution, let's fix the bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data, in order to
    make the same eBPF programs with cls_bpf and act_bpf work on both the
    ingress and egress qdisc paths. We need to ensure that mac_header is
    set before calling into programs. This is effectively the first option
    from the below referenced discussion.

    A more long-term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this, and will be implemented
    later, as it's too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb(), as has been suggested
    as the second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    For dealing with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwriteable
    headers.

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

14 Apr, 2015

1 commit

  • Even if we make use of classifier and actions from the egress
    path, we're going into handle_ing() executing additional code
    on a per-packet cost for ingress qdisc, just to realize that
    nothing is attached on ingress.

    Instead, this can just be blinded out as a no-op entirely with
    the use of a static key. On input fast-path, we already make
    use of static keys in various places, e.g. skb time stamping,
    in RPS, etc. It makes sense to not waste time when we're assured
    that no ingress qdisc is attached anywhere.

    Enabling/disabling of that code path is done via two
    helpers, namely net_{inc,dec}_ingress_queue(), which are
    invoked under the RTNL mutex when an ingress qdisc is being either
    initialized or destructed. A sketch follows this entry.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
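
    A hedged sketch of the static-key gating (helper names from the
    commit text; the receive-path code is simplified):

        static struct static_key ingress_needed __read_mostly;

        void net_inc_ingress_queue(void)
        {
                static_key_slow_inc(&ingress_needed);
        }

        void net_dec_ingress_queue(void)
        {
                static_key_slow_dec(&ingress_needed);
        }

        /* in __netif_receive_skb_core(): the branch is patched out at
         * runtime unless at least one ingress qdisc exists */
        if (static_key_false(&ingress_needed)) {
                skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
                if (!skb)
                        goto out;
        }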
     

08 Apr, 2015

1 commit

  • Fixes byte backlog accounting for the first of two chained netem instances.
    Bytes backlog reported now corresponds to the number of queued packets.

    When two netem instances are chained, for instance to apply rate and queue
    limitation followed by packet delay, the number of backlogged bytes reported
    by the first netem instance is wrong. It reports the sum of bytes in the queues
    of the first and second netem. The first netem reports the correct number of
    backlogged packets but not bytes. This is shown in the example below.

    Consider a chain of two netem schedulers created using the following commands:

    $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
    $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms

    Start an iperf session to send packets out on the specified interface and
    monitor the backlog using tc:

    $ tc -s qdisc show dev veth2

    Output using unpatched netem:
    qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
    Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
    backlog 172694b 73p requeues 0
    qdisc netem 2: parent 1: limit 1000 delay 50.0ms
    Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 63588b 42p requeues 0

    The interface used to produce this output has an MTU of 1500. The output for
    backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
    the total number of sent bytes and packets. By dividing the number of sent
    bytes by the number of sent packets, we get an average packet size of ~=1504.
    If we divide the number of backlogged bytes by packets, we get ~=2365. This is
    due to the first netem incorrectly counting the 63588b which are in netem 2's
    queue as being in its own queue. To verify this is the case, we subtract them
    from the reported value and divide by the number of packets as follows:
    172694 - 63588 = 109106 bytes actually backlogged in netem 1
    109106 / 73 packets ~= 1494 bytes (which matches our MTU)

    The root cause is that the byte accounting is not done at the
    same time as the packet accounting. The solution is to update the
    backlog value every time the packet queue is updated (see the sketch
    at the end of this entry).

    Signed-off-by: Joseph D Beshay
    Acked-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Beshay, Joseph
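
    A hedged sketch of the accounting rule (exact call sites in
    netem_enqueue()/netem_dequeue() elided):

        /* whenever a packet leaves this netem's tfifo, its bytes must
         * leave the byte backlog together with the packet count */
        sch->q.qlen--;
        sch->qstats.backlog -= qdisc_pkt_len(skb);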
     

21 Mar, 2015

2 commits

  • This work extends the "classic" BPF programmable tc action by extending
    its scope also to native eBPF code!

    Together with commit e2e9b6541dd4 ("cls_bpf: add initial eBPF support
    for programmable classifiers"), this adds the facility to implement fully
    flexible classifiers and actions for tc that can be implemented in a C
    subset in user space, "safely" loaded into the kernel, and run at
    native speed when JITed.

    Also, since eBPF maps can be shared between eBPF programs, it offers the
    possibility that cls_bpf and act_bpf can share data 1) between themselves
    and 2) between user space applications. That means that, e.g., customized
    runtime statistics can be collected in user space, but also, more
    importantly, classifier and action behaviour could be altered based on
    map input from the user space application.

    For the remaining details on the workflow and integration, see the cls_bpf
    commit e2e9b6541dd4. Preliminary iproute2 part can be found under [1].

    [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act

    Signed-off-by: Daniel Borkmann
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Acked-by: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. It
    split functions in half and moved them around, when the real actual
    conflict only existed solely inside of one function, that being
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easy to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Mar, 2015

1 commit

  • Revisiting commit d23b8ad8ab23 ("tc: add BPF based action") with regard
    to eBPF support, I was thinking that it might be better to improve the
    return semantics of a BPF program invoked through BPF_PROG_RUN().

    Currently, in case filter_res is 0, we overwrite the default action
    opcode with TC_ACT_SHOT. A default action opcode configured through tc's
    m_bpf can be: TC_ACT_RECLASSIFY, TC_ACT_PIPE, TC_ACT_SHOT, TC_ACT_UNSPEC,
    TC_ACT_OK.

    In cls_bpf, we have the possibility to overwrite the default class
    associated with the classifier in case filter_res is _not_ 0xffffffff
    (-1).

    That allows us to fold multiple [e]BPF programs into a single one, where
    they would otherwise need to be defined as a separate classifier with
    its own classid, needlessly redoing parsing work, etc.

    Similarly, we could do better in act_bpf: since the above TC_ACT* opcodes
    are exported to UAPI anyway, we reuse them for the return-code-to-tc-opcode
    mapping, where we allow the above possibilities. Thus, as in cls_bpf,
    a filter_res of 0xffffffff (-1) means that the configured _default_ action
    is used. Any unknown return code from the BPF program would fail in
    tcf_bpf() with TC_ACT_UNSPEC (see the sketch at the end of this entry).

    Should we one day want to make use of TC_ACT_STOLEN or TC_ACT_QUEUED,
    which both have the same semantics, we have the option to either use
    that as a default action (filter_res of 0xffffffff) or non-default BPF
    return code.

    All that will allow us to transparently use tcf_bpf() for both BPF
    flavours.

    Signed-off-by: Daniel Borkmann
    Cc: Jiri Pirko
    Cc: Alexei Starovoitov
    Cc: Jamal Hadi Salim
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
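
    A hedged sketch of the mapping described above (simplified from what
    tcf_bpf() would look like under this scheme):

        filter_res = BPF_PROG_RUN(prog, skb);

        switch (filter_res) {
        case TC_ACT_PIPE:
        case TC_ACT_RECLASSIFY:
        case TC_ACT_OK:
        case TC_ACT_SHOT:
                action = filter_res;            /* explicit opcode */
                break;
        case TC_ACT_UNSPEC:                     /* 0xffffffff / -1 */
                action = tcf_action;            /* configured default */
                break;
        default:
                action = TC_ACT_UNSPEC;         /* unknown return code */
                break;
        }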
     

13 Mar, 2015

1 commit

  • Currently, it is possible in cls_bpf to access eBPF maps only under
    rcu_read_lock_bh() variants: on the ingress side, that is, handle_ing(),
    the classifier would be called from __netif_receive_skb_core() under
    rcu_read_lock(); on the egress side, however, it's rcu_read_lock_bh()
    via __dev_queue_xmit().

    This rcu/rcu_bh mix doesn't work together with eBPF maps, as they
    require being called solely under rcu_read_lock(). eBPF maps could also
    be shared among various other eBPF programs (possibly even with other
    eBPF program types, f.e. tracing) and user space processes, so any
    context is assumed.

    Therefore, a possible fix for cls_bpf is to wrap/nest the eBPF program
    invocation under the non-bh RCU lock variant (see the sketch at the end
    of this entry).

    Fixes: e2e9b6541dd4 ("cls_bpf: add initial eBPF support for programmable classifiers")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
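
    A minimal sketch of the wrapping (classification details elided):

        /* egress already runs under rcu_read_lock_bh(); nesting the
         * plain read lock makes map access valid on both paths */
        rcu_read_lock();
        filter_res = BPF_PROG_RUN(prog->filter, skb);
        rcu_read_unlock();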