08 Jan, 2014

7 commits

  • Multi-family tables need the AF from the hook ops. Add a pointer to the
    hook ops and replace usage of the hooknum member in struct nft_pktinfo.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • Currently the AF-specific hook functions override the chain-type specific
    hook functions. That doesn't make too much sense since the chain types
    are a special case of the AF-specific hooks.

    Make the AF-specific hook functions the default and make the optional
    chain type hooks override them.

    As a side effect, the necessary code restructuring reduces the code size,
    for instance in the case of nf_tables_ipv4.o:

    nf_tables_ipv4_init_net | -24
    nft_do_chain_ipv4 | -113
    2 functions changed, 137 bytes removed, diff: -137

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • net/netfilter/nft_reject.c: In function 'nft_reject_eval':
    net/netfilter/nft_reject.c:37:14: warning: unused variable 'net' [-Wunused-variable]

    Reported-by: kbuild test robot
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • There are many cases where this feature does not improve performance or even
    reduces it.

    For example, here are the results from tests that I've run using 3.12.6 on one
    Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
    from the Xeon, but they're similar on the i7. All numbers report the
    mean±stddev over 10 runs of 10s.

    1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
    copy from user on transmit"
    There is no statistically significant difference between tx-nocache-copy
    on/off.
    nic irqs spread out (one queue per cpu)

    200x netperf -r 1400,1
    tx-nocache-copy off
    692000±1000 tps
    50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
    tx-nocache-copy on
    693000±1000 tps
    50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

    200x netperf -r 14000,14000
    tx-nocache-copy off
    86450±80 tps
    50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
    tx-nocache-copy on
    86110±60 tps
    50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

    2) single stream throughput tests
    tx-nocache-copy leads to higher service demand

                         throughput  cpu0       cpu1         demand
                         (Mb/s)      (Gcycle)   (Gcycle)     (cycle/B)

    nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

    tx-nocache-copy off  9402±5      9.4±0.2                 0.80±0.01
    tx-nocache-copy on   9403±3      9.85±0.04               0.838±0.004

    nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

    tx-nocache-copy off  9401±5      5.83±0.03  5.0±0.1      0.923±0.007
    tx-nocache-copy on   9404±2      5.74±0.03  5.523±0.009  0.958±0.002
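
    The demand column can be approximately re-derived from the other columns.
    The sketch below is not part of the original test setup. It assumes that
    the throughput figures are in Mb/s (values around 9400 on a 10GbE ixgbe
    link), that each run lasts 10 s, and that demand is total cycles divided
    by bytes sent. The function name service_demand is made up for this
    illustration.

```python
# Sketch: re-derive "demand (cycle/B)" from the table, under stated assumptions.
RUN_SECONDS = 10  # assumption: each run lasts 10 s ("10 runs of 10s")


def service_demand(throughput_mbps, gcycles_per_cpu, seconds=RUN_SECONDS):
    """Return CPU cycles spent per byte sent (hypothetical helper)."""
    bytes_sent = throughput_mbps * 1e6 * seconds / 8
    total_cycles = sum(gcycles_per_cpu) * 1e9
    return total_cycles / bytes_sent


# Mean values copied from the table; the cpu1 column is empty in the first case.
same_cpu_off = service_demand(9402, [9.4])
same_cpu_on = service_demand(9403, [9.85])
split_cpu_off = service_demand(9401, [5.83, 5.0])
split_cpu_on = service_demand(9404, [5.74, 5.523])

for label, value in [
    ("irqs+netperf on cpu0, off", same_cpu_off),
    ("irqs+netperf on cpu0, on", same_cpu_on),
    ("irqs cpu0 / netperf cpu1, off", split_cpu_off),
    ("irqs cpu0 / netperf cpu1, on", split_cpu_on),
]:
    print(f"{label}: {value:.3f} cycle/B")
```

    Under these assumptions the computed values land within rounding distance
    of the reported ones, and in both placements the "on" case costs more
    cycles per byte than the "off" case.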

    As a second example, here are some results from Eric Dumazet with latest
    net-next.
    tx-nocache-copy also leads to higher service demand

    (cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)

    lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
    lpq83:~# perf stat ./netperf -H lpq84 -c
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
    Recv Send Send Utilization Service Demand
    Socket Socket Message Elapsed Send Recv Send Recv
    Size Size Size Time Throughput local remote local remote
    bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB

    87380 16384 16384 10.00 9407.44 2.50 -1.00 0.522 -1.000

    Performance counter stats for './netperf -H lpq84 -c':

    4282.648396 task-clock # 0.423 CPUs utilized
    9,348 context-switches # 0.002 M/sec
    88 CPU-migrations # 0.021 K/sec
    355 page-faults # 0.083 K/sec
    11,812,797,651 cycles # 2.758 GHz [82.79%]
    9,020,522,817 stalled-cycles-frontend # 76.36% frontend cycles idle [82.54%]
    4,579,889,681 stalled-cycles-backend # 38.77% backend cycles idle [67.33%]
    6,053,172,792 instructions # 0.51 insns per cycle
    # 1.49 stalled cycles per insn [83.64%]
    597,275,583 branches # 139.464 M/sec [83.70%]
    8,960,541 branch-misses # 1.50% of all branches [83.65%]

    10.128990264 seconds time elapsed

    lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
    lpq83:~# perf stat ./netperf -H lpq84 -c
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
    Recv Send Send Utilization Service Demand
    Socket Socket Message Elapsed Send Recv Send Recv
    Size Size Size Time Throughput local remote local remote
    bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB

    87380 16384 16384 10.00 9412.45 2.15 -1.00 0.449 -1.000

    Performance counter stats for './netperf -H lpq84 -c':

    2847.375441 task-clock # 0.281 CPUs utilized
    11,632 context-switches # 0.004 M/sec
    49 CPU-migrations # 0.017 K/sec
    354 page-faults # 0.124 K/sec
    7,646,889,749 cycles # 2.686 GHz [83.34%]
    6,115,050,032 stalled-cycles-frontend # 79.97% frontend cycles idle [83.31%]
    1,726,460,071 stalled-cycles-backend # 22.58% backend cycles idle [66.55%]
    2,079,702,453 instructions # 0.27 insns per cycle
    # 2.94 stalled cycles per insn [83.22%]
    363,773,213 branches # 127.757 M/sec [83.29%]
    4,242,732 branch-misses # 1.17% of all branches [83.51%]

    10.128449949 seconds time elapsed
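
    The two perf stat runs can be compared directly. The sketch below only
    reuses counters printed above; the dictionary layout and the helper name
    ipc are choices made for this illustration, not perf or netperf APIs.

```python
# Sketch: compare the two perf stat runs using the counters printed above.
nocache_on = {
    "cycles": 11_812_797_651,
    "instructions": 6_053_172_792,
    "send_demand_us_per_kb": 0.522,
}
nocache_off = {
    "cycles": 7_646_889_749,
    "instructions": 2_079_702_453,
    "send_demand_us_per_kb": 0.449,
}


def ipc(stats):
    """Instructions per cycle, as perf reports in its 'insns per cycle' note."""
    return stats["instructions"] / stats["cycles"]


cycle_ratio = nocache_on["cycles"] / nocache_off["cycles"]
insn_ratio = nocache_on["instructions"] / nocache_off["instructions"]
demand_increase_pct = (
    nocache_on["send_demand_us_per_kb"] / nocache_off["send_demand_us_per_kb"] - 1
) * 100

print(f"IPC on/off: {ipc(nocache_on):.2f} / {ipc(nocache_off):.2f}")
print(f"cycles on/off ratio: {cycle_ratio:.2f}")
print(f"instructions on/off ratio: {insn_ratio:.2f}")
print(f"send service demand increase: {demand_increase_pct:.1f}%")
```

    The throughput is nearly identical in both runs, so the extra cycles and
    instructions with tx-nocache-copy on show up as higher service demand.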

    CC: Tom Herbert
    Signed-off-by: Benjamin Poirier
    Signed-off-by: David S. Miller

    Benjamin Poirier
     
  • When lo is brought up, a new ifa is created. The devconf and neigh value
    bitfields should then be set so that later changes to the default values
    do not affect lo's values.

    Note that ipv6 behaves the same way. Also note that this is likely
    not an issue in many distros (for example Fedora 19) because userspace
    sets an address on lo manually before bringing it up.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This change makes it possible to follow a recommendation of RFC4942.

    - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
    as source addresses for ICMPv6 echo reply. This sysctl is false by default
    to preserve existing behavior.
    - Add inline check ipv6_anycast_destination().
    - Use them in icmpv6_echo_reply().

    Reference:
    RFC4942 - IPv6 Transition/Coexistence Security Considerations
    (http://tools.ietf.org/html/rfc4942#section-2.1.6)

    2.1.6. Anycast Traffic Identification and Security

    [...]
    To avoid exposing knowledge about the internal structure of the
    network, it is recommended that anycast servers now take advantage of
    the ability to return responses with the anycast address as the
    source address if possible.

    Signed-off-by: Francois-Xavier Le Bail
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    FX Le Bail
     
  • Fix to return a negative error code from the error handling
    case instead of 0.

    Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling')
    Signed-off-by: Wei Yongjun
    Signed-off-by: David S. Miller

    Wei Yongjun
     

07 Jan, 2014

33 commits

  • - Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
    - Adjust the alignment of some messages.
    - Remove the trailing period.
    - Add the missing terminating newline to some messages.

    Signed-off-by: Hayes Wang
    Signed-off-by: David S. Miller

    Hayes Wang
     
  • > as reported for linux-next of Dec.20, 2013
    > when CONFIG_NEED_DMA_MAP_STATE is not enabled:
    >
    > drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
    > drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'

    Reported-by: Randy Dunlap
    Signed-off-by: David S. Miller

    David S. Miller
     
  • GRO/GSO layers can be enabled on a node, even if said
    node is only forwarding packets.

    This patch permits GSO (and upcoming GRO) support for GRE
    encapsulated packets, even if the host has no GRE tunnel setup.

    Signed-off-by: Eric Dumazet
    Cc: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This fixes some typos found by Sergei.

    Reported-by: Sergei Shtylyov
    Signed-off-by: Hauke Mehrtens
    Signed-off-by: David S. Miller

    Hauke Mehrtens
     
  • Jesse Gross says:

    ====================
    [GIT net-next] Open vSwitch

    Open vSwitch changes for net-next/3.14. Highlights are:
    * Performance improvements in the mechanism to get packets to userspace
    using memory mapped netlink and skb zero copy where appropriate.
    * Per-cpu flow stats in situations where flows are likely to be shared
    across CPUs. Standard flow stats are used in other situations to save
    memory and allocation time.
    * A handful of code cleanups and rationalization.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Several functions and data structures could be local.
    Found with 'make namespacecheck'.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Jesse Gross

    Stephen Hemminger
     
  • The copy & csum optimization is no longer present with zerocopy
    enabled. Compute the checksum in skb_gso_segment() directly by
    dropping the HW CSUM capability from the features passed in.

    Signed-off-by: Thomas Graf
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Use of skb_zerocopy() can avoid the expensive call to memcpy()
    when copying the packet data into the Netlink skb. Completes
    checksum through skb_checksum_help() if not already done in
    GSO segmentation.

    Zerocopy is only performed if user space supports unaligned
    Netlink messages. Memory mapped Netlink i/o is preferred over
    zerocopy if it is set up.

    Cost of upcall is significantly reduced from:
    + 7.48% vhost-8471 [k] memcpy
    + 5.57% ovs-vswitchd [k] memcpy
    + 2.81% vhost-8471 [k] csum_partial_copy_generic

    to:
    + 5.72% ovs-vswitchd [k] memcpy
    + 3.32% vhost-5153 [k] memcpy
    + 0.68% vhost-5153 [k] skb_zerocopy

    (megaflows disabled)

    Signed-off-by: Thomas Graf
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Allows removing the net and dp_ifindex arguments and simplifies the
    code.

    Signed-off-by: Thomas Graf
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Drop user features if an outdated user space instance that does not
    understand the concept of user_features attempted to create a new
    datapath.

    Signed-off-by: Thomas Graf
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Signed-off-by: Thomas Graf
    Reviewed-by: Daniel Borkmann
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Make the skb zerocopy logic written for nfnetlink queue available for
    use by other modules.

    Signed-off-by: Thomas Graf
    Reviewed-by: Daniel Borkmann
    Acked-by: David S. Miller
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Remove duplicated include.

    Signed-off-by: Wei Yongjun
    Signed-off-by: Jesse Gross

    Wei Yongjun
     
  • As we're only doing a kfree() anyway in the RCU callback, we can
    simply use kfree_rcu(), which does the same job, and remove the
    functions rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesse Gross

    Daniel Borkmann
     
  • With the mega flow implementation, an ovs flow can be shared between
    multiple CPUs, which makes stats updates a highly contended
    operation. This patch uses per-CPU stats in cases where a flow
    is likely to be shared (if there is a wildcard in the 5-tuple
    and it is therefore likely to be spread by RSS). In other situations,
    it uses the current strategy, saving memory and allocation time.
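
    The choice between the two strategies can be pictured with a small model.
    This is a sketch only, not the datapath code: the class and function
    names (SharedStats, PerCpuStats, make_flow_stats) are invented here, and
    Python cannot show the cache-line contention that motivates the change.
    It only shows the structure: one counter slot per CPU, summed on read.

```python
# Sketch: shared vs. per-CPU flow statistics (all names are hypothetical).
class SharedStats:
    """One counter pair updated by every CPU (the existing strategy)."""

    def __init__(self):
        self.packets = 0
        self.nbytes = 0

    def update(self, cpu, length):
        # The cpu argument is ignored: all CPUs write the same slot.
        self.packets += 1
        self.nbytes += length

    def read(self):
        return self.packets, self.nbytes


class PerCpuStats:
    """One counter pair per CPU; readers sum over all slots."""

    def __init__(self, n_cpus):
        self.slots = [[0, 0] for _ in range(n_cpus)]

    def update(self, cpu, length):
        slot = self.slots[cpu]  # each CPU only touches its own slot
        slot[0] += 1
        slot[1] += length

    def read(self):
        return (sum(s[0] for s in self.slots), sum(s[1] for s in self.slots))


def make_flow_stats(wildcard_in_5tuple, n_cpus):
    """Pay for per-CPU storage only when the flow is likely to be shared."""
    return PerCpuStats(n_cpus) if wildcard_in_5tuple else SharedStats()


stats = make_flow_stats(wildcard_in_5tuple=True, n_cpus=4)
stats.update(cpu=0, length=100)
stats.update(cpu=3, length=200)
print(stats.read())
```

    Exact-match flows keep the single counter pair, which is where the memory
    and allocation-time saving mentioned above comes from.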

    Signed-off-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Pravin B Shelar
     
  • Use memory mapped Netlink i/o for all unicast openvswitch
    communication if a ring has been set up.

    Benchmark
    * pktgen -> ovs internal port
    * 5M pkts, 5M flows
    * 4 threads, 8 cores

    Before:
    Result: OK: 67418743(c67108212+d310530) usec, 5000000 (9000byte,0frags)
    74163pps 5339Mb/sec (5339736000bps) errors: 0
    + 2.98% ovs-vswitchd [k] copy_user_generic_string
    + 2.49% ovs-vswitchd [k] memcpy
    + 1.84% kpktgend_2 [k] memcpy
    + 1.81% kpktgend_1 [k] memcpy
    + 1.81% kpktgend_3 [k] memcpy
    + 1.78% kpktgend_0 [k] memcpy

    After:
    Result: OK: 24229690(c24127165+d102524) usec, 5000000 (9000byte,0frags)
    206358pps 14857Mb/sec (14857776000bps) errors: 0
    + 2.80% ovs-vswitchd [k] memcpy
    + 1.31% kpktgend_2 [k] memcpy
    + 1.23% kpktgend_0 [k] memcpy
    + 1.09% kpktgend_1 [k] memcpy
    + 1.04% kpktgend_3 [k] memcpy
    + 0.96% ovs-vswitchd [k] copy_user_generic_string
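
    The pps and bps figures in the two Result lines follow from the elapsed
    time, packet count and packet size. The sketch below recomputes them from
    the numbers printed above; the helper name rates is made up for this
    illustration, and truncating pps to an integer before computing bps is an
    assumption that happens to reproduce the printed values.

```python
# Sketch: recompute the pktgen rates from the Result lines above.
PACKETS = 5_000_000
PACKET_BYTES = 9000


def rates(elapsed_usec):
    """Return (packets per second, bits per second) for one run."""
    pps = int(PACKETS / (elapsed_usec / 1e6))  # assumption: pps is truncated
    bps = pps * PACKET_BYTES * 8
    return pps, bps


before_pps, before_bps = rates(67_418_743)
after_pps, after_bps = rates(24_229_690)
speedup = after_pps / before_pps

print(f"before: {before_pps} pps, {before_bps} bps")
print(f"after:  {after_pps} pps, {after_bps} bps")
print(f"speedup: {speedup:.2f}x")
```

    With these inputs the packet rate rises by a factor of roughly 2.8.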

    Signed-off-by: Thomas Graf
    Reviewed-by: Daniel Borkmann
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • An insufficient ring frame size configuration can lead to an
    unnecessary skb allocation for every Netlink message. Check the frame
    size before taking the queue lock and allocating the skb, and
    re-check with the lock held to be safe.
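
    The check-then-recheck pattern described here can be sketched in user
    space. This is a simplified model, not the Netlink code: Ring,
    alloc_message and the "fallback" and "ring" labels are invented for the
    illustration, and a threading.Lock stands in for the queue lock.

```python
# Sketch: cheap unlocked check first, authoritative re-check under the lock.
import threading


class Ring:
    """Hypothetical stand-in for a memory mapped ring with fixed-size frames."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.lock = threading.Lock()
        self.frames = []


def alloc_message(ring, size):
    # Unlocked check: a message that cannot fit skips the lock and the
    # ring allocation entirely.
    if size > ring.frame_size:
        return "fallback", bytearray(size)

    with ring.lock:
        # Re-check with the lock held, in case the ring configuration
        # changed between the first check and taking the lock.
        if size > ring.frame_size:
            return "fallback", bytearray(size)
        frame = bytearray(ring.frame_size)
        ring.frames.append(frame)
        return "ring", frame


ring = Ring(frame_size=64)
print(alloc_message(ring, 128)[0])
print(alloc_message(ring, 32)[0])
```

    The first check is only an optimization; correctness rests on the second
    check made under the lock.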

    Signed-off-by: Thomas Graf
    Reviewed-by: Daniel Borkmann
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Allocates a new sk_buff large enough to cover the specified payload
    plus required Netlink headers. Will check receiving socket for
    memory mapped i/o capability and use it if enabled. Will fall back
    to non-mapped skb if message size exceeds the frame size of the ring.

    Signed-off-by: Thomas Graf
    Reviewed-by: Daniel Borkmann
    Signed-off-by: Jesse Gross

    Thomas Graf
     
  • Flow lookup can happen either in packet processing context or userspace
    context but it was annotated as requiring RCU read lock to be held. This
    also allows OVS mutex to be held without causing warnings.

    Reported-by: Justin Pettit
    Signed-off-by: Jesse Gross
    Reviewed-by: Thomas Graf

    Jesse Gross
     
  • API changes only for code readability. No functional changes.

    This patch removes the underscored version. Added a new API
    ovs_flow_tbl_lookup_stats() that returns the n_mask_hits.

    Reported-by: Ben Pfaff
    Reviewed-by: Thomas Graf
    Signed-off-by: Andy Zhou
    Signed-off-by: Jesse Gross

    Andy Zhou
     
  • We won't normally have a ton of flow masks but using a size_t to store
    values no bigger than sizeof(struct sw_flow_key) seems excessive.

    This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit
    systems. On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but
    sw_flow_mask only by 8 bytes due to padding.
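
    The sw_flow_key_range numbers can be checked with a layout model. The
    sketch assumes the struct holds two offsets named start and end and that
    the replacement type is unsigned short; neither detail is stated above,
    so both are assumptions. It uses ctypes only to compute struct sizes on
    the machine it runs on.

```python
# Sketch: size of a two-field range struct with size_t vs. unsigned short.
import ctypes


class KeyRangeOld(ctypes.Structure):
    # Assumed old layout: two size_t offsets.
    _fields_ = [("start", ctypes.c_size_t), ("end", ctypes.c_size_t)]


class KeyRangeNew(ctypes.Structure):
    # Assumed new layout: two unsigned short offsets.
    _fields_ = [("start", ctypes.c_ushort), ("end", ctypes.c_ushort)]


old_size = ctypes.sizeof(KeyRangeOld)
new_size = ctypes.sizeof(KeyRangeNew)
saved = old_size - new_size

print(f"old: {old_size} bytes, new: {new_size} bytes, saved: {saved} bytes")
```

    On a platform with an 8-byte size_t this gives 16 - 4 = 12 bytes saved,
    and with a 4-byte size_t it gives 8 - 4 = 4 bytes, matching the figures
    quoted for sw_flow_key_range.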

    Compile tested only.

    Signed-off-by: Ben Pfaff
    Acked-by: Andy Zhou
    Signed-off-by: Jesse Gross

    Ben Pfaff
     
  • Signed-off-by: Ben Pfaff
    Signed-off-by: Jesse Gross

    Ben Pfaff
     
  • Conflicts:
    drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
    net/ipv6/ip6_tunnel.c
    net/ipv6/ip6_vti.c

    ipv6 tunnel statistic bug fixes conflicting with consolidation into
    generic sw per-cpu net stats.

    qlogic conflict between queue counting bug fix and the addition
    of multiple MAC address support.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Action flushing misaccounting:
    account only for deleted actions.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Remove unnecessary checks for act->ops
    (suggested by Eric Dumazet).

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions.

    Signed-off-by: Scott Feldman
    Signed-off-by: David S. Miller

    sfeldma@cumulusnetworks.com
     
  • br_multicast_set_hash_max() is called from process context in
    net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.

    br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
    which can deadlock the CPU if a softirq that also tries to take the
    same lock interrupts br_multicast_set_hash_max() while the lock is
    held. This can happen quite easily when any of the bridge multicast
    timers expires, since the timer handlers try to take the same lock.

    The fix here is to use spin_lock_bh(), preventing other softirqs from
    executing on this CPU.

    Steps to reproduce:

    1. Create a bridge with several interfaces (I used 4).
    2. Set the "multicast query interval" to a low number, like 2.
    3. Enable the bridge as a multicast querier.
    4. Repeatedly set the bridge hash_max parameter via sysfs.

    # brctl addbr br0
    # brctl addif br0 eth1 eth2 eth3 eth4
    # brctl setmcqi br0 2
    # brctl setmcquerier br0 1

    # while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done

    Signed-off-by: Curt Brune
    Signed-off-by: Scott Feldman
    Signed-off-by: David S. Miller

    Curt Brune
     
  • Sathya Perla posted a patch trying to address the following problem:

    The vxlan driver sets itself as the socket owner for all the TX flows
    it encapsulates (using vxlan_set_owner()) and assigns its own skb
    destructor. This causes all tunneled traffic to land on only one TXQ,
    as all encapsulated skbs refer to the vxlan socket and not the original
    socket. Also, the vxlan skb destructor breaks some functionality for
    tunneled traffic, such as wmem accounting, TCP small queues and the
    FQ/pacing packet scheduler.

    I reworked Sathya's patch and added some explanations.

    vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair
    and gain better drop monitor accuracy, by calling kfree_skb() when
    appropriate.

    The UDP socket used by vxlan to perform encapsulation of xmit packets
    does not need to be alive while packets leave vxlan code. It's better
    to keep the original socket ownership to get proper feedback from the
    qdisc and NIC layers.

    We use skb->sk to

    A) control the amount of bytes/packets queued on behalf of a socket, but
    the prior vxlan code did the skb->sk transfer without any limit/control
    on the vxlan socket sk_sndbuf.

    B) security purposes (such as selinux) or netfilter uses, and I do not
    think anything is prepared to handle the vxlan stacked case in this area.

    By not changing ownership, vxlan tunnels behave like other tunnels.
    As Stephen mentioned, we might do the same change in L2TP.

    Reported-by: Sathya Perla
    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TCP out_of_order_queue lock is not used, as queue manipulation
    happens with socket lock held and we therefore use the lockless
    skb queue routines (as __skb_queue_head())

    We can use __skb_queue_head_init() instead of skb_queue_head_init()
    to make this more consistent.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • It does not make sense to create an anycast address for a /128 prefix.
    Suppress it.

    As 32019e651c6fce ("ipv6: Do not leave router anycast address for /127
    prefixes.") shows we also may not leave them, because we could accidentally
    remove an anycast address the user has allocated or got added via another
    prefix.

    Cc: François-Xavier Le Bail
    Cc: Thomas Haller
    Cc: Jiri Pirko
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • The existing serial state notification handling expected older Option
    devices, having a hardcoded assumption that the Modem port was always
    USB interface #2. That isn't true for devices from the past few years.

    hso_serial_state_notification is a local cache of a USB Communications
    Interface Class SERIAL_STATE notification from the device, and the
    USB CDC specification (section 6.3, table 67 "Class-Specific Notifications")
    defines wIndex as the USB interface the event applies to. For hso
    devices this will always be the Modem port, as the Modem port is the
    only port which is set up to receive them by the driver.

    So instead of always expecting USB interface #2, validate the
    notification against the actual USB interface number of the Modem port.

    Signed-off-by: Dan Williams
    Tested-by: H. Nikolaus Schaller
    Signed-off-by: David S. Miller

    Dan Williams
     
  • It returns 0 in case of success, !0 error otherwise. Fix the improper error
    verification.

    Fixes: 2c9839c143bbc ("bonding: add num_grat_arp attribute netlink support")
    CC: sfeldma@cumulusnetworks.com
    CC: Jay Vosburgh
    CC: Andy Gospodarek
    Signed-off-by: Veaceslav Falico
    Acked-by: Scott Feldman
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • Replace the boolean value with the error code for the return value
    of the rtl_ops_init().

    Signed-off-by: Hayes Wang
    Signed-off-by: David S. Miller

    hayeswang