08 Apr, 2020

1 commit


15 Mar, 2020

1 commit

  • If the AVX2 set is available, we can exploit the repetitive
    characteristic of this algorithm to provide a fast, vectorised
    version by using 256-bit wide AVX2 operations for bucket loads and
    bitwise intersections.

    In most cases, this implementation consistently outperforms rbtree
    set instances despite the fact they are configured to use a given,
    single, ranged data type out of the ones used for performance
    measurements by the nft_concat_range.sh kselftest.

    That script, injecting packets directly on the ingoing device path
    with pktgen, reports, averaged over five runs on a single AMD Epyc
    7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
    CONFIG_RETPOLINE was not set here.

    Note that this is not a fair comparison over hash and rbtree set
    types: non-ranged entries (used to have a reference for hash types)
    would be matched faster than this, and matching on a single field
    only (which is the case for rbtree) is also significantly faster.

    However, it's not possible at the moment to choose this set type
    for non-ranged entries, and the current implementation also needs
    a few minor adjustments in order to match on less than two fields.

    ---------------.-----------------------------------.------------.
    AMD Epyc 7402 | baselines, Mpps | this patch |
    1 thread |___________________________________|____________|
    3.35GHz | | | | | |
    768KiB L1D$ | netdev | hash | rbtree | | |
    ---------------| hook | no | single | | pipapo |
    type entries | drop | ranges | field | pipapo | AVX2 |
    ---------------|--------|--------|--------|--------|------------|
    net,port | | | | | |
    1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% |
    ---------------|--------|--------|--------|--------|------------|
    port,net | | | | | |
    100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% |
    ---------------|--------|--------|--------|--------|------------|
    net6,port | | | | | |
    1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% |
    ---------------|--------|--------|--------|--------|------------|
    port,proto | | | | | |
    30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% |
    ---------------|--------|--------|--------|--------|------------|
    net6,port,mac | | | | | |
    10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% |
    ---------------|--------|--------|--------|--------|------------|
    net6,port,mac, | | | | | |
    proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% |
    ---------------|--------|--------|--------|--------|------------|
    net,mac | | | | | |
    1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% |
    ---------------'--------'--------'--------'--------'------------'

    A similar strategy could be easily reused to implement specialised
    versions for other SIMD sets, and I plan to post at least a NEON
    version at a later time.

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio