18 Apr, 2017

1 commit

  • After doing map_perf_test with a much bigger
    BPF_F_NO_COMMON_LRU map, the perf report shows a
    lot of time spent in rotating the inactive list (i.e.
    __bpf_lru_list_rotate_inactive):
    > map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
    19644783 (19M/s)
    > map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
    6283930 (6.28M/s)

    An element sitting on the inactive list usually means it is not in
    the cache. Hence, there is a need to tune the PERCPU_NR_SCANS value.

    This patch finds a better number of elements to
    scan during each list rotation. The PERCPU_NR_SCANS (which
    is defined the same as PERCPU_FREE_TARGET) decreases
    from 16 elements to 4 elements. This change only
    affects the BPF_F_NO_COMMON_LRU map.
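
    As a point of reference, a minimal sketch of how the two macros relate,
    based only on the wording above; the header location and exact formatting
    are assumptions, not a quote from the tree:

    /* PERCPU_NR_SCANS tracks PERCPU_FREE_TARGET, so lowering the free
     * target also lowers the per-rotation scan budget (assumed layout). */
    #define PERCPU_FREE_TARGET   4   /* was 16 before this change */
    #define PERCPU_NR_SCANS      PERCPU_FREE_TARGET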

    The test_lru_dist benchmark does not show a meaningful difference
    between 16 and 4. Our production L4 load balancer, which uses
    the LRU map for conntrack-ing, also shows little change in cache
    hit rate. Since both the benchmark and the production data show no
    cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
    We can consider making it configurable if we later find a use case
    where another value works better and/or a different rotation
    strategy is needed.

    After this change:
    > map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
    9240324 (9.2M/s)

    i.e. 6.28M/s -> 9.2M/s

    The test_lru_dist has not shown a meaningful difference:
    > test_lru_dist zipf.100k.a1_01.out 4000 1:
    nr_misses: 31575 (Before) vs 31566 (After)

    > test_lru_dist zipf.100k.a0_01.out 40000 1
    nr_misses: 67036 (Before) vs 67031 (After)

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

11 Jan, 2017

2 commits

  • Make the functions __local_list_pop_free(), __local_list_pop_pending(),
    bpf_common_lru_populate() and bpf_percpu_lru_populate() static as they
    are not used outside of bpf_lru_list.c

    This fixes the following GCC warnings when building with 'W=1':

    kernel/bpf/bpf_lru_list.c:363:22: warning: no previous prototype for ‘__local_list_pop_free’ [-Wmissing-prototypes]
    kernel/bpf/bpf_lru_list.c:376:22: warning: no previous prototype for ‘__local_list_pop_pending’ [-Wmissing-prototypes]
    kernel/bpf/bpf_lru_list.c:560:6: warning: no previous prototype for ‘bpf_common_lru_populate’ [-Wmissing-prototypes]
    kernel/bpf/bpf_lru_list.c:577:6: warning: no previous prototype for ‘bpf_percpu_lru_populate’ [-Wmissing-prototypes]
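
    A hedged sketch of the change for one of the helpers; the parameter list
    is an assumption for illustration and the body is elided:

    /* Adding 'static' gives the definition internal linkage, so gcc no
     * longer expects a separate prototype and -Wmissing-prototypes (W=1)
     * stops firing.  Signature below is an assumption for illustration. */
    static struct bpf_lru_node *
    __local_list_pop_free(struct bpf_lru_locallist *loc_l)
    {
            return NULL;    /* real body elided in this sketch */
    }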

    Cc: Martin KaFai Lau
    Signed-off-by: Tobias Klauser
    Acked-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Tobias Klauser
     
  • Remove the unused but set variable 'first_node' in
    __bpf_lru_list_shrink_inactive() to fix the following GCC warning when
    building with 'W=1':

    kernel/bpf/bpf_lru_list.c:216:41: warning: variable ‘first_node’ set but not used [-Wunused-but-set-variable]

    Cc: Martin KaFai Lau
    Signed-off-by: Tobias Klauser
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Tobias Klauser
     

17 Nov, 2016

1 commit

  • gcc-6.2.1 gives the following warning:
    kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
    kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]

    The "next" is currently initialized in the while() loop which must have >=1
    iterations.

    This patch initializes next to get rid of the compiler warning.
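
    A generic, self-contained illustration of this warning class and the
    style of fix, not the kernel function itself:

    /* gcc cannot always prove that a while() loop runs at least once, so
     * a variable assigned only inside the loop looks possibly uninitialized
     * at a later use.  Giving it an initial value silences the warning
     * without changing behaviour when the loop really does run. */
    static int last_index(const int *arr, int n)
    {
            int i = 0, last = 0;    /* '= 0' is the kind of init the patch adds */

            while (i < n) {         /* by design, n >= 1 here */
                    last = i;
                    i++;
            }
            return arr[last];
    }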

    Fixes: 3a08c2fd7634 ("bpf: LRU List")
    Reported-by: David Miller
    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

16 Nov, 2016

2 commits

  • Instead of having a common LRU list, this patch allows a
    percpu LRU list, which can be selected by specifying a map
    attribute. The map attribute will be added in a later
    patch.

    While the common use case for an LRU is #reads >> #updates,
    a percpu LRU list allows a bpf prog to absorb an unusual number of
    #updates in pathological cases (e.g. an external-traffic-facing
    machine which could be under attack).

    Each percpu LRU is isolated from the others. The LRU nodes (including
    free nodes) cannot be moved across different LRU lists.
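
    For orientation, a hedged userspace sketch of selecting the percpu LRU
    via that map attribute, written against today's libbpf API
    (bpf_map_create) rather than the samples in this series; the map name
    and sizes are made up:

    #include <bpf/bpf.h>
    #include <linux/bpf.h>

    static int create_percpu_lru_map(void)
    {
            LIBBPF_OPTS(bpf_map_create_opts, opts,
                        .map_flags = BPF_F_NO_COMMON_LRU);

            /* BPF_MAP_TYPE_LRU_HASH + BPF_F_NO_COMMON_LRU selects the
             * percpu LRU lists described above. */
            return bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_example",
                                  sizeof(__u32), sizeof(__u64),
                                  10240, &opts);
    }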

    Here is the update performance comparison between the common
    LRU list and the percpu LRU list (the test code is
    in the last patch):

    [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
    ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
    1 cpus: 2934082 updates
    4 cpus: 7391434 updates
    8 cpus: 6500576 updates

    [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
    ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
    1 cpus: 2896553 updates
    4 cpus: 9766395 updates
    8 cpus: 17460553 updates

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Introduce bpf_lru_list which will provide LRU capability to
    the bpf_htab in the later patch.

    * General Thoughts:
    1. Target use case: reads are more frequent than updates
    (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
    If a bpf_prog does a bpf_lookup_elem() first and then an in-place
    update, it still counts as a read operation as far as the LRU list
    is concerned.
    2. It may be useful to think of it as an LRU cache.
    3. Optimize the read case
    3.1 No lock in the read case (see the sketch after this list)
    3.2 The LRU maintenance is only done during bpf_update_elem()
    4. A percpu LRU list loses the system-wide LRU property. Completely
    isolated percpu LRU lists give the best performance, but the memory
    utilization is not ideal if the workload is imbalanced across CPUs.
    5. Hence, this patch starts the LRU implementation with a global LRU
    list, with operations batched locally before the global LRU list is
    accessed. Since an LRU cache sees #reads >> #updates/#inserts, it
    should work well.
    6. There is a per-cpu local list named 'struct bpf_lru_locallist'.
    This local list is not used to maintain LRU ordering. Instead, it
    batches enough operations before acquiring the lock of the global
    LRU list. More details on this later.
    7. A later patch allows selecting a percpu LRU list through a
    map attribute, for scalability reasons and for use cases that need to
    prepare for the worst (and pathological) case, such as a DoS attack.
    The percpu LRU lists are completely isolated from each other and the
    LRU nodes (including free nodes) cannot be moved across lists. The
    following description is for the global LRU list but is mostly
    applicable to the percpu LRU list as well.
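
    To make the lock-free read path of point 3 concrete, here is a minimal
    sketch in plain C; the names and layout are an illustrative model, not
    the kernel definitions:

    #include <linux/list.h>

    /* Simplified model of an LRU node: the ref-bit is the only LRU state
     * touched on the read path. */
    struct lru_node {
            struct list_head list;  /* links the node into one sub-list */
            int ref;                /* set on lookup, consumed by rotation */
    };

    /* Called from the lookup path: a plain store, no list movement and no
     * lock.  All ordering work is deferred to bpf_update_elem() time. */
    static void lru_node_set_ref(struct lru_node *node)
    {
            if (!node->ref)
                    node->ref = 1;
    }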

    * Global LRU List:
    1. It has three sub-lists: active-list, inactive-list and free-list.
    2. The two list idea, active and inactive, is borrowed from the
    page cache.
    3. All nodes are pre-allocated and all sit at the free-list (of the
    global LRU list) at the beginning. The pre-allocation reasoning
    is similar to the existing BPF_MAP_TYPE_HASH. However,
    opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
    the LRU map.
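
    Continuing the simplified model, the three sub-lists described above can
    be pictured as one structure; this is an illustration, not the struct in
    bpf_lru_list.h:

    #include <linux/spinlock.h>

    enum lru_list_t { LRU_ACTIVE, LRU_INACTIVE, LRU_FREE, NR_LRU_LISTS };

    struct lru_list {
            struct list_head lists[NR_LRU_LISTS];   /* active, inactive, free */
            unsigned int counts[NR_LRU_LISTS];      /* spot "inactive running low" */
            raw_spinlock_t lock;                    /* the global lru_lock */
    };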

    * Active/Inactive List (of the global LRU list):
    1. The active list, as its name says, maintains the active set of
    nodes. We can think of it as the working set or the more frequently
    accessed nodes. The access frequency is approximated by a ref-bit.
    The ref-bit is set during bpf_lookup_elem().
    2. The inactive list, as its name also says, maintains a less
    active set of nodes. They are the candidates to be removed
    from the bpf_htab when we are running out of free nodes.
    3. The ordering of these two lists acts as a rough clock.
    The tail of the inactive list holds the older nodes, which
    should be released first if the bpf_htab needs a free element.

    * Rotating the Active/Inactive List (of the global LRU list):
    1. It is the basic operation to maintain the LRU property of
    the global list.
    2. The active list is only rotated when the inactive list is running
    low. This idea is similar to the current page cache.
    Inactive running low is currently defined as
    "# of inactive < # of active".
    3. The active list rotation always starts from the tail. It moves
    nodes without the ref-bit set to the head of the inactive list.
    It moves nodes with the ref-bit set back to the head of the active
    list and then clears their ref-bit.
    4. The inactive rotation is pretty simple.
    It walks the inactive list and moves a node back to the head of the
    active list if its ref-bit is set; the ref-bit is cleared after moving
    to the active list.
    If a node does not have the ref-bit set, it is left where it is
    because it is already in the inactive list.
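
    A hedged sketch of points 2 and 3 using the simplified model; the real
    __bpf_lru_list_rotate_active() differs in its details (e.g. count
    bookkeeping), so treat this as an outline only:

    /* Inactive is "running low" when it holds fewer nodes than active;
     * the active list is only rotated when this returns true. */
    static bool inactive_low(const struct lru_list *lru)
    {
            return lru->counts[LRU_INACTIVE] < lru->counts[LRU_ACTIVE];
    }

    /* Rotate the active list from the tail, scanning at most nr_scans
     * nodes: referenced nodes go back to the active head (ref cleared),
     * unreferenced nodes are demoted to the inactive head.
     * Count bookkeeping is omitted in this sketch. */
    static void rotate_active(struct lru_list *lru, unsigned int nr_scans)
    {
            struct lru_node *node, *tmp;
            unsigned int i = 0;

            list_for_each_entry_safe_reverse(node, tmp,
                                             &lru->lists[LRU_ACTIVE], list) {
                    if (node->ref) {
                            node->ref = 0;
                            list_move(&node->list, &lru->lists[LRU_ACTIVE]);
                    } else {
                            list_move(&node->list, &lru->lists[LRU_INACTIVE]);
                    }
                    if (++i == nr_scans)
                            break;
            }
    }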

    * Shrinking the Inactive List (of the global LRU list):
    1. Shrinking is the operation to get free nodes when the bpf_htab is
    full.
    2. It usually only shrinks the inactive list to get free nodes.
    3. During shrinking, it will walk the inactive list from the tail
    and delete the nodes without the ref-bit set from the bpf_htab.
    4. If no free node is found after step (3), it will forcefully take
    one node from the tail of the inactive or active list, "forcefully"
    in the sense that it ignores the ref-bit.
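
    A hedged sketch of the common shrink path (steps 2 and 3), still on the
    simplified model; del_from_map() is an assumed stand-in for unlinking
    the element from the bpf_htab, and step 4's forced eviction is only
    noted in a comment:

    /* Assumed hook: unlink the element that owns 'node' from the htab. */
    void del_from_map(struct lru_node *node);

    /* Walk the inactive list from the tail and reclaim up to tgt_free
     * unreferenced nodes into the free list.  Step 4's forced eviction
     * (ignoring the ref-bit) is not shown here. */
    static unsigned int shrink_inactive(struct lru_list *lru,
                                        unsigned int tgt_free)
    {
            struct lru_node *node, *tmp;
            unsigned int freed = 0;

            list_for_each_entry_safe_reverse(node, tmp,
                                             &lru->lists[LRU_INACTIVE], list) {
                    if (node->ref)
                            continue;       /* referenced: not a candidate */

                    del_from_map(node);
                    list_move(&node->list, &lru->lists[LRU_FREE]);
                    if (++freed == tgt_free)
                            break;
            }

            return freed;
    }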

    * Local List:
    1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
    batch enough operations before acquiring the lock of the
    global LRU.
    2. A local list has two sub-lists, free-list and pending-list.
    3. During bpf_update_elem(), it will try to get a node from the
    free-list of the current CPU's local list.
    4. If the local free-list is empty, it will acquire nodes from the
    global LRU list. The global LRU list can satisfy it either from
    its global free-list or by shrinking the global inactive
    list. Since the global LRU list lock has been acquired anyway,
    it will try to get at most LOCAL_FREE_TARGET elements
    into the local free-list.
    5. When a new element is added to the bpf_htab, it first sits
    on the pending-list of the local list.
    The pending-list is flushed to the global LRU list the next time
    free nodes need to be acquired from the global list.
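
    A hedged sketch of steps 3-5 on the simplified model; the flush and
    refill work done under the global lock is summarized in a comment
    rather than implemented, and all names are illustrative:

    /* Per-cpu local list from points 1 and 2 above. */
    struct lru_locallist {
            struct list_head free;          /* nodes ready for this cpu */
            struct list_head pending;       /* inserted but not yet flushed */
    };

    static struct lru_node *alloc_lru_node(struct lru_locallist *loc,
                                           struct lru_list *global)
    {
            struct lru_node *node;

            if (list_empty(&loc->free)) {
                    raw_spin_lock(&global->lock);
                    /* While holding lru_lock: flush loc->pending into the
                     * global lists, then pull up to LOCAL_FREE_TARGET nodes
                     * from the global free list (shrinking the inactive
                     * list if needed) into loc->free. */
                    raw_spin_unlock(&global->lock);
            }

            if (list_empty(&loc->free))
                    return NULL;

            node = list_first_entry(&loc->free, struct lru_node, list);
            list_move(&node->list, &loc->pending);  /* parked until next flush */
            return node;
    }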

    * Lock Consideration:
    The LRU list has a lock (lru_lock). Each bucket of htab has a
    lock (buck_lock). If both locks need to be acquired together,
    the lock order is always lru_lock -> buck_lock and this only
    happens in the bpf_lru_list.c logic.

    In hashtab.c, the two locks are never held together (i.e. one
    lock is always released before the other is acquired).
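
    A hedged sketch of that nesting rule, again with illustrative names;
    'struct bucket' stands in for the htab bucket and its buck_lock:

    struct bucket {
            raw_spinlock_t lock;            /* the per-bucket buck_lock */
    };

    /* Only the bpf_lru_list.c side ever holds both locks, and always in
     * this order: lru_lock first, then buck_lock. */
    static void evict_under_both_locks(struct lru_list *lru, struct bucket *b)
    {
            unsigned long flags;

            raw_spin_lock_irqsave(&lru->lock, flags);
            raw_spin_lock(&b->lock);
            /* ... unlink a victim element from the bucket, reuse its node ... */
            raw_spin_unlock(&b->lock);
            raw_spin_unlock_irqrestore(&lru->lock, flags);
    }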

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau