01 Sep, 2020

1 commit

  • The functions bq_enqueue(), bq_flush_to_queue(), and bq_xmit_all() in
    {cpu,dev}map.c always return zero. Changing the return type from int
    to void makes the code easier to follow.

    Suggested-by: David Ahern
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200901083928.6199-1-bjorn.topel@gmail.com

    Björn Töpel
     

28 Aug, 2020

1 commit

    Some properties of the inner map are used at verification time. When
    an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure those properties of
    the inserted inner map stay the same as at verification time.

    In particular, the current bpf_map_meta_equal() checks max_entries,
    which turns out to be too restrictive for most maps, since they do not
    use max_entries at verification time. It rules out the use case of
    replacing a smaller inner map with a larger one. Some maps do use
    max_entries during verification, though. For example, map_gen_lookup
    in array_map_ops uses max_entries to generate the inline lookup code.

    To accommodate differences between maps, a map_meta_equal callback is
    added to bpf_map_ops. Each map type can decide what to check when one
    of its maps is used as an inner map at runtime.

    Also, some map types cannot be used as an inner map and are currently
    blacklisted in bpf_map_meta_alloc() in map_in_map.c. It would not be
    unusual for new map types to be unaware that such a blacklist exists.
    This patch therefore enforces an explicit opt-in and only allows a map
    to be used as an inner map if it has implemented the map_meta_equal
    ops. It is based on the discussion in [1].

    In this patch, all maps that support being used as an inner map have
    their map_meta_equal pointing to bpf_map_meta_equal. A later patch
    will relax the max_entries check for most maps. bpf_types.h counts 28
    map types. This patch adds 23 ".map_meta_equal" entries using
    coccinelle. The remaining five are:
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/
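
    As an illustration (not part of the patch), here is a minimal
    userspace sketch of the runtime insertion this check guards, using
    libbpf's bpf_map_create() API, which post-dates the patch; names like
    inner_fd are assumptions of the sketch:

    #include <bpf/bpf.h>   /* libbpf syscall wrappers */

    /* inner_fd: fd of an already-created inner map (assumed) */
    LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
    int outer_fd = bpf_map_create(BPF_MAP_TYPE_HASH_OF_MAPS, "outer",
                                  sizeof(__u32), sizeof(__u32), 8, &opts);

    /* The runtime insertion is where the map type's map_meta_equal()
     * compares the new inner map against the creation-time template. */
    __u32 key = 0;
    int err = bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_ANY);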

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

    Martin KaFai Lau
     

05 Jul, 2020

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-07-04

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 73 non-merge commits during the last 17 day(s) which contain
    a total of 106 files changed, 5233 insertions(+), 1283 deletions(-).

    The main changes are:

    1) bpftool ability to show PIDs of processes having open file descriptors
    for BPF map/program/link/BTF objects, relying on BPF iterator progs
    to extract this info efficiently, from Andrii Nakryiko.

    2) Addition of BPF iterator progs for dumping TCP and UDP sockets to
    seq_files, from Yonghong Song.

    3) Support access to BPF map fields in struct bpf_map from programs
    through BTF struct access, from Andrey Ignatov.

    4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack
    via seq_file from BPF iterator progs, from Song Liu.

    5) Make SO_KEEPALIVE and related options available to bpf_setsockopt()
    helper, from Dmitry Yakunin.

    6) Optimize BPF sk_storage selection of its caching index, from Martin
    KaFai Lau.

    7) Removal of redundant synchronize_rcu()s from BPF map destruction which
    has been a historic leftover, from Alexei Starovoitov.

    8) Several improvements to test_progs to make it easier to create a shell
    loop that invokes each test individually which is useful for some CIs,
    from Jesper Dangaard Brouer.

    9) Fix bpftool prog dump segfault when compiled without skeleton code on
    older clang versions, from John Fastabend.

    10) Bunch of cleanups and minor improvements, from various others.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Jun, 2020

1 commit

  • Set map_btf_name and map_btf_id for all map types so that map fields can
    be accessed by bpf programs.
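
    For context, a sketch in the style of the kernel selftests of what
    this enables: a BPF iterator program reading struct bpf_map fields
    directly (assumes vmlinux.h and libbpf's helpers; illustration only):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("iter/bpf_map")
    int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
    {
        struct bpf_map *map = ctx->map;

        if (!map)
            return 0;
        /* direct field access relies on the map type's btf id being set */
        BPF_SEQ_PRINTF(ctx->meta->seq, "id %u max_entries %u\n",
                       map->id, map->max_entries);
        return 0;
    }

    char _license[] SEC("license") = "GPL";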

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com

    Andrey Ignatov
     

18 Jun, 2020

1 commit

  • Syzkaller discovered that creating a hash of type devmap_hash with a large
    number of entries can hit the memory allocator limit for allocating
    contiguous memory regions. There's really no reason to use kmalloc_array()
    directly in the devmap code, so just switch it to the existing
    bpf_map_area_alloc() function that is used elsewhere.
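
    A rough before/after sketch of the allocation in question (field and
    helper names as in devmap.c; exact arguments may differ):

    /* before: demands physically contiguous memory, which can fail
     * for maps with many entries */
    dtab->dev_index_head = kmalloc_array(dtab->n_buckets,
                                         sizeof(struct hlist_head),
                                         GFP_KERNEL);

    /* after: bpf_map_area_alloc() falls back to vmalloc for large sizes */
    dtab->dev_index_head = bpf_map_area_alloc(
            (u64) dtab->n_buckets * sizeof(struct hlist_head),
            bpf_map_attr_numa_node(attr));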

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Xiumei Mu
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200616142829.114173-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

10 Jun, 2020

2 commits

  • V2:
    - Defer changing BPF-syscall to start at file-descriptor 1
    - Use {} to zero initialise struct.

    The recent commit fbee97feed9b ("bpf: Add support to attach bpf program to a
    devmap entry") introduced the ability to attach (and run) a separate XDP
    bpf_prog for each devmap entry. A bpf_prog is added via a file descriptor.
    As zero is a valid FD, not using the feature requires using the value
    minus-1. The UAPI is extended by tail-extending struct bpf_devmap_val and
    using map->value_size to determine the feature set.

    This will break older userspace applications not using the bpf_prog
    feature. Consider an old userspace app compiled against a newer kernel
    uapi/bpf.h: it will not know that it needs to initialise the member
    bpf_prog.fd to minus-1, so users would be forced to update their source
    code to get programs running on newer kernels.

    This patch removes the minus-1 checks and has zero mean the feature
    isn't used.

    Followup patches, either for the kernel or libbpf, should handle and
    avoid returning file-descriptor zero in the first place.
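
    After this change, a zero-initialised value keeps working for apps
    that don't use the feature; a minimal userspace sketch (map_fd and
    key are assumed):

    /* {} zero-initialises the struct, so bpf_prog.fd is 0, which now
     * means "no program attached" rather than being a valid fd */
    struct bpf_devmap_val val = {};

    val.ifindex = ifindex;                       /* egress device */
    bpf_map_update_elem(map_fd, &key, &val, 0);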

    Fixes: fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/159170950687.2102545.7235914718298050113.stgit@firesoul

    Jesper Dangaard Brouer
     
  • This is a new context that does not handle metadata at the moment, so
    mark data_meta invalid.

    Fixes: fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")
    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200608151723.9539-1-dsahern@kernel.org

    David Ahern
     

02 Jun, 2020

4 commits

    In order to use the standard 'xdp' prefix, rename the convert_to_xdp_frame
    utility routine to xdp_convert_buff_to_frame and replace all occurrences.

    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jesper Dangaard Brouer
    Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org

    Lorenzo Bianconi
     
  • Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
    moment only the device is added. Other fields (queue_index)
    can be added as use cases arise.

    From a UAPI perspective, add egress_ifindex to the xdp context for
    bpf programs to see the Tx device.

    Update the verifier to only allow accesses to egress_ifindex by
    XDP programs with BPF_XDP_DEVMAP expected attach type.
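
    Illustration only: a devmap program reading the new field (the libbpf
    section name used here is an assumption of the sketch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp_devmap")  /* loaded with expected_attach_type BPF_XDP_DEVMAP */
    int egress_filter(struct xdp_md *ctx)
    {
        /* egress_ifindex is only readable with this attach type */
        if (ctx->egress_ifindex == 42)  /* hypothetical ifindex */
            return XDP_DROP;
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";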

    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
    Signed-off-by: Alexei Starovoitov

    David Ahern
     
  • Add BPF_XDP_DEVMAP attach type for use with programs associated with a
    DEVMAP entry.

    Allow DEVMAPs to associate a program with a device entry by adding
    a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
    id, so the fd and id are a union. bpf programs can get access to the
    struct via vmlinux.h.

    The program associated with the fd must have type XDP with expected
    attach type BPF_XDP_DEVMAP. When a program is associated with a device
    index, the program is run on an XDP_REDIRECT and before the buffer is
    added to the per-cpu queue. At this point rxq data is still valid; the
    next patch adds tx device information allowing the program to see both
    ingress and egress device indices.

    XDP generic is skb based and XDP programs do not work with skb's. Block
    this use case by walking the maps used by a program that is to be
    attached via xdpgeneric and failing if any of them are DEVMAP /
    DEVMAP_HASH. Similarly, block attaching BPF_XDP_DEVMAP programs
    directly to devices.
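
    A userspace sketch of associating a program with a devmap entry,
    based on the union described above (the fd is written, the id is what
    reads report; fds and ifindex are assumed):

    struct bpf_devmap_val val = {
        .ifindex     = egress_ifindex,  /* egress device */
        .bpf_prog.fd = devmap_prog_fd,  /* XDP prog, BPF_XDP_DEVMAP */
    };

    bpf_map_update_elem(map_fd, &key, &val, 0);
    /* a subsequent lookup reports val.bpf_prog.id, not the fd */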

    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
    Signed-off-by: Alexei Starovoitov

    David Ahern
     
  • Add 'struct bpf_devmap_val' to formalize the expected values that can
    be passed in for a DEVMAP. Update devmap code to use the struct.

    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200529220716.75383-2-dsahern@kernel.org
    Signed-off-by: Alexei Starovoitov

    David Ahern
     

23 Apr, 2020

1 commit

  • Export the DEV_MAP_BULK_SIZE macro to the header file so that drivers
    can directly use it as the maximum number of xdp_frames received in the
    .ndo_xdp_xmit() callback.
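
    A sketch of how a driver might rely on the exported bound (driver
    names hypothetical):

    #include <linux/bpf.h>        /* DEV_MAP_BULK_SIZE */
    #include <linux/netdevice.h>

    static int mydrv_xdp_xmit(struct net_device *dev, int n,
                              struct xdp_frame **frames, u32 flags)
    {
        /* n is bounded by DEV_MAP_BULK_SIZE, so a fixed-size on-stack
         * descriptor array is safe */
        void *descs[DEV_MAP_BULK_SIZE];

        /* ... fill descs[0..n-1] and kick the hardware ... */
        return n;  /* number of frames accepted */
    }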

    Signed-off-by: Ioana Ciornei
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Ioana Ciornei
     

27 Jan, 2020

2 commits

    Now that we depend on call_rcu() and synchronize_rcu() to also wait
    for preempt-disabled regions to complete, the RCU read critical
    section in __dev_map_flush() is no longer required, except in a few
    special cases in drivers that need it for other reasons.

    These originally ensured the map reference was safe while a map was
    also being freed, and additionally that bpf program updates via
    ndo_bpf did not happen while flush updates were in flight. But under
    the new rules flush can only be called from preempt-disabled NAPI
    context. The synchronize_rcu() from the map free path and the
    call_rcu() from the delete path ensure the reference there is safe.
    So let's remove the rcu_read_lock and rcu_read_unlock pair to avoid
    any confusion around how this is being protected.

    If the rcu_read_lock were required, it would mean errors in the above
    logic, and the original patch would also be wrong.

    Now that we have done the above, we put the rcu_read_lock in the
    driver code where it is needed, in a driver-dependent way. I think
    this helps readability of the code, so we know where and why we are
    taking read locks. Most drivers will not need rcu_read_locks here,
    and further, XDP drivers already have rcu_read_locks in their code
    paths for reading xdp programs on the RX side, so this makes it
    symmetric: we don't end up with half of the rcu critical sections
    defined in the driver and the other half in devmap.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com

    John Fastabend
     
    Now that we rely on synchronize_rcu and call_rcu waiting to
    exit preempt-disable regions (NAPI), let's update the comments
    to reflect this.

    Fixes: 0536b85239b84 ("xdp: Simplify devmap cleanup")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1580084042-11598-2-git-send-email-john.fastabend@gmail.com

    John Fastabend
     

24 Jan, 2020

1 commit

  • head is traversed using hlist_for_each_entry_rcu outside an RCU
    read-side critical section but under the protection of dtab->index_lock.

    Hence, add corresponding lockdep expression to silence false-positive
    lockdep warnings, and harden RCU lists.
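
    The shape of the fix (member names as in devmap.c):

    /* tell RCU that holding dtab->index_lock also makes the traversal
     * safe, silencing the false-positive lockdep splat */
    hlist_for_each_entry_rcu(dev, head, index_hlist,
                             lockdep_is_held(&dtab->index_lock)) {
        if (dev->idx == idx)
            return dev;
    }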

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Signed-off-by: Amol Grover
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com

    Amol Grover
     

17 Jan, 2020

3 commits

    Now that we don't have a reference to a devmap when flushing the device
    bulk queue, let's change the devmap_xmit tracepoint to remove the
    map_id and map_index fields entirely. Rearrange the fields so 'drops' and
    'sent' stay in the same position in the tracepoint struct, to make it
    possible for the xdp_monitor utility to read both the old and the new
    format.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/157918768613.1458396.9165902403373826572.stgit@toke.dk

    Jesper Dangaard Brouer
     
  • Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
    we can re-use the bulking for the non-map version of the bpf_redirect()
    helper. This is a simple matter of having xdp_do_redirect_slow() queue the
    frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

    Unfortunately we can't make the bpf_redirect() helper return an error if
    the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
    have a reference to the network namespace of the ingress device at the time
    the helper is called. So we have to leave it as-is and keep the device
    lookup in xdp_do_redirect_slow().

    Since this leaves less reason to keep the non-map redirect code in a
    separate function, we get rid of the xdp_do_redirect_slow() function
    entirely. This does lose us the tracepoint disambiguation, but fortunately
    the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
    entry structures. This means both can contain a map index, so we can just
    amend the tracepoint definitions so we always emit the xdp_redirect(_err)
    tracepoints, but with the map ID only populated if a map is present. This
    means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
    the definitions around in case someone is still listening for them.

    With this change, the performance of the xdp_redirect sample program goes
    from 5Mpps to 8.4Mpps (a 68% increase).

    Since the flush functions are no longer map-specific, rename the flush()
    functions to drop _map from their names. One of the renamed functions is
    the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
    keep from having to update all drivers, use a #define to keep the old name
    working, and only update the virtual drivers in this patch.
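
    Per the description, the old driver-facing name survives as an alias,
    roughly:

    /* sketch: new name plus a compat define for unconverted drivers */
    void xdp_do_flush(void);
    #define xdp_do_flush_map xdp_do_flush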

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • Commit 96360004b862 ("xdp: Make devmap flush_list common for all map
    instances"), changed devmap flushing to be a global operation instead of a
    per-map operation. However, the queue structure used for bulking was still
    allocated as part of the containing map.

    This patch moves the devmap bulk queue into struct net_device. The
    motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
    which will be changed in a subsequent commit. To avoid other fields of
    struct net_device moving to different cache lines, we also move a couple of
    other members around.

    We defer the actual allocation of the bulk queue structure until the
    NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
    ndo_xdp_xmit support before allocating the structure, which is not possible
    at the time struct net_device is allocated. However, we keep the freeing in
    free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.

    Because of this change, we lose the reference back to the map that
    originated the redirect, so change the tracepoint to always return 0 as the
    map ID and index. Otherwise no functional change is intended with this
    patch.

    After this patch, the relevant part of struct net_device looks like this,
    according to pahole:

    /* --- cacheline 14 boundary (896 bytes) --- */
    struct netdev_queue * _tx __attribute__((__aligned__(64))); /* 896 8 */
    unsigned int num_tx_queues; /* 904 4 */
    unsigned int real_num_tx_queues; /* 908 4 */
    struct Qdisc * qdisc; /* 912 8 */
    unsigned int tx_queue_len; /* 920 4 */
    spinlock_t tx_global_lock; /* 924 4 */
    struct xdp_dev_bulk_queue * xdp_bulkq; /* 928 8 */
    struct xps_dev_maps * xps_cpus_map; /* 936 8 */
    struct xps_dev_maps * xps_rxqs_map; /* 944 8 */
    struct mini_Qdisc * miniq_egress; /* 952 8 */
    /* --- cacheline 15 boundary (960 bytes) --- */
    struct hlist_head qdisc_hash[16]; /* 960 128 */
    /* --- cacheline 17 boundary (1088 bytes) --- */
    struct timer_list watchdog_timer; /* 1088 40 */

    /* XXX last struct has 4 bytes of padding */

    int watchdog_timeo; /* 1128 4 */

    /* XXX 4 bytes hole, try to pack */

    struct list_head todo_list; /* 1136 16 */
    /* --- cacheline 18 boundary (1152 bytes) --- */

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Björn Töpel
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk

    Toke Høiland-Jørgensen
     

20 Dec, 2019

2 commits

    The devmap flush list is used to track entries that need to be
    flushed via the xdp_do_flush_map() function. This list used to be
    per-map, but there is really no reason for that. Instead, make the
    flush list global for all devmaps, which simplifies __dev_map_flush()
    and dev_map_init_map().
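
    After this change, the flush path looks roughly like this (close to
    the actual devmap.c, but abbreviated):

    static DEFINE_PER_CPU(struct list_head, dev_map_flush_list);

    void __dev_map_flush(void)
    {
        struct list_head *flush_list = this_cpu_ptr(&dev_map_flush_list);
        struct xdp_bulk_queue *bq, *tmp;

        rcu_read_lock();
        list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
            bq_xmit_all(bq, XDP_XMIT_FLUSH);
        rcu_read_unlock();
    }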

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20191219061006.21980-6-bjorn.topel@gmail.com

    Björn Töpel
     
    After the RCU flavor consolidation [1], call_rcu() and
    synchronize_rcu() wait for preempt-disable regions (NAPI) in addition
    to the read-side critical sections. As a result, the cleanup
    code in devmap can be simplified:

    * There is no longer a need to flush in __dev_map_entry_free, since we
    know that this has been done when the call_rcu() callback is
    triggered.

    * When freeing the map, there is no need to explicitly wait for a
    flush. It's guaranteed to be done after the synchronize_rcu() call
    in dev_map_free(). The rcu_barrier() is still needed, so that the
    map is not freed prior to the elements.

    [1] https://lwn.net/Articles/777036/

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20191219061006.21980-2-bjorn.topel@gmail.com

    Björn Töpel
     

25 Nov, 2019

1 commit

  • Tetsuo pointed out that it was not only the device unregister hook that was
    broken for devmap_hash types, it was also cleanup on map free. So better
    fix this as well.

    While we're at it, there's no reason to allocate the netdev_map array for
    DEVMAP_HASH, so skip that and adjust the cost accordingly.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20191121133612.430414-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

22 Oct, 2019

1 commit

  • It seems I forgot to add handling of devmap_hash type maps to the device
    unregister hook for devmaps. This omission causes devices to not be
    properly released, which causes hangs.

    Fix this by adding the missing handler.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20191019111931.2981954-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

19 Oct, 2019

1 commit

  • Tetsuo pointed out that without an explicit cast, the cost calculation for
    devmap_hash type maps could overflow on 32-bit builds. This adds the
    missing cast.
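
    The one-line nature of the fix, sketched (variable names as in
    devmap.c):

    /* force 64-bit arithmetic; without the cast the multiplication is
     * done in 32 bits on 32-bit builds and can wrap around */
    cost += (u64) dtab->n_buckets * sizeof(struct hlist_head);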

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20191017105702.2807093-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

16 Sep, 2019

1 commit

  • syzbot found a crash in dev_map_hash_update_elem(), when replacing an
    element with a new one. Jesper correctly identified the cause of the crash
    as a race condition between the initial lookup in the map (which is done
    before taking the lock), and the removal of the old element.

    Rather than just add a second lookup into the hashmap after taking the
    lock, fix this by reworking the function logic to take the lock before the
    initial lookup.
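
    The reworked ordering, roughly (lock and helper names as in
    devmap.c; surrounding error handling elided):

    spin_lock_irqsave(&dtab->index_lock, flags);

    /* the lookup now happens under the lock, so the old element cannot
     * be removed between the lookup and the replacement below */
    old_dev = __dev_map_hash_lookup_elem(map, idx);
    if (old_dev)
        hlist_del_rcu(&old_dev->index_hlist);
    hlist_add_head_rcu(&dev->index_hlist,
                       dev_map_index_hash(dtab, idx));

    spin_unlock_irqrestore(&dtab->index_lock, flags);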

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toke Høiland-Jørgensen
     

30 Jul, 2019

2 commits

    A common pattern when using xdp_redirect_map() is to create a device map
    where the lookup key is simply the ifindex. Because device maps are arrays,
    this leaves holes in the map, and the map has to be sized to fit the
    largest ifindex, regardless of how many devices are actually needed in
    the map.

    This patch adds a second type of device map where the key is looked up
    using a hashmap, instead of being used as an array index. This allows maps
    to be densely packed, so they can be smaller.
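
    Userspace sketch of the new map type, keyed directly by ifindex
    (bpf_map_create() is the modern libbpf entry point and post-dates
    this patch):

    int map_fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP_HASH, "tx_devs",
                                sizeof(__u32), sizeof(__u32),
                                64 /* max_entries */, NULL);
    __u32 key = ifindex, val = ifindex;  /* keys need not be dense */

    bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);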

    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Yonghong Song
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     
  • The subsequent patch to add a new devmap sub-type can re-use much of the
    initialisation and allocation code, so refactor it into separate functions.

    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Yonghong Song
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     

29 Jun, 2019

2 commits

  • We don't currently allow lookups into a devmap from eBPF, because the map
    lookup returns a pointer directly to the dev->ifindex, which shouldn't be
    modifiable from eBPF.

    However, being able to do lookups in devmaps is useful to know (e.g.)
    whether forwarding to a specific interface is enabled. Currently, programs
    work around this by keeping a shadow map of another type which indicates
    whether a map index is valid.

    Since we now have a flag to make maps read-only from the eBPF side, we can
    simply lift the lookup restriction if we make sure this flag is always set.
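
    What the lifted restriction allows, sketched as a modern BTF-style
    program (the map-definition style post-dates this patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
    } tx_devs SEC(".maps");

    SEC("xdp")
    int xdp_fwd(struct xdp_md *ctx)
    {
        __u32 key = 0;  /* hypothetical egress slot */

        /* the lookup is allowed now; the value is read-only to us */
        if (!bpf_map_lookup_elem(&tx_devs, &key))
            return XDP_PASS;  /* forwarding not enabled for this slot */
        return bpf_redirect_map(&tx_devs, key, 0);
    }

    char _license[] SEC("license") = "GPL";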

    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Jonathan Lemon
    Acked-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann

    Toke Høiland-Jørgensen
     
  • The socket map uses a linked list instead of a bitmap to keep track of
    which entries to flush. Do the same for devmap and cpumap, as this means we
    don't have to care about the map index when enqueueing things into the
    map (and so we can cache the map lookup).

    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Jonathan Lemon
    Acked-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann

    Toke Høiland-Jørgensen
     

20 Jun, 2019

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-06-19

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) new SO_REUSEPORT_DETACH_BPF setsockopt, from Martin.

    2) BTF based map definition, from Andrii.

    3) support bpf_map_lookup_elem for xskmap, from Jonathan.

    4) bounded loops and scalar precision logic in the verifier, from Alexei.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Jun, 2019

2 commits

  • Honestly all the conflicts were simple overlapping changes,
    nothing really interesting to report.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:
    "Lots of bug fixes here:

    1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

    2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
    Crispin.

    3) Use after free in psock backlog workqueue, from John Fastabend.

    4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
    Salem.

    5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

    6) Network header needs to be set for packet redirect in nfp, from
    John Hurley.

    7) Fix udp zerocopy refcnt, from Willem de Bruijn.

    8) Don't assume linear buffers in vxlan and geneve error handlers,
    from Stefano Brivio.

    9) Fix TOS matching in mlxsw, from Jiri Pirko.

    10) More SCTP cookie memory leak fixes, from Neil Horman.

    11) Fix VLAN filtering in rtl8366, from Linus Walleij.

    12) Various TCP SACK payload size and fragmentation memory limit fixes
    from Eric Dumazet.

    13) Use after free in pneigh_get_next(), also from Eric Dumazet.

    14) LAPB control block leak fix from Jeremy Sowden"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
    lapb: fixed leak of control-blocks.
    tipc: purge deferredq list for each grp member in tipc_group_delete
    ax25: fix inconsistent lock state in ax25_destroy_timer
    neigh: fix use-after-free read in pneigh_get_next
    tcp: fix compile error if !CONFIG_SYSCTL
    hv_sock: Suppress bogus "may be used uninitialized" warnings
    be2net: Fix number of Rx queues used for flow hashing
    net: handle 802.1P vlan 0 packets properly
    tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
    tcp: add tcp_min_snd_mss sysctl
    tcp: tcp_fragment() should apply sane memory limits
    tcp: limit payload size of sacked skbs
    Revert "net: phylink: set the autoneg state in phylink_phy_change"
    bpf: fix nested bpf tracepoints with per-cpu data
    bpf: Fix out of bounds memory access in bpf_sk_storage
    vsock/virtio: set SOCK_DONE on peer shutdown
    net: dsa: rtl8366: Fix up VLAN filtering
    net: phylink: set the autoneg state in phylink_phy_change
    net: add high_order_alloc_disable sysctl/static key
    tcp: add tcp_tx_skb_cache sysctl
    ...

    Linus Torvalds
     

15 Jun, 2019

3 commits

  • .ndo_xdp_xmit() assumes it is called under RCU. For example virtio_net
    uses RCU to detect that it has set up the resources for tx. The assumption
    accidentally broke when introducing bulk queue in devmap.
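
    The invariant being restored, sketched (whether the read-side section
    sits exactly here or around the whole flush is a detail of the actual
    patch):

    rcu_read_lock();  /* .ndo_xdp_xmit() implementations may rely on RCU */
    sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
    rcu_read_unlock();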

    Fixes: 5d053f9da431 ("bpf: devmap prepare xdp frames for bulking")
    Reported-by: David Ahern
    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • dev_map_free() forgot to free bulk queue when freeing its entries.

    Fixes: 5d053f9da431 ("bpf: devmap prepare xdp frames for bulking")
    Signed-off-by: Toshiaki Makita
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • dev_map_free() waits for flush_needed bitmap to be empty in order to
    ensure all flush operations have completed before freeing its entries.
    However the corresponding clear_bit() was called before using the
    entries, so the entries could be used after free.

    All access to the entries needs to be done before clearing the bit.
    It seems commit a5e2da6e9787 ("bpf: netdev is never null in
    __dev_map_flush") accidentally changed the clear_bit() and memory access
    order.

    Note that the problem happens only in __dev_map_flush(), not in
    dev_map_flush_old(). dev_map_flush_old() is called only after nulling
    out the corresponding netdev_map entry, so dev_map_free() never frees
    the entry thus no such race happens there.
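
    The required ordering in __dev_map_flush(), sketched (extra arguments
    to bq_xmit_all() elided):

    /* all accesses to the entry must come first ... */
    bq_xmit_all(dev, bq);
    /* ... and only then may the flush bit be cleared, since
     * dev_map_free() uses it to decide when freeing is safe */
    __clear_bit(bit, bitmap);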

    Fixes: a5e2da6e9787 ("bpf: netdev is never null in __dev_map_flush")
    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

04 Jun, 2019

1 commit

    The variable err is assigned the value -EINVAL, which is never
    read before it is re-assigned a new value later on. The assignment is
    redundant and can be removed.

    Addresses-Coverity: ("Unused value")
    Signed-off-by: Colin Ian King
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Colin Ian King
     

01 Jun, 2019

3 commits

    Most bpf map types do similar checks and bytes-to-pages
    conversion during memory allocation and charging.

    Let's unify these checks by moving them into bpf_map_charge_init().

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • In order to unify the existing memlock charging code with the
    memcg-based memory accounting, which will be added later, let's
    rework the current scheme.

    Currently the following design is used:
    1) .alloc() callback optionally checks if the allocation will likely
    succeed using bpf_map_precharge_memlock()
    2) .alloc() performs actual allocations
    3) .alloc() callback calculates map cost and sets map.memory.pages
    4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
    and performs actual charging; in case of failure the map is
    destroyed

    1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
    performs uncharge and releases the user
    2) .map_free() callback releases the memory

    The scheme can be simplified and made more robust:
    1) .alloc() calculates map cost and calls bpf_map_charge_init()
    2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
    3) .alloc() performs actual allocations

    1) .map_free() callback releases the memory
    2) bpf_map_charge_finish() performs uncharge and releases the user

    The new scheme also allows reusing the bpf_map_charge_init()/finish()
    functions for memcg-based accounting. Because charges are performed
    before actual allocations, and uncharges after freeing the memory,
    no bogus memory pressure can be created.

    In cases when the map structure is not available (e.g. it's not
    created yet, or is already destroyed), an on-stack bpf_map_memory
    structure is used. The charge can be transferred with the
    bpf_map_charge_move() function.
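
    The new allocation-time flow from the list above, as a rough sketch
    (per-type cost calculation and some error handling elided; numa_node
    is assumed):

    struct bpf_map_memory mem;
    u64 cost = sizeof(*dtab);  /* plus per-type costs, elided */
    int err;

    /* 1) charge before any allocation */
    err = bpf_map_charge_init(&mem, cost);
    if (err)
        return ERR_PTR(err);

    /* 2) perform the actual allocations; on failure, undo the charge */
    dtab = bpf_map_area_alloc(sizeof(*dtab), numa_node);
    if (!dtab) {
        bpf_map_charge_finish(&mem);
        return ERR_PTR(-ENOMEM);
    }

    /* 3) on success, transfer the charge into the map */
    bpf_map_charge_move(&dtab->map.memory, &mem);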

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • Group "user" and "pages" fields of bpf_map into the bpf_map_memory
    structure. Later it can be extended with "memcg" and other related
    information.

    The main reason for such a change (besides cosmetics) is to pass the
    bpf_map_memory structure to charging functions before the actual
    allocation of the bpf_map.
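
    The grouped structure, as described (a small sketch):

    struct bpf_map_memory {
        u32 pages;                 /* charged size in pages */
        struct user_struct *user;  /* user being charged */
    };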

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

14 May, 2019

1 commit

    synchronize_rcu() is fine when the rcu callbacks only need
    to free memory (kfree_rcu(), or rcu callbacks that simply call kfree()).

    __dev_map_entry_free() is a bit more complex, so we need to make
    sure that queued __dev_map_entry_free() callbacks have completed.

    syzbot report:

    BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
    [inline]
    BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
    kernel/bpf/devmap.c:379
    Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18

    CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1b9/0x294 lib/dump_stack.c:113
    print_address_description+0x6c/0x20b mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
    dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
    __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
    __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
    rcu_do_batch kernel/rcu/tree.c:2558 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
    __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
    rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
    __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
    run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
    smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
    kthread+0x345/0x410 kernel/kthread.c:240
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

    Allocated by task 6675:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
    kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
    kmalloc include/linux/slab.h:513 [inline]
    kzalloc include/linux/slab.h:706 [inline]
    dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
    find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
    map_create+0x393/0x1010 kernel/bpf/syscall.c:453
    __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
    __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
    __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
    do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 26:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xd9/0x260 mm/slab.c:3813
    dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
    bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
    process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
    worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
    kthread+0x345/0x410 kernel/kthread.c:240
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

    The buggy address belongs to the object at ffff8801b8da37c0
    which belongs to the cache kmalloc-512 of size 512
    The buggy address is located 264 bytes inside of
    512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
    The buggy address belongs to the page:
    page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
    index:0xffff8801b8da3540
    flags: 0x2fffc0000000100(slab)
    raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
    raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    > ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
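
    Given the description above, the essence of the fix in dev_map_free()
    (sketch):

    /* synchronize_rcu() only waits for readers; rcu_barrier() also
     * waits for already-queued __dev_map_entry_free() callbacks */
    rcu_barrier();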

    Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
    Acked-by: Toke Høiland-Jørgensen
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Eric Dumazet