08 May, 2018

10 commits


07 May, 2018

9 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for your
    net-next tree. The more relevant updates in this batch are:

    1) Add Maglev support to IPVS. Moreover, store the latest server weight
    in IPVS since this is needed by Maglev, patches from Inju Song.

    2) Preparation works to add iptables flowtable support, patches
    from Felix Fietkau.

    3) Hand flows back to the conntrack slow path when a TCP RST/FIN
    packet is seen, via a new teardown state, also from Felix.

    4) Add support for extended netlink error reporting for nf_tables.

    5) Support for timeouts larger than 23 days in nf_tables, patch from
    Florian Westphal.

    6) Always set an upper limit to dynamic sets, also from Florian.

    7) Allow number generator to make map lookups, from Laura Garcia.

    8) Use hash_32() instead of open-coded hashing in IPVS, from Vincent Bernat.

    9) Extend ip6tables SRH match to support previous, next and last SID,
    from Ahmed Abdelsalam.

    10) Move Passive OS fingerprint support to nf_osf.c, from Fernando Fernandez.

    11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.

    12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
    from Taehee Yoo.

    13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
    Make rt and exthdr built-in too, again from Florian.

    14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
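Item 1's Maglev scheduler fills a prime-sized lookup table by letting each backend claim slots in round-robin order along its own permutation, which keeps disruption minimal when backends change. A minimal user-space sketch of the table-population step, under toy assumptions (toy_hash(), M and N are illustrative stand-ins, not the IPVS code):

```c
#include <assert.h>

#define M 13            /* lookup table size, must be prime */
#define N 3             /* number of backends */

/* toy hash, a stand-in for the kernel's real hash functions */
static unsigned int toy_hash(const char *s, unsigned int seed)
{
    unsigned int h = seed;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* fill table[] so each slot is owned by one backend index */
static void maglev_populate(const char *names[N], int table[M])
{
    unsigned int offset[N], skip[N], next[N];
    int filled = 0;

    for (int i = 0; i < N; i++) {
        offset[i] = toy_hash(names[i], 0xdead) % M;
        skip[i]   = toy_hash(names[i], 0xbeef) % (M - 1) + 1;
        next[i]   = 0;
    }
    for (int j = 0; j < M; j++)
        table[j] = -1;

    /* round-robin: each backend claims its next free preferred slot */
    while (filled < M) {
        for (int i = 0; i < N && filled < M; i++) {
            unsigned int c = (offset[i] + next[i] * skip[i]) % M;
            while (table[c] >= 0) {
                next[i]++;
                c = (offset[i] + next[i] * skip[i]) % M;
            }
            table[c] = i;
            next[i]++;
            filled++;
        }
    }
}
```

Lookups then reduce to table[hash(flow) % M], and removing one backend only reassigns the slots it owned.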
     
    This must now use a 64-bit jiffies value; otherwise we set
    a bogus timeout on 32-bit platforms.

    Fixes: 8e1102d5a1596 ("netfilter: nf_tables: support timeouts larger than 23 days")
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
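The bug above is a truncation: converting a very large timeout with 32-bit jiffies arithmetic wraps, yielding a bogus (much shorter) timeout. A hedged sketch of the failure mode with hypothetical helpers (msecs_to_jiffies32/64() and TOY_HZ are illustrative, not the kernel API):

```c
#include <stdint.h>
#include <assert.h>

#define TOY_HZ 100   /* assumed tick rate, for illustration only */

/* 32-bit jiffies arithmetic truncates very large timeouts */
static uint32_t msecs_to_jiffies32(uint64_t msecs)
{
    return (uint32_t)(msecs / 1000 * TOY_HZ);
}

/* 64-bit jiffies arithmetic keeps the full value, as the fix does */
static uint64_t msecs_to_jiffies64(uint64_t msecs)
{
    return msecs / 1000 * TOY_HZ;
}
```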
     
    The IPCTNL_MSG_CT_GET_STATS netlink command allows monitoring the
    current number of conntrack entries. However, if one wants to compare
    it against the maximum (and detect exhaustion), the only current
    option is to read the sysctl value.

    This patch adds the nf_conntrack_max value to the netlink message,
    simplifying monitoring for applications built on the netlink API.

    Signed-off-by: Florent Fourcot
    Signed-off-by: Pablo Neira Ayuso

    Florent Fourcot
     
  • Add nf_osf_ttl() and nf_osf_match() into nf_osf.c to prepare for
    nf_tables support.

    Signed-off-by: Fernando Fernandez Mancera
    Signed-off-by: Pablo Neira Ayuso

    Fernando Fernandez Mancera
     
  • These macros allow conveniently declaring arrays which use NFT_{RT,CT}_*
    values as indexes.

    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
     
  • Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed
    by SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

    When a firewall receives an SR-encapsulated packet, it should be able
    to identify which node previously processed the packet (previous SID),
    which node is going to process the packet next (next SID), and which
    node is the last to process the packet (last SID), which represents
    the final destination of the packet in the case of inline SR mode.

    An example use case for these features could be a SID list that
    includes two firewalls. When the second firewall receives a packet,
    it can check whether the packet has already been processed by the
    first firewall. Based on that check, it decides to apply all rules,
    apply just a subset of the rules, or skip all rules entirely and
    forward the packet to the next SID.

    This patch extends SRH match to support matching previous SID, next SID,
    and last SID.

    Signed-off-by: Ahmed Abdelsalam
    Signed-off-by: Pablo Neira Ayuso

    Ahmed Abdelsalam
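The previous/next/last SID terminology maps onto the SRH segment list, which RFC 8754 stores in reverse order (segments[0] is the last SID). A toy sketch of one common indexing convention (struct toy_srh and the int SIDs are illustrative; the real match walks IPv6 addresses in the header):

```c
#include <assert.h>

/* Simplified SRv6 segment list: segments[] is in reverse order, so
 * segments[0] is the last SID (the final destination in inline SR
 * mode).  SIDs are plain ints instead of IPv6 prefixes, and the
 * segments_left reading shown is one common convention, for
 * illustration only. */
struct toy_srh {
    const int *segments;
    int segments_left;   /* index of the active / next SID */
};

/* the final destination of the packet */
static int srh_last_sid(const struct toy_srh *h)
{
    return h->segments[0];
}

/* the node that is going to process the packet next */
static int srh_next_sid(const struct toy_srh *h)
{
    return h->segments[h->segments_left];
}

/* the node that already processed the packet */
static int srh_prev_sid(const struct toy_srh *h)
{
    return h->segments[h->segments_left + 1];
}
```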
     
    The modulus in the hash function was limited to > 1, as initially
    there was no reason to hash over just one element.

    Nevertheless, there are certain cases, especially in load balancing,
    where this needs to be addressed.

    This patch fixes the following error:

    Error: Could not process rule: Numerical result out of range
    add rule ip nftlb lb01 dnat to jhash ip saddr mod 1 map { 0: 192.168.0.10 }
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    The solution is to force the hash to 0 when the modulus is 1.

    Signed-off-by: Laura Garcia Liebana

    Laura Garcia Liebana
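The fix above special-cases a modulus of 1: with a single bucket there is nothing to hash. A minimal sketch of the idea (toy_jhash() and hash_bucket() are illustrative stand-ins, not the nft code):

```c
#include <stdint.h>
#include <assert.h>

/* toy stand-in for the kernel's jhash() */
static uint32_t toy_jhash(uint32_t key)
{
    return key * 2654435761u;   /* Knuth multiplicative hash */
}

/* bucket selection with the modulus-1 special case: a single
 * bucket always maps to index 0, no hashing needed */
static uint32_t hash_bucket(uint32_t key, uint32_t modulus)
{
    if (modulus == 1)
        return 0;
    return toy_jhash(key) % modulus;
}
```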
     
    This patch adds a new attribute to the numgen structure to allow
    looking up an element using the number generator as a key.

    For this purpose, different ops have been included to extend the
    current numgen inc functions.

    Currently this is only supported for numgen incremental operations,
    but random will be supported in a follow-up patch.

    Signed-off-by: Laura Garcia Liebana
    Signed-off-by: Pablo Neira Ayuso

    Laura Garcia Liebana
     

05 May, 2018

14 commits

    This slipped through the cracks in the follow-up set to the fib6_info
    flip. Rename rt6_next to fib6_next.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
    If bpf_map_precharge_memlock() did not fail, then we set err to zero.
    However, any subsequent failure from either alloc_percpu() or
    bpf_map_area_alloc() will return ERR_PTR(0), which in
    find_and_alloc_map() causes a NULL pointer dereference.

    In devmap we have the convention of returning -EINVAL on page count
    overflow, so keep the same logic here and simply set err to -ENOMEM
    after a successful bpf_map_precharge_memlock().

    Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
    Signed-off-by: Daniel Borkmann
    Cc: Björn Töpel
    Acked-by: David S. Miller
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
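The bug pattern here is generic: initializing err to 0 and later returning ERR_PTR(err) on an allocation failure produces NULL, which IS_ERR()-checking callers then dereference. A user-space sketch of the buggy and fixed shapes (ERR_PTR()/IS_ERR() are simplified stand-ins for the kernel helpers):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define TOY_ENOMEM 12
#define MAX_ERRNO  4095

/* user-space stand-ins for the kernel's ERR_PTR()/IS_ERR() */
static void *ERR_PTR(long err) { return (void *)err; }
static int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_map;   /* stands in for a successful allocation */

/* buggy shape: err stays 0 after a successful precharge, so a later
 * allocation failure returns ERR_PTR(0) == NULL, and the caller,
 * which only checks IS_ERR(), dereferences NULL */
static void *create_map_buggy(int alloc_fails)
{
    int err = 0;                 /* precharge succeeded */
    if (alloc_fails)
        return ERR_PTR(err);     /* BUG: this is NULL, not an error */
    return &dummy_map;
}

/* fixed shape: assume -ENOMEM once precharge has succeeded */
static void *create_map_fixed(int alloc_fails)
{
    int err = -TOY_ENOMEM;
    if (alloc_fails)
        return ERR_PTR(err);     /* a real error pointer */
    return &dummy_map;
}
```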
     
  • Jakub Kicinski says:

    ====================
    This series centres on NFP offload of bpf_event_output(). The
    first patch allows perf event arrays to be used by offloaded
    programs. The next patch makes the nfp driver keep track of such
    arrays to be able to filter FW events referring to maps.
    Perf event arrays are not device bound; having the driver
    reimplement and manage the perf array seems brittle and unnecessary.

    Patch 4 slightly moves the verifier step which replaces map fds
    with map pointers. This is useful for the nfp JIT since we can then
    easily replace host pointers with NFP table ids (patch 6). This
    allows us to lift the limitation on map helpers having to be used
    with the same map pointer on all paths. A second use of replacing
    fds with real host map pointers is that we can use the host map
    pointer as a key for FW events in perf event array offload.

    Patch 5 adds perf event output offload support for the NFP.

    There are some differences between bpf_event_output() offloaded
    and non-offloaded version. The FW messages which carry events
    may get dropped and reordered relatively easily. The return codes
    from the helper are also not guaranteed to match the host. Users
    are warned about some of those discrepancies with a one time
    warning message to kernel logs.

    bpftool gains the ability to dump perf ring events in a very simple
    format. This was very useful for testing and simple debugging; maybe
    it will be useful to others?

    Last patch is a trivial comment fix.
    ====================

    Signed-off-by: Daniel Borkmann

    Daniel Borkmann
     
    Comments in the verifier refer to free_bpf_prog_info(), which
    seems to have never existed in the tree. Replace it with
    free_used_maps().

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    Users of BPF sooner or later discover the perf_event_output() helpers
    and BPF_MAP_TYPE_PERF_EVENT_ARRAY. Dumping this array type is
    not possible; however, we can add simple reading of perf events.
    Create a new event_pipe subcommand for maps; this subcommand
    will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps.

    Parts of the code are borrowed from samples/bpf/trace_output_user.c.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
  • Move the get_possible_cpus() function to shared code. No functional
    changes.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    Instead of spelling [hex] BYTES everywhere, use DATA as the keyword
    for a generalized value. This will help us keep the messages
    concise when longer commands are added in the future. It will
    also be useful once BTF support comes; we will only have to
    change the definition of DATA.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    The kernel will now replace map fds with actual pointers before
    calling the offload prepare callback. We can identify those pointers
    and replace them with NFP table IDs instead of loading the
    table ID in the code generated for the CALL instruction.

    This allows us to support having the same CALL be used with
    different maps.

    Since we don't want to change the FW ABI, we still need to
    move the TID from R1 to a portion of R0 before the jump.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
  • Add support for the perf_event_output family of helpers.

    The implementation on the NFP will not match the host code exactly.
    The state of the host map and rings is unknown to the device, hence
    the device can't return errors when rings are not installed. The
    device simply packs the data into a firmware notification message
    and sends it over to the host, returning success to the program.

    There is no notion of a host CPU on the device when packets are being
    processed. The device will only offload programs which set
    BPF_F_CURRENT_CPU. Still, if the map index doesn't match the CPU, no
    error will be returned (see above).

    Dropped or lost firmware notification messages will not cause a "lost
    events" event on the perf ring; they are only visible via device
    error counters.

    Firmware notification messages may also get reordered with respect
    to the packets which caused their generation.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    Offloads may find host map pointers more useful than map fds.
    Map pointers can be used to identify the map, while fds are
    only valid within the context of the loading process.

    Jump to skip_full_check on error in case verifier log overflow
    has to be handled (replace_map_fd_with_map_ptr() prints to the
    log; driver prep may do that too in the future).

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    bpf_event_output() is useful for offloads to add events to BPF
    event rings, so export it. Note that the export is placed near the
    stub since tracing is optional and kernel/bpf/core.c is always
    going to be built.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    For asynchronous events originating from the device, like perf event
    output, we need to be able to make sure that objects being referred
    to by the FW message are valid on the host. FW events can get queued
    and reordered. Even if we had a FW message "barrier", we should still
    protect ourselves from bogus FW output.

    Add a reverse-mapping hash table and record in it all raw map
    pointers the FW may refer to. Only record neutral maps, i.e. perf
    event arrays; these are currently the only objects the FW can refer
    to. Use RCU protection on the read side; the update side is under RTNL.

    Since program vs map destruction order is slightly painful for
    offload, simply take an extra reference on all the recorded maps to
    make sure they don't disappear.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
    BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes.
    The map only holds glue to the perf ring, not actual data. Allow
    non-offloaded perf event arrays to be used in offloaded programs.
    The offload driver can extract the events from HW and put them in
    the map for user space to retrieve.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
  • Jiong Wang says:

    ====================
    This patch set clean up some code logic related with managing subprog
    information.

    Parts of the set are inspired by Edwin's code in his RFC:

    "bpf/verifier: subprog/func_call simplifications"

    but with clearer separation so it should be easier to review.

    - Patch 1 unifies the main prog and subprogs. All of them are
    registered in env->subprog_starts.

    - After patch 1, it is clear that subprog_starts and subprog_stack_depth
    could be merged, as both of them now have main and subprogs unified.
    Patch 2 therefore does the merge; all subprog information is centred
    in bpf_subprog_info.

    - Patch 3 goes further to introduce a new fake "exit" subprog which
    serves as an ending marker for the subprog list. We could then turn
    the following code snippet, found across the verifier:

        if (env->subprog_cnt == cur_subprog + 1)
                subprog_end = insn_cnt;
        else
                subprog_end = env->subprog_info[cur_subprog + 1].start;

    into:

        subprog_end = env->subprog_info[cur_subprog + 1].start;

    There is no functional change in this patch set.
    No bpf selftest (either non-jit or jit) regression was found after
    applying it.

    v2:
    - fixed adjust_subprog_starts to also update the fake "exit" subprog
    start.
    - regarding John's suggestion on renaming subprog to prog: I could
    work on a follow-up patch if the change is recognized as worthwhile.
    ====================

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     

04 May, 2018

7 commits

    While trying to support CHECKSUM_COMPLETE for IPv6 fragments,
    I experimented with various hacks in get_fixed_ipv6_csum().
    I must admit I could not find how to implement this :/

    However, get_fixed_ipv6_csum() does a lot of redundant operations,
    calling csum_partial() twice.

    The first csum_partial() computes the checksum of saddr and daddr,
    stored in @csum_pseudo_hdr. This is undone later by the second
    csum_partial(), computed over the whole IPv6 header.

    Then nexthdr is added once, added a second time, then subtracted.

    payload_len is added once, then subtracted.

    Really, all this can be reduced to two csum_add() calls, adding back
    the 6 bytes that were removed by mlx4 when providing hw_checksum in
    the RX descriptor.

    Signed-off-by: Eric Dumazet
    Cc: Saeed Mahameed
    Cc: Tariq Toukan
    Reviewed-by: Saeed Mahameed
    Acked-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Eric Dumazet
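The csum_add() calls referred to above do ones'-complement accumulation with end-around carry. A minimal user-space sketch of that operation (csum_add32() is an illustrative stand-in, not the kernel helper):

```c
#include <stdint.h>
#include <assert.h>

/* ones'-complement accumulate with end-around carry, the operation
 * behind the kernel's csum_add() */
static uint32_t csum_add32(uint32_t csum, uint32_t addend)
{
    uint64_t sum = (uint64_t)csum + addend;
    return (uint32_t)(sum + (sum >> 32));   /* fold the carry back in */
}
```

Adding back the stripped header bytes is then just two such calls on the hardware checksum, instead of two full csum_partial() passes.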
     
  • Ursula Braun says:

    ====================
    net/smc: splice implementation

    Stefan came up with an SMC implementation for splice(). The first
    three patches are preparatory patches; the 4th patch implements
    splice().
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Provide an implementation for splice() when we are using SMC. See
    smc_splice_read() for further details.

    Signed-off-by: Stefan Raspl
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Stefan Raspl
     
  • Preparatory work for splice() support.

    Signed-off-by: Stefan Raspl
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Stefan Raspl
     
    Turn smc_rx_wait_data into a generic function that can be used in
    various places to wait for traffic to complete with varying criteria.

    Signed-off-by: Stefan Raspl
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Stefan Raspl
     
    Some of the conditions for exiting recv() are common to two paths -
    clean up the code by moving the check up so we have it only once.

    Signed-off-by: Stefan Raspl
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Stefan Raspl
     
  • Overlapping changes in selftests Makefile.

    Signed-off-by: David S. Miller

    David S. Miller