10 Mar, 2016

10 commits

  • Implement kcm_sendpage. Set sendpage to kcm_sendpage in both the
    dgram and seqpacket ops.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Implement kcm_splice_read. This is supported only for seqpacket
    sockets. Add kcm_seqpacket_ops and set splice_read to kcm_splice_read.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This patch adds various counters for KCM. These include counters for
    messages and bytes received or sent, as well as counters for number of
    attached/unattached TCP sockets and other error or edge events.

    The statistics are exposed via a proc interface. /proc/net/kcm provides
    statistics per KCM socket and per psock (attached TCP sockets).
    /proc/net/kcm_stats provides aggregate statistics.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This module implements the Kernel Connection Multiplexor.

    Kernel Connection Multiplexor (KCM) is a facility that provides a
    message based interface over TCP for generic application protocols.
    With KCM an application can efficiently send and receive application
    protocol messages over TCP using datagram sockets.

    For more information see the included Documentation/networking/kcm.txt
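
    As a rough illustration of the datagram-style API (a sketch based on the
    included documentation, not an excerpt from this patch; error handling is
    abbreviated and the AF_KCM fallback define is an assumption):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/sockios.h>
    #include <linux/kcm.h>

    #ifndef AF_KCM
    #define AF_KCM 41                       /* assumed value from linux/socket.h */
    #endif

    /* Create a KCM socket and attach an already-connected TCP socket plus a
     * BPF program (bpf_prog_fd) that finds message boundaries.  Afterwards,
     * sendmsg()/recvmsg() on kcm_fd carry whole application-protocol messages.
     */
    int kcm_attach_example(int tcp_fd, int bpf_prog_fd)
    {
            struct kcm_attach attach = {
                    .fd = tcp_fd,
                    .bpf_fd = bpf_prog_fd,
            };
            int kcm_fd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);

            if (kcm_fd < 0)
                    return -1;
            if (ioctl(kcm_fd, SIOCKCMATTACH, &attach) < 0) {
                    close(kcm_fd);
                    return -1;
            }
            return kcm_fd;
    }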

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Create a common kernel function to get the number of bytes available
    to read on a TCP socket. This is based on the code in the INQ getsockopt,
    and that getsockopt now calls the new function.
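
    Conceptually (a simplified sketch only, not the helper added here; the
    real function must also account for urgent data and a received FIN), the
    available byte count falls out of two sequence numbers tracked by TCP:

    #include <net/tcp.h>

    /* Bytes readable on a TCP socket: in-order data received (rcv_nxt)
     * minus data already consumed by the application (copied_seq).
     */
    static inline int tcp_bytes_avail_sketch(const struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);

            if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
                    return 0;       /* nothing readable before establishment */

            return tp->rcv_nxt - tp->copied_seq;
    }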

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add walking of fragments in __skb_splice_bits.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
    indicate that more messages will follow (i.e. a batch of messages is
    being sent). This is similar to MSG_MORE except that the following
    messages are not merged into one packet; they are sent individually.
    sendmmsg is updated so that each contained message except for the
    last one is marked as MSG_BATCH.

    MSG_BATCH is a performance optimization in cases where a socket
    implementation can benefit by transmitting packets in a batch.
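
    A possible userspace usage sketch (send_batch() is a hypothetical helper;
    the MSG_BATCH fallback define is an assumption taken from this series):

    #include <sys/socket.h>

    #ifndef MSG_BATCH
    #define MSG_BATCH 0x40000               /* sendmmsg(): more messages coming */
    #endif

    /* Send n messages, flagging all but the last with MSG_BATCH so the
     * socket implementation may defer transmission work until the batch
     * is complete.
     */
    static int send_batch(int fd, struct msghdr *msgs, int n)
    {
            int i;

            for (i = 0; i < n; i++) {
                    int flags = (i < n - 1) ? MSG_BATCH : 0;

                    if (sendmsg(fd, &msgs[i], flags) < 0)
                            return -1;
            }
            return 0;
    }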

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This patch allows setting MSG_EOR in each individual msghdr passed
    in sendmmsg. This allows a sendmmsg to send multiple messages when
    using SOCK_SEQPACKET.
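
    For illustration, a caller might look roughly like this (a sketch assuming,
    as described above, that the per-message flag is taken from each
    msg_hdr.msg_flags; send_two_records() is a hypothetical helper):

    #define _GNU_SOURCE
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <string.h>

    /* Send two complete records in one sendmmsg() call on a SOCK_SEQPACKET
     * socket by marking each message with MSG_EOR.
     */
    static int send_two_records(int fd, struct iovec *iov0, struct iovec *iov1)
    {
            struct mmsghdr msgs[2];

            memset(msgs, 0, sizeof(msgs));
            msgs[0].msg_hdr.msg_iov = iov0;
            msgs[0].msg_hdr.msg_iovlen = 1;
            msgs[0].msg_hdr.msg_flags = MSG_EOR;    /* end of first record */
            msgs[1].msg_hdr.msg_iov = iov1;
            msgs[1].msg_hdr.msg_iovlen = 1;
            msgs[1].msg_hdr.msg_flags = MSG_EOR;    /* end of second record */

            return sendmmsg(fd, msgs, 2, 0);
    }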

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Export it for cases where we want to create sockets by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This is a convenience function that returns the next entry in an RCU
    list or NULL if at the end of the list.
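
    Usage looks roughly like this (a sketch; the helper is presumably
    list_next_or_null_rcu(), but treat the name, signature and the struct used
    here as assumptions):

    #include <linux/rculist.h>

    struct psock_entry {
            struct list_head list;
            /* ... payload ... */
    };

    /* Caller must hold rcu_read_lock(). Returns the entry after @cur on
     * @head, or NULL if @cur is the last entry.
     */
    static struct psock_entry *next_entry(struct list_head *head,
                                          struct psock_entry *cur)
    {
            return list_next_or_null_rcu(head, &cur->list,
                                         struct psock_entry, list);
    }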

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

09 Mar, 2016

30 commits

  • performance tests for hash map and per-cpu hash map
    with and without pre-allocation

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • increase stress by also calling bpf_get_stackid() from
    various *spin* functions

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • this test calls bpf programs from different contexts:
    from inside of slub, from rcu, from pretty much everywhere,
    since it kprobes all spin_lock functions.
    It stresses the bpf hash and percpu map pre-allocation and
    deallocation logic and the call_rcu mechanisms.
    The user space part adds more stress by walking and deleting map elements.

    Note that due to the nature of bpf_load.c, the earlier kprobe+bpf programs
    are already active while the loader loads new programs, creates new kprobes
    and attaches them.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Helpers like ip_tunnel_info_opts_{get,set}() are only available if
    CONFIG_INET is set, thus add empty definitions to the header for
    the !CONFIG_INET case, where other empty inline helpers are already
    defined.

    This avoids an ifdef kludge inside filter.c; moreover, vxlan and geneve
    themselves, the only places this facility can be used with, already
    depend on INET being set. For the !INET case TUNNEL_OPTIONS_PRESENT
    would never be set in flags.
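
    The pattern is simply an empty static inline pair under the !CONFIG_INET
    branch of the header; a sketch (the exact parameter lists here are
    assumptions based on the CONFIG_INET variants):

    #else /* !CONFIG_INET */

    static inline void ip_tunnel_info_opts_get(void *to,
                                               const struct ip_tunnel_info *info)
    {
    }

    static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
                                               const void *from, int len)
    {
    }

    #endif /* CONFIG_INET */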

    Fixes: 14ca0751c96f ("bpf: support for access to tunnel options")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Alexei Starovoitov says:

    ====================
    bpf: map pre-alloc

    v1->v2:
    . fix few issues spotted by Daniel
    . converted stackmap into pre-allocation as well
    . added a workaround for lockdep false positive
    . added pcpu_freelist_populate to be used by hashmap and stackmap

    this patch set switches the bpf hash map to use pre-allocation by default
    and introduces a BPF_F_NO_PREALLOC flag to keep the old behavior for cases
    where full map pre-allocation is too memory expensive.

    Some time back Daniel Wagner reported crashes when a bpf hash map is
    used to compute time intervals between preempt_disable->preempt_enable,
    and recently Tom Zanussi reported a deadlock in the iovisor/bcc/funccount
    tool when it is used to count the number of invocations of kernel
    '*spin*' functions. Both problems are due to the recursive use of
    slub and can only be solved by pre-allocating all map elements.

    A lot of different solutions were considered. Many were implemented,
    but in the end pre-allocation seems to be the only feasible answer.
    As far as pre-allocation goes, it was also implemented 4 different ways:
    - simple free-list with single lock
    - percpu_ida with optimizations
    - blk-mq-tag variant customized for bpf use case
    - percpu_freelist
    For bpf style of alloc/free patterns percpu_freelist is the best
    and implemented in this patch set.
    Detailed performance numbers are in patch 3.
    Patch 2 introduces percpu_freelist.
    Patch 1 fixes simple deadlocks due to missing recursion checks.
    Patch 5 converts stackmap to pre-allocation.
    Patches 6-9 prepare the test infra.
    Patch 10 is a stress test for the hash map infra; it attaches to spin_lock
    functions, and bpf_map_update/delete are called from different contexts.
    Patch 11 is a stress test for bpf_get_stackid.
    Patch 12 is a map performance test.

    Reported-by: Daniel Wagner
    Reported-by: Tom Zanussi
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • extend test coverage to include pre-allocated and run-time allocated maps

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • note that the old loader is compatible with the new kernel;
    map_flags are optional

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • move ksym search from offwaketime into library to be reused
    in other tests

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • map creation is typically the first operation to fail when rlimits are
    too low, there is not enough memory, etc.
    Make this failure scenario more verbose

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • It was observed that calling bpf_get_stackid() from a kprobe inside
    slub or from spin_unlock causes similar deadlock as with hashmap,
    therefore convert stackmap to use pre-allocated memory.

    The call_rcu is no longer a feasible mechanism, since delayed freeing
    causes bpf_get_stackid() to fail unpredictably when the number of actual
    stacks is significantly less than the user requested max_entries.
    Since elements are no longer freed into slub, we can push elements into
    the freelist immediately and let them be recycled.
    However a very unlikely race between user space map_lookup() and
    program-side recycling is possible:

    cpu0                                  cpu1
    ----                                  ----
    user does lookup(stackidX)
    starts copying ips into buffer
                                          delete(stackidX)
                                          calls bpf_get_stackid()
                                          which recycles the element and
                                          overwrites it with a new stack trace

    To avoid user space seeing a partial stack trace consisting of two
    merged stack traces, do bucket = xchg(, NULL); copy; xchg(, bucket);
    to preserve consistent stack trace delivery to user space.
    Now we can move the memset(,0) of the left-over element value from the
    critical path of bpf_get_stackid() into the slow path of the user space
    lookup.
    Also disallow lookup() from a bpf program, since it's useless and the
    program shouldn't be messing with the collected stack trace.
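
    A minimal sketch of that xchg dance on the syscall lookup path (buckets[],
    elem_size, value_size, freelist and struct bucket are placeholder names,
    not the kernel's actual identifiers):

    static int stack_lookup_copy(void *value, u32 id)
    {
            struct bucket *b, *old;
            u32 copied;

            b = xchg(&buckets[id], NULL);   /* hide bucket from recycling */
            if (!b)
                    return -ENOENT;

            copied = b->nr * elem_size;
            memcpy(value, b->data, copied);
            memset(value + copied, 0, value_size - copied); /* zero left-over */

            old = xchg(&buckets[id], b);    /* put the bucket back */
            if (old)        /* a new trace landed meanwhile; recycle it */
                    pcpu_freelist_push(&freelist, &old->fnode);
            return 0;
    }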

    Note that similar race between user space lookup and kernel side updates
    is also present in hashmap, but it's not a new race. bpf programs were
    always allowed to modify hash and array map elements while user space
    is copying them.

    Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Suggested-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
    bpf programs is not safe, since the following deadlock is possible:
    kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
    bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock).

    The following solutions were considered and some implemented, but
    eventually discarded
    - kmem_cache_create for every map
    - add recursion check to slow-path of slub
    - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
    - kmalloc via irq_work

    In the end, pre-allocation of all map elements turned out to be the simplest
    solution, and since the user is charged upfront for all the memory, such
    pre-allocation doesn't affect the user space visible behavior.

    Since it's impossible to tell whether a kprobe is triggered in a location
    that is safe from the kmalloc point of view, use pre-allocation by default
    and introduce a new BPF_F_NO_PREALLOC flag.
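
    From userspace the flag goes into the new map_flags attribute; a minimal
    sketch (create_hash_map() and the fallback define are illustrative, not
    from this patch):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    #ifndef BPF_F_NO_PREALLOC
    #define BPF_F_NO_PREALLOC (1U << 0)
    #endif

    /* Create a hash map; pass BPF_F_NO_PREALLOC to opt out of the new
     * default pre-allocation and keep the old kmalloc behavior.
     */
    static int create_hash_map(unsigned int max_entries, int prealloc)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_type = BPF_MAP_TYPE_HASH;
            attr.key_size = sizeof(int);
            attr.value_size = sizeof(long);
            attr.max_entries = max_entries;
            attr.map_flags = prealloc ? 0 : BPF_F_NO_PREALLOC;

            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }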

    While testing per-cpu hash maps it was discovered
    that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
    fails to allocate memory even when 90% of it is free.
    The pre-allocation of per-cpu hash elements solves this problem as well.

    It turned out that bpf_map_update() quickly followed by
    bpf_map_lookup()+bpf_map_delete() is a very common pattern used
    in many of the iovisor/bcc tools, so there is an additional benefit to
    pre-allocation, since such use cases are now much faster.

    Since all hash map elements are now pre-allocated we can remove the
    atomic increment of htab->count and save a few more cycles.

    Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
    large malloc/free done by users who don't have sufficient limits.

    Pre-allocation is done with vmalloc and alloc/free is done
    via percpu_freelist. Here are performance numbers for different
    pre-allocation algorithms that were implemented, but discarded
    in favor of percpu_freelist:

    1 cpu:
    pcpu_ida 2.1M
    pcpu_ida nolock 2.3M
    bt 2.4M
    kmalloc 1.8M
    hlist+spinlock 2.3M
    pcpu_freelist 2.6M

    4 cpu:
    pcpu_ida 1.5M
    pcpu_ida nolock 1.8M
    bt w/smp_align 1.7M
    bt no/smp_align 1.1M
    kmalloc 0.7M
    hlist+spinlock 0.2M
    pcpu_freelist 2.0M

    8 cpu:
    pcpu_ida 0.7M
    bt w/smp_align 0.8M
    kmalloc 0.4M
    pcpu_freelist 1.5M

    32 cpu:
    kmalloc 0.13M
    pcpu_freelist 0.49M

    pcpu_ida nolock is a modified percpu_ida algorithm without
    percpu_ida_cpu locks and without cross-cpu tag stealing.
    It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

    bt is a variant of block/blk-mq-tag.c simplified and customized
    for the bpf use case. bt w/smp_align uses a cache line for every 'long'
    (similar to blk-mq-tag). bt no/smp_align allocates the 'long'
    bitmasks contiguously to save memory. It's comparable to percpu_ida
    and in some cases faster, but slower than percpu_freelist.

    hlist+spinlock is the simplest free list with a single spinlock.
    As expected it has very bad scaling in SMP.

    kmalloc is the existing implementation, which is still available via the
    BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
    in the 8 cpu setup it's 3 times slower than pre-allocation with
    pcpu_freelist, but it saves memory, so in cases where map->max_entries
    can be large and the number of map updates/deletes per second is low,
    it may make sense to use it.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce simple percpu_freelist to keep single list of elements
    spread across per-cpu singly linked lists.

    /* push element into the list */
    void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

    /* pop element from the list */
    struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

    The object is pushed onto the current cpu's list.
    Pop first tries to get an object from the current cpu's list;
    if that is empty, it moves on to a neighbour cpu's list.

    For the bpf program usage pattern the collision rate is very low,
    since programs typically push and pop objects on the same cpu.
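
    A usage sketch: elements embed a pcpu_freelist_node, and push/pop just
    link and unlink that node (the element layout and helper names below are
    illustrative, not part of this patch):

    #include <linux/kernel.h>
    #include "percpu_freelist.h"    /* kernel/bpf/percpu_freelist.h */

    struct my_elem {
            struct pcpu_freelist_node fnode;
            char data[64];
    };

    static struct my_elem *elem_get(struct pcpu_freelist *fl)
    {
            struct pcpu_freelist_node *n = pcpu_freelist_pop(fl);

            return n ? container_of(n, struct my_elem, fnode) : NULL;
    }

    static void elem_put(struct pcpu_freelist *fl, struct my_elem *e)
    {
            pcpu_freelist_push(fl, &e->fnode);
    }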

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • if a kprobe is placed within the hash map update or delete helpers,
    which hold the bucket spin lock, and the triggered bpf program tries to
    grab the spinlock for the same bucket on the same cpu, it will
    deadlock.
    Fix it by extending the existing recursion prevention mechanism.

    Note, map_lookup and other tracing helpers don't have this problem,
    since they don't hold any locks and don't modify global data.
    bpf_trace_printk has its own recursion check and is ok as well.
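
    The recursion prevention idea is a per-cpu "program active" counter checked
    before running a program; a sketch of the pattern (identifiers are
    illustrative, not the exact kernel names):

    #include <linux/percpu.h>
    #include <linux/filter.h>

    static DEFINE_PER_CPU(int, bpf_active);

    /* Run @prog only if no bpf program is already active on this cpu, so a
     * kprobe fired from inside a map update cannot re-enter and try to take
     * the same bucket lock again.
     */
    static unsigned int run_prog_guarded(struct bpf_prog *prog, void *ctx)
    {
            unsigned int ret = 0;

            preempt_disable();
            if (likely(__this_cpu_inc_return(bpf_active) == 1))
                    ret = BPF_PROG_RUN(prog, ctx);
            __this_cpu_dec(bpf_active);
            preempt_enable();

            return ret;
    }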

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Michal Kubecek says:

    ====================
    ipv6: per netns FIB6 walkers and garbage collector

    Commit 2ac3ac8f86f2 ("ipv6: prevent fib6_run_gc() contention") reduced
    the risk of contention on FIB6 garbage collector lock on systems with
    many CPUs. However, one of our customers can still observe heavy
    contention on fib6_gc_lock which can even trigger the soft lockup
    detector.

    This is caused by garbage collector running in forced mode from a timer.
    While there is one timer per network namespace, the instances of
    fib6_run_gc() running from them are protected by one global spinlock so
    that only one garbage collector can run at any moment and other
    namespaces have to wait. As most relevant data structures are separated
    per netns, there is little reason for garbage collectors blocking each
    other.

    Similar problem exists for walkers: changes in one tree do not need to
    adjust (and block) walkers traversing FIB trees in other namespaces.

    This series separates both the walkers infrastructure and garbage
    collector so that they work independently in network namespaces.

    v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
    callers instead
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • One of our customers observed issues with FIB6 garbage collectors
    running in different network namespaces blocking each other, resulting
    in soft lockups (fib6_run_gc() initiated from timer runs always in
    forced mode).

    Now that FIB6 walkers are separated per namespace, there is no more need
    for instances of fib6_run_gc() in different namespaces blocking each
    other. There is still a call to icmp6_dst_gc() which operates on shared
    data but this function is protected by its own shared lock.

    Signed-off-by: Michal Kubecek
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • The IPv6 FIB data structures are separated per network namespace but
    there is still only one global walkers list and one global walker list
    lock. This means changes in one namespace unnecessarily interfere with
    walkers in other namespaces.

    Replace the global list with per-netns lists (and give each its own
    lock).

    Signed-off-by: Michal Kubecek
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • Global variable gc_args is only used in fib6_run_gc() and functions
    called from it. As fib6_run_gc() makes sure there is at most one
    instance of fib6_clean_all() running at any moment, we can replace
    gc_args with a local variable which will be needed once multiple
    instances (per netns) of garbage collector are allowed.

    Signed-off-by: Michal Kubecek
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • Michael Chan says:

    ====================
    bnxt_en: Updates for net-next.

    Updates to support autoneg for all supported speeds, add PF port statistics,
    and Advanced Error Reporting.

    v2: Fixed patch 3 to not use parentheses on function return.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add pci_error_handler callbacks to support PCIe advanced error
    recovery.

    Signed-off-by: Satish Baddipadige
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Satish Baddipadige
     
  • Include the more useful port statistics in ethtool -S for the PF device.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Include some of the port error counters (e.g. crc) in ->ndo_get_stats64()
    for the PF device.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Gather periodic port statistics if the device is a PF and the link is up.
    This is triggered in bnxt_timer() every second to request firmware to DMA
    the counters.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Allow all autoneg speeds supported by firmware to be advertised. If
    the advertising parameter is 0, then all supported speeds will be
    advertised.

    Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all
    supported speeds can be advertised.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • The supported bits and advertising bits in ethtool have the same
    definitions. The same is true for the firmware bits. So use the
    common function to handle the conversion for both supported and
    advertising bits.

    v2: Don't use parentheses on function return.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Report the actual pause settings to ETHTOOL_GPAUSEPARAM to let ethtool
    resolve the pause settings in use.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Include the conversion of pause bits and add one extra call layer so
    that the same refactored function can be reused to get the link partner
    advertisement bits.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • This fix is for dsmark, similar to commit 3557619f0f6f7496ed453d4825e249
    ("net_sched: prio: use qdisc_dequeue_peeked"), and makes use of
    qdisc_dequeue_peeked() instead of a direct dequeue() call.

    The first time, wrr peeks dsmark, which will then peek into sfq.
    sfq dequeues an skb and it's stored in sch->gso_skb.
    The next time, wrr tries to dequeue from dsmark, which calls sfq dequeue
    directly. This results in skipping the previously peeked skb.

    So change dsmark dequeue to call qdisc_dequeue_peeked() instead, so the
    peeked skb is used if it exists.
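
    The relevant change in dsmark_dequeue() is roughly the following one-liner
    (a sketch based on my reading of sch_dsmark.c, not a verbatim diff):

    /* before: calls the inner qdisc's dequeue directly and bypasses the skb
     * it already handed out via peek (stashed in gso_skb)
     */
    skb = p->q->ops->dequeue(p->q);

    /* after: consume the previously peeked skb first, if one exists */
    skb = qdisc_dequeue_peeked(p->q);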

    Signed-off-by: Kyeong Yoo
    Signed-off-by: David S. Miller

    Kyeong Yoo
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next tree,
    they are:

    1) Remove useless debug message when deleting IPVS service, from
    Yannick Brosseau.

    2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
    several spots of the IPVS code, from Arnd Bergmann.

    3) Add prandom_u32 support to nft_meta, from Florian Westphal.

    4) Remove unused variable in xt_osf, from Sudip Mukherjee.

    5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
    since fixing af_packet defragmentation issues, from Joe Stringer.

    6) On-demand hook registration for iptables from netns. Instead of
    registering the hooks in every available netns whenever we need
    one of the supported tables, we register them only in the specific netns
    that needs them; patchset from Florian Westphal.

    7) Add missing port range selection to nf_tables masquerading support.

    BTW, just for the record, there is a typo in the description of
    5f6c253ebe93b0 ("netfilter: bridge: register hooks only when bridge
    interface is added") that refers to the cluster match as deprecated, but
    it is actually the CLUSTERIP target (which registers hooks
    unconditionally) that is scheduled for removal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Daniel Borkmann says:

    ====================
    BPF updates

    A couple of misc updates to BPF. Among others, this series adds
    bpf_csum_diff() to be used with L3 csums, allows managing
    tunnel options for collect metadata mode, and enables the ipv6
    traffic class for collect metadata in vxlan specifically (geneve
    already supports it). For more details, please see the individual
    patches.

    The series requires net to be merged into net-next first to
    avoid any further pending merge conflicts.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller