10 Mar, 2016
10 commits
-
Implement kcm_sendpage. Set sendpage to kcm_sendpage in both
dgram and seqpacket ops.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
Implement kcm_splice_read. This is supported only for seqpacket.
Add kcm_seqpacket_ops and set splice read to kcm_splice_read.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.
The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
This module implements the Kernel Connection Multiplexor.
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.
For more information see the included Documentation/networking/kcm.txt
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
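To make the datagram-style usage above concrete, here is a minimal userspace sketch (a sketch only, assuming the AF_KCM constants and struct kcm_attach from the linux/kcm.h header this series introduces; error handling trimmed, and bpf_prog_fd is an already-loaded BPF program that returns each message's length):

#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/kcm.h>

/* Create a KCM socket and attach an already-connected TCP socket to it.
 * Afterwards send()/recv() on the returned fd operate on whole
 * application-protocol messages rather than a byte stream. */
static int kcm_attach_tcp(int tcpfd, int bpf_prog_fd)
{
	struct kcm_attach attach = {
		.fd = tcpfd,		/* connected TCP socket */
		.bpf_fd = bpf_prog_fd,	/* parses the message length */
	};
	int kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);

	if (kcmfd < 0)
		return -1;
	if (ioctl(kcmfd, SIOCKCMATTACH, &attach) < 0)
		return -1;
	return kcmfd;
}
-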
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
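Conceptually the helper is small; the sketch below uses a hypothetical name and omits the urgent-data and FIN corner cases the real SIOCINQ logic handles:

/* Readable bytes on a TCP socket: the gap between what TCP has
 * received in sequence (rcv_nxt) and what the application has already
 * copied out (copied_seq). Caller holds the socket lock. */
static inline int tcp_bytes_avail(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
		return 0;	/* not established yet */

	return tp->rcv_nxt - tp->copied_seq;
}
-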
Add walking of fragments in __skb_splice_bits.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.
MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
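As an illustration of the flag semantics (MSG_BATCH comes from the uapi headers added in this series, and the fd is assumed to be a socket type whose implementation batches transmission, e.g. a KCM socket):

#include <sys/socket.h>

/* Send a burst of prepared msghdrs, marking every message except the
 * last with MSG_BATCH so the implementation may defer transmission. */
static int send_burst(int fd, struct msghdr *msgs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		int flags = (i < n - 1) ? MSG_BATCH : 0;

		if (sendmsg(fd, &msgs[i], flags) < 0)
			return -1;
	}
	return 0;
}
-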
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
Export it for cases where we want to create sockets by hand.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
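A small usage sketch, assuming the helper landed as list_next_or_null_rcu(); the struct and caller below are purely illustrative:

#include <linux/rculist.h>
#include <linux/printk.h>

struct item {
	struct list_head node;
	int value;
};

/* Advance one step along an RCU-protected list; a NULL result means
 * cur was the last element, instead of wrapping back to the head. */
static void print_next(struct list_head *head, struct item *cur)
{
	struct item *next;

	rcu_read_lock();
	next = list_next_or_null_rcu(head, &cur->node, struct item, node);
	if (next)
		pr_info("next value: %d\n", next->value);
	rcu_read_unlock();
}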
09 Mar, 2016
30 commits
-
performance tests for hash map and per-cpu hash map
with and without pre-allocation.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
increase stress by also calling bpf_get_stackid() from
various *spin* functions.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
this test calls bpf programs from different contexts:
from inside of slub, from rcu, from pretty much everywhere,
since it kprobes all spin_lock functions.
It stresses the bpf hash and percpu map pre-allocation,
deallocation logic and call_rcu mechanisms.
User space part adds more stress by walking and deleting map elements.
Note that due to the nature of bpf_load.c, the earlier kprobe+bpf programs are
already active while the loader loads new programs, creates new kprobes and
attaches them.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.
This avoids an ifdef kludge inside filter.c; besides, vxlan and geneve
themselves, which this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.
Fixes: 14ca0751c96f ("bpf: support for access to tunnel options")
Reported-by: Fengguang Wu
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
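The stubs are of this shape (a sketch; the prototypes mirror the existing CONFIG_INET helpers and should be treated as assumptions, not the exact hunk):

#ifndef CONFIG_INET
/* Empty inline stubs so callers still compile when CONFIG_INET is not
 * set; TUNNEL_OPTIONS_PRESENT is never set in flags in that case, so
 * these never see meaningful data. */
static inline void ip_tunnel_info_opts_get(void *to,
					   const struct ip_tunnel_info *info)
{
}

static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
					   const void *from, int len)
{
}
#endif /* CONFIG_INET */
-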
Alexei Starovoitov says:
====================
bpf: map pre-alloc
v1->v2:
. fix few issues spotted by Daniel
. converted stackmap into pre-allocation as well
. added a workaround for lockdep false positive
. added pcpu_freelist_populate to be used by hashmap and stackmap
This patch set switches bpf hash map to use pre-allocation by default
and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
where full map pre-allocation is too memory expensive.
Some time back Daniel Wagner reported crashes when bpf hash map is
used to compute time intervals between preempt_disable->preempt_enable
and recently Tom Zanussi reported a deadlock in iovisor/bcc/funccount
tool if it's used to count the number of invocations of kernel
'*spin*' functions. Both problems are due to the recursive use of
slub and can only be solved by pre-allocating all map elements.
A lot of different solutions were considered. Many were implemented,
but in the end pre-allocation seems to be the only feasible answer.
As far as pre-allocation goes it also was implemented 4 different ways:
- simple free-list with single lock
- percpu_ida with optimizations
- blk-mq-tag variant customized for bpf use case
- percpu_freelist
For bpf style of alloc/free patterns percpu_freelist is the best
and implemented in this patch set.
Detailed performance numbers in patch 3.
Patch 2 introduces percpu_freelist
Patch 1 fixes simple deadlocks due to missing recursion checks
Patch 5: converts stackmap to pre-allocation
Patches 6-9: prepare test infra
Patch 10: stress test for hash map infra. It attaches to spin_lock
functions and bpf_map_update/delete are called from different contexts
Patch 11: stress for bpf_get_stackid
Patch 12: map performance test
Reported-by: Daniel Wagner
Reported-by: Tom Zanussi
====================
Signed-off-by: David S. Miller
-
extend test coverage to include pre-allocated and run-time allocated maps
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
Note that the old loader is compatible with the new kernel;
map_flags are optional.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
move ksym search from offwaketime into library to be reused
in other tests.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
map creation is typically the first one to fail when rlimits are
too low, there is not enough memory, etc.
Make this failure scenario more verbose.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes a similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.
The call_rcu is no longer a feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However, the very unlikely race between user space map_lookup() and
program-side recycling is possible:
cpu0                                    cpu1
----                                    ----
user does lookup(stackidX)
starts copying ips into buffer
                                        delete(stackidX)
                                        calls bpf_get_stackid()
                                        which recycles the element and
                                        overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack traces.
Note that a similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.
Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller
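A conceptual sketch of the lookup-side protection described above (identifiers such as smap, buckets and trace_len are illustrative, not the exact merged code):

/* Take the bucket out of the table with xchg() so a concurrent
 * bpf_get_stackid() cannot recycle and overwrite it mid-copy,
 * then put it back once the copy is done. */
bucket = xchg(&smap->buckets[id], NULL);
if (!bucket)
	return -ENOENT;

trace_len = bucket->nr * sizeof(u64);
memcpy(value, bucket->ip, trace_len);
/* zero the left-over bytes here, in the user space slow path,
 * instead of in the bpf_get_stackid() fast path */
memset(value + trace_len, 0, map->value_size - trace_len);

xchg(&smap->buckets[id], bucket);
return 0;
-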
Suggested-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded:
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
In the end, pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether a kprobe is triggered in a safe
location from a kmalloc point of view, use pre-allocation by default
and introduce a new BPF_F_NO_PREALLOC flag.
While testing per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.
Since all hash map elements are now pre-allocated, we can remove
the atomic increment of htab->count and save a few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida         2.1M
pcpu_ida nolock  2.3M
bt               2.4M
kmalloc          1.8M
hlist+spinlock   2.3M
pcpu_freelist    2.6M

4 cpu:
pcpu_ida         1.5M
pcpu_ida nolock  1.8M
bt w/smp_align   1.7M
bt no/smp_align  1.1M
kmalloc          0.7M
hlist+spinlock   0.2M
pcpu_freelist    2.0M

8 cpu:
pcpu_ida         0.7M
bt w/smp_align   0.8M
kmalloc          0.4M
pcpu_freelist    1.5M

32 cpu:
kmalloc          0.13M
pcpu_freelist    0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than the existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with a single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is the existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller
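From userspace the opt-out is just a flag at map creation time; a minimal sketch (assumes headers updated for this series, error handling omitted):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Create a hash map that keeps the old run-time allocation behavior by
 * passing BPF_F_NO_PREALLOC in the new map_flags field. */
static int create_no_prealloc_map(void)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = sizeof(__u32);
	attr.value_size = sizeof(__u64);
	attr.max_entries = 4096;
	attr.map_flags = BPF_F_NO_PREALLOC;	/* opt out of pre-allocation */

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
-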
Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.

/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

The object is pushed to the current cpu list.
Pop first tries to get the object from the current cpu list;
if it's empty, it goes to the neighbour cpu list.
For the bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller
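A usage sketch built on the two calls quoted above (the init helper and element layout are assumptions for illustration):

/* Each pre-allocated element embeds a freelist node; everything is
 * pushed once at setup time and then recycled via pop/push. */
struct my_elem {
	struct pcpu_freelist_node fnode;
	char data[64];
};

static int setup_elems(struct pcpu_freelist *fl, struct my_elem *elems,
		       unsigned int n)
{
	unsigned int i;

	if (pcpu_freelist_init(fl))	/* assumed init helper */
		return -ENOMEM;
	for (i = 0; i < n; i++)
		pcpu_freelist_push(fl, &elems[i].fnode);
	return 0;
}

static struct my_elem *get_elem(struct pcpu_freelist *fl)
{
	struct pcpu_freelist_node *node = pcpu_freelist_pop(fl);

	/* NULL means all pre-allocated elements are currently in use */
	return node ? container_of(node, struct my_elem, fnode) : NULL;
}
-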
If a kprobe is placed within the update or delete hash map helpers
that hold the bucket spin lock, and the triggered bpf program tries to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending the existing recursion prevention mechanism.
Note that map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursion check and is fine as well.
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller
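Conceptually the guard looks like the sketch below (the per-cpu counter name and call site are assumptions; the point is to refuse a nested update on the same cpu instead of deadlocking):

/* Refuse to enter the map update path if a bpf program is already
 * active on this cpu, e.g. because we were kprobed from inside the
 * very helper that holds the bucket lock. */
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
	err = -EBUSY;
	goto out;
}
err = map->ops->map_update_elem(map, key, value, flags);
out:
	__this_cpu_dec(bpf_prog_active);
	preempt_enable();
-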
Michal Kubecek says:
====================
ipv6: per netns FIB6 walkers and garbage collector
Commit 2ac3ac8f86f2 ("ipv6: prevent fib6_run_gc() contention") reduced
the risk of contention on FIB6 garbage collector lock on systems with
many CPUs. However, one of our customers can still observe heavy
contention on fib6_gc_lock which can even trigger the soft lockup
detector.
This is caused by the garbage collector running in forced mode from a timer.
While there is one timer per network namespace, the instances of
fib6_run_gc() running from them are protected by one global spinlock so
that only one garbage collector can run at any moment and other
namespaces have to wait. As most relevant data structures are separated
per netns, there is little reason for garbage collectors blocking each
other.
A similar problem exists for walkers: changes in one tree do not need to
adjust (and block) walkers traversing FIB trees in other namespaces.
This series separates both the walkers infrastructure and garbage
collector so that they work independently in network namespaces.
v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
callers instead
====================
Signed-off-by: David S. Miller
-
One of our customers observed issues with FIB6 garbage collectors
running in different network namespaces blocking each other, resulting
in soft lockups (fib6_run_gc() initiated from timer runs always in
forced mode).
Now that FIB6 walkers are separated per namespace, there is no more need
for instances of fib6_run_gc() in different namespaces blocking each
other. There is still a call to icmp6_dst_gc() which operates on shared
data but this function is protected by its own shared lock.
Signed-off-by: Michal Kubecek
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller -
The IPv6 FIB data structures are separated per network namespace but
there is still only one global walkers list and one global walker list
lock. This means changes in one namespace unnecessarily interfere with
walkers in other namespaces.
Replace the global list with per-netns lists (and give each its own
lock).
Signed-off-by: Michal Kubecek
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller -
Global variable gc_args is only used in fib6_run_gc() and functions
called from it. As fib6_run_gc() makes sure there is at most one
instance of fib6_clean_all() running at any moment, we can replace
gc_args with a local variable which will be needed once multiple
instances (per netns) of the garbage collector are allowed.
Signed-off-by: Michal Kubecek
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller -
Michael Chan says:
====================
bnxt_en: Updates for net-next.
Updates to support autoneg for all supported speeds, add PF port statistics,
and Advanced Error Reporting.
v2: Fixed patch 3 to not use parentheses on function return.
====================
Signed-off-by: David S. Miller
-
Add pci_error_handler callbacks to support PCIe advanced error
recovery.
Signed-off-by: Satish Baddipadige
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
Include the more useful port statistics in ethtool -S for the PF device.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
Include some of the port error counters (e.g. crc) in ->ndo_get_stats64()
for the PF device.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
Gather periodic port statistics if the device is PF and link is up. This
is triggered in bnxt_timer() every one second to request firmware to DMA
the counters.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
Allow all autoneg speeds supported by firmware to be advertised. If
the advertising parameter is 0, then all supported speeds will be
advertised.
Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all
supported speeds can be advertised.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
The supported bits and advertising bits in ethtool have the same
definitions. The same is true for the firmware bits. So use the
common function to handle the conversion for both supported and
advertising bits.
v2: Don't use parentheses on function return.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
And report actual pause settings to ETHTOOL_GPAUSEPARAM to let ethtool
resolve the actual pause settings.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
Include the conversion of pause bits and add one extra call layer so
that the same refactored function can be reused to get the link partner
advertisement bits.
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller -
This fix is for dsmark similar to commit 3557619f0f6f7496ed453d4825e249
("net_sched: prio: use qdisc_dequeue_peeked")
and makes use of qdisc_dequeue_peeked() instead of a direct dequeue() call.
The first time, wrr peeks dsmark, which will then peek into sfq.
sfq dequeues an skb and it's stored in sch->gso_skb.
Next time, wrr tries to dequeue from dsmark, which will call sfq dequeue
directly. This results in skipping the previously peeked skb.
So change dsmark's dequeue to call qdisc_dequeue_peeked() instead, to use
the peeked skb if it exists.
Signed-off-by: Kyeong Yoo
Signed-off-by: David S. Miller
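The change itself is essentially a one-line substitution in dsmark's dequeue path; a sketch (abbreviated, identifiers follow net/sched/sch_dsmark.c but treat the surrounding context as illustrative):

static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
{
	struct dsmark_qdisc_data *p = qdisc_priv(sch);
	struct sk_buff *skb;

	/* was a direct dequeue call on the inner qdisc; the helper first
	 * returns any skb a previous peek left in the inner qdisc's
	 * gso_skb slot, so the peeked packet is not skipped */
	skb = qdisc_dequeue_peeked(p->q);
	if (skb == NULL)
		return NULL;

	/* ... DSCP re-marking of the skb continues as before ... */
	return skb;
}
-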
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next tree,
they are:
1) Remove useless debug message when deleting IPVS service, from
Yannick Brosseau.
2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
several spots of the IPVS code, from Arnd Bergmann.
3) Add prandom_u32 support to nft_meta, from Florian Westphal.
4) Remove unused variable in xt_osf, from Sudip Mukherjee.
5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
since fixing af_packet defragmentation issues, from Joe Stringer.
6) On-demand hook registration for iptables from netns. Instead of
registering the hooks for every available netns whenever we need
one of the support tables, we register this on the specific netns
that needs it, patchset from Florian Westphal.
7) Add missing port range selection to nf_tables masquerading support.
BTW, just for the record, there is a typo in the description of
5f6c253ebe93b0 ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
unconditionally) that is scheduled for removal.
====================
Signed-off-by: David S. Miller
-
Daniel Borkmann says:
====================
BPF updates
A couple of misc updates to BPF. Among others, this series adds
bpf_csum_diff() to be used with L3 csums, allows for managing
tunnel options for collect meta data mode, and enabling ipv6
traffic class for collect meta data in vxlan specifically (geneve
already supports it). For more details, please see individual
patches.
The series requires net to be merged into net-next first to
avoid any further pending merge conflicts.
====================
Signed-off-by: David S. Miller