11 Mar, 2018

1 commit

  • [ upstream commit a316338cb71a3260201490e615f2f6d5c0d8fb2c ]

    trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
    attr->map_flags, since it does not support preallocation yet. We
    check the flag, but we never copy it into trie->map.map_flags, which
    is later on exposed via fdinfo and used by loaders such as iproute2.
    The latter uses this in bpf_map_selfcheck_pinned() to test whether a
    pinned map has the same spec as the one from the BPF object file and,
    if not, bails out. That is currently always the case for lpm, since
    it exposes 0 as flags.

    Also copy over the flags in array_map_alloc() and stack_map_alloc().
    They always have to be 0 right now, but we should make sure not to
    miss copying them over at a later point in time, when we add actual
    flags for them to use.
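
    For illustration, a minimal sketch of the kind of attribute copy this
    fix is about (bpf_map_copy_attrs() is a hypothetical helper used here
    only to keep the sketch short; the actual fix is a plain assignment in
    trie_alloc(), array_map_alloc() and stack_map_alloc()):

      /* Copy the user-visible map attributes, including map_flags, into
       * the embedded struct bpf_map so that fdinfo and pinned-map checks
       * such as iproute2's bpf_map_selfcheck_pinned() see the real flags.
       */
      static void bpf_map_copy_attrs(struct bpf_map *map,
                                     const union bpf_attr *attr)
      {
              map->map_type    = attr->map_type;
              map->key_size    = attr->key_size;
              map->value_size  = attr->value_size;
              map->max_entries = attr->max_entries;
              map->map_flags   = attr->map_flags; /* was left at 0 for lpm */
      }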

    Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
    Reported-by: Jarno Rajahalme
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

05 Jul, 2017

1 commit

  • [ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ]

    This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
    that are to be used for map allocations. Using kmalloc() for very large
    allocations can cause excessive work within the page allocator, so i) fall
    back earlier to vmalloc() when the attempt is considered costly anyway,
    and even more importantly ii) don't trigger OOM killer with any of the
    allocators.

    Since this is based on a user space request, for example, when creating
    maps with element pre-allocation, we really want such requests to fail
    instead of killing other user space processes.

    Also, don't spam the kernel log with warnings should any of the allocations
    fail under pressure. Given that, we can make backend selection in
    bpf_map_area_alloc() generic, and convert all maps over to use this API
    for spots with potentially large allocation requests.

    Note that replacing the one kmalloc_array() is fine, as the overflow
    checks happen earlier in htab_map_alloc(), since it must also protect
    the multiplication for the vmalloc() path should kmalloc_array() fail.
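
    For illustration, a minimal sketch of such a pair of helpers under the
    policy described above (an approximation of the idea, not necessarily
    the exact upstream implementation):

      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      void *bpf_map_area_alloc(size_t size)
      {
              /* No retries, no OOM killer, no allocation-failure warnings. */
              const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
              void *area;

              /* Only bother kmalloc() with non-costly sizes. */
              if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                      area = kmalloc(size, GFP_USER | flags);
                      if (area != NULL)
                              return area;
              }

              /* Fall back to vmalloc() for large or failed requests. */
              return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
                               PAGE_KERNEL);
      }

      void bpf_map_area_free(void *area)
      {
              kvfree(area); /* frees both kmalloc()ed and vmalloc()ed memory */
      }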

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

10 Sep, 2016

1 commit

    This work adds BPF_CALL_() macros and converts all the eBPF helper functions
    to use them, in a similar fashion to the SYSCALL_DEFINE() macros that are
    used today. The motivation for this is to hide all the register handling
    and all necessary casts from the user, so that it is done automatically in
    the background when adding a BPF_CALL_() call.

    This makes current helpers easier to review, makes it easier to write
    future helpers, avoids getting the casting mess wrong, and allows for
    extending all helpers at once (e.g. build time checks, etc). It also
    makes it easier to detect in code reviews when unused registers get
    instrumented in the code by accident, breaking compatibility with
    existing programs.

    BPF_CALL_() internals are quite similar to the SYSCALL_DEFINE() ones, with
    some fundamental differences; for example, when generating the actual
    helper function that carries all u64 regs, we need to fill in the unused
    regs, so that we always end up with 5 u64 regs as arguments.

    I reviewed the generated BPF_CALL_0() through BPF_CALL_5() variants in the
    .i results and they all look as expected. No sparse issues were spotted.
    We also let this sit for a
    few days with Fengguang's kbuild test robot, and there were no issues seen. On
    s390, it barked on the "uses dynamic stack allocation" notice, which is an old
    one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
    to the call wrapper, just telling that the perf raw record/frag sits on stack
    (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
    and they were fine as well. All eBPF helpers are now converted to use these
    macros, getting rid of a good chunk of all the raw castings.
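
    For illustration, this is roughly what a converted helper looks like
    (a sketch based on the bpf_map_lookup_elem() helper); the macro expands
    into the u64-register wrapper and the casts that previously had to be
    written by hand:

      #include <linux/bpf.h>
      #include <linux/filter.h>
      #include <linux/rcupdate.h>

      /* Declare only the real arguments; BPF_CALL_2() generates the wrapper
       * that maps r1/r2 onto them and pads the remaining, unused regs.
       */
      BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
      {
              WARN_ON_ONCE(!rcu_read_lock_held());
              return (unsigned long) map->ops->map_lookup_elem(map, key);
      }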

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

08 Jun, 2016

1 commit


30 May, 2016

1 commit

    In addition to being able to control the system wide maximum depth via
    /proc/sys/kernel/perf_event_max_stack, we are now able to ask for
    different depths per event, using perf_event_attr.sample_max_stack for
    that.

    This uses a u16 hole at the end of perf_event_attr, which applies when
    perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set: if
    sample_max_stack is zero, perf_event_max_stack is used; otherwise the
    value is bounds checked under callchain_mutex.
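
    A user space sketch of how the field can be used (the event type, config
    and sample period here are only illustrative; sample_max_stack requires
    headers from a kernel carrying this change):

      #include <linux/perf_event.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static int open_cycles_event_with_deep_callchain(void)
      {
              struct perf_event_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.size = sizeof(attr);
              attr.type = PERF_TYPE_HARDWARE;
              attr.config = PERF_COUNT_HW_CPU_CYCLES;
              attr.sample_period = 100000;
              attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
              /* Per-event depth; 0 falls back to perf_event_max_stack. */
              attr.sample_max_stack = 512;

              /* current task, any CPU, no group leader, no flags */
              return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
      }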

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

1 commit

  • This makes perf_callchain_{user,kernel}() receive the max stack
    as context for the perf_callchain_entry, instead of accessing
    the global sysctl_perf_event_max_stack.
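
    A simplified sketch of the resulting interface (the struct layout below
    is abbreviated for illustration): the entry and its per-call limit travel
    together in a context, so arch code no longer reads the global sysctl.

      struct perf_callchain_entry_ctx {
              struct perf_callchain_entry *entry;
              u32                          max_stack; /* per-call limit */
              u32                          nr;        /* entries filled so far */
      };

      /* Arch implementations bound the stack walk by ctx->max_stack
       * instead of sysctl_perf_event_max_stack.
       */
      void perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx,
                                 struct pt_regs *regs);
      void perf_callchain_user(struct perf_callchain_entry_ctx *ctx,
                               struct pt_regs *regs);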

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

27 Apr, 2016

1 commit

    The default remains 127, which is good for most cases, and is not even
    hit most of the time, but for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy and ruby.

    And in some workloads putting a _lower_ cap on this may make sense. A
    per-event limit still needs to be put in place, though.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    This works because we only allocate the callchain percpu data structures
    when there is a user, which is what allows the max to be changed easily;
    it's just a matter of having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

08 Apr, 2016

1 commit


09 Mar, 2016

2 commits

  • It was observed that calling bpf_get_stackid() from a kprobe inside
    slub or from spin_unlock causes similar deadlock as with hashmap,
    therefore convert stackmap to use pre-allocated memory.

    call_rcu is no longer a feasible mechanism, since delayed freeing
    causes bpf_get_stackid() to fail unpredictably when the number of actual
    stacks is significantly less than the user-requested max_entries.
    Since elements are no longer freed into slub, we can push elements into
    the freelist immediately and let them be recycled.
    However, the very unlikely race between user space map_lookup() and
    program-side recycling is possible:
    cpu0                                  cpu1
    ----                                  ----
    user does lookup(stackidX)
    starts copying ips into buffer
                                          delete(stackidX)
                                          calls bpf_get_stackid()
                                          which recycles the element and
                                          overwrites with new stack trace

    To avoid user space seeing a partial stack trace consisting of two
    merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
    to preserve consistent stack trace delivery to user space.
    Now we can move memset(,0) of left-over element value from critical
    path of bpf_get_stackid() into slow-path of user space lookup.
    Also disallow lookup() from bpf program, since it's useless and
    program shouldn't be messing with collected stack trace.

    Note that similar race between user space lookup and kernel side updates
    is also present in hashmap, but it's not a new race. bpf programs were
    always allowed to modify hash and array map elements while user space
    is copying them.
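
    A minimal sketch of the lookup-side xchg dance described above
    (simplified; the field names buckets, nr and ip follow stackmap's
    layout, but error handling and sizing are abbreviated):

      static int stack_map_copy(struct bpf_stack_map *smap, u32 id,
                                void *value)
      {
              struct stack_map_bucket *bucket;
              u32 trace_len;

              /* Take the bucket out of the table: a concurrent
               * bpf_get_stackid() now sees an empty slot and cannot
               * recycle or overwrite it mid-copy.
               */
              bucket = xchg(&smap->buckets[id], NULL);
              if (!bucket)
                      return -ENOENT;

              trace_len = bucket->nr * sizeof(u64);
              memcpy(value, bucket->ip, trace_len);
              /* Zero the left-over bytes here, off the bpf_get_stackid()
               * fast path.
               */
              memset(value + trace_len, 0, smap->map.value_size - trace_len);

              /* Put the bucket back so future lookups still find it. */
              xchg(&smap->buckets[id], bucket);
              return 0;
      }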

    Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Suggested-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

20 Feb, 2016

1 commit

    Add a new map type to store stack traces and a corresponding helper:

    bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid,
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error

    stackid is a 32-bit integer handle that can be further combined with
    other data (including other stackids) and used as a key into maps.

    Userspace will access the stackmap using the standard lookup/delete
    syscall commands to retrieve the full stack trace for a given stackid.
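
    A hedged sketch of how a program could use the new map type and helper
    (the map name, attach point and include paths are illustrative, in the
    style of the samples/bpf programs):

      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"

      /* 1024 slots, each holding up to 127 (PERF_MAX_STACK_DEPTH)
       * instruction pointers of a collected stack trace.
       */
      struct bpf_map_def SEC("maps") stack_traces = {
              .type        = BPF_MAP_TYPE_STACK_TRACE,
              .key_size    = sizeof(__u32),
              .value_size  = 127 * sizeof(__u64),
              .max_entries = 1024,
      };

      SEC("kprobe/kfree_skb")
      int trace_stacks(struct pt_regs *ctx)
      {
              /* Walk the kernel stack, skip no frames; BPF_F_USER_STACK
               * (bit 8) would collect the user stack instead.
               */
              int stackid = bpf_get_stackid(ctx, &stack_traces, 0);

              if (stackid >= 0) {
                      /* stackid can now be used as a key into other maps,
                       * e.g. to count how often this stack was hit.
                       */
              }
              return 0;
      }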

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov