25 Jul, 2018

1 commit


29 May, 2018

1 commit

  • The failover module provides a generic interface for paravirtual drivers
    to register a netdev and a set of ops with a failover instance. The ops
    are used as event handlers that get called to handle netdev register/
    unregister/link change/name change events on slave pci ethernet devices
    with the same mac address as the failover netdev.

    This enables paravirtual drivers to use a VF as an accelerated low latency
    datapath. It also allows migration of VMs with direct attached VFs by
    failing over to the paravirtual datapath when the VF is unplugged.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

24 May, 2018

1 commit

  • bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
    and user mode helper code that is embedded into bpfilter.ko

    The steps to build bpfilter.ko are the following:
    - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
    - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
    is converted into bpfilter_umh.o object file
    with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
    Example:
    $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
    0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
    0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
    0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
    - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

    bpfilter_kern.c is a normal kernel module code that calls
    the fork_usermode_blob() helper to execute part of its own data
    as a user mode process.

    Notice that _binary_net_bpfilter_bpfilter_umh_start - end
    is placed into .init.rodata section, so it's freed as soon as __init
    function of bpfilter.ko is finished.
    As part of __init the bpfilter.ko does first request/reply action
    via two unix pipe provided by fork_usermode_blob() helper to
    make sure that umh is healthy. If not it will kill it via pid.

    Later bpfilter_process_sockopt() will be called from bpfilter hooks
    in get/setsockopt() to pass iptable commands into umh via bpfilter.ko

    If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
    kill umh as well.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 May, 2018

1 commit


04 May, 2018

1 commit


01 May, 2018

1 commit


17 Apr, 2018

1 commit

  • Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
    pages on DMA-TX completion time, which have good cross CPU
    performance, given DMA-TX completion time can happen on a remote CPU.

    Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
    Adapted page_pool code to not depend the page allocator and
    integration into struct page. The DMA mapping feature is kept,
    even-though it will not be activated/used in this patchset.

    [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

    V2: Adjustments requested by Tariq
    - Changed page_pool_create return codes, don't return NULL, only
    ERR_PTR, as this simplifies err handling in drivers.

    V4: many small improvements and cleanups
    - Add DOC comment section, that can be used by kernel-doc
    - Improve fallback mode, to work better with refcnt based recycling
    e.g. remove a WARN as pointed out by Tariq
    e.g. quicker fallback if ptr_ring is empty.

    V5: Fixed SPDX license as pointed out by Alexei

    V6: Adjustments requested by Eric Dumazet
    - Adjust ____cacheline_aligned_in_smp usage/placement
    - Move rcu_head in struct page_pool
    - Free pages quicker on destroy, minimize resources delayed an RCU period
    - Remove code for forward/backward compat ABI interface

    V8: Issues found by kbuild test robot
    - Address sparse should be static warnings
    - Only compile+link when a driver use/select page_pool,
    mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Feb, 2018

1 commit

  • Pull staging/IIO updates from Greg KH:
    "Here is the big Staging and IIO driver patches for 4.16-rc1.

    There is the normal amount of new IIO drivers added, like all
    releases.

    The networking IPX and the ncpfs filesystem are moved into the staging
    tree, as they are on their way out of the kernel due to lack of use
    anymore.

    The visorbus subsystem finall has started moving out of the staging
    tree to the "real" part of the kernel, and the most and fsl-mc
    codebases are almost ready to move out, that will probably happen for
    4.17-rc1 if all goes well.

    Other than that, there is a bunch of license header cleanups in the
    tree, along with the normal amount of coding style churn that we all
    know and love for this codebase. I also got frustrated at the
    Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
    huge chunks of it that were never even being used.

    Full details of everything is in the shortlog.

    All of these patches have been in linux-next for a while with no
    reported issues"

    * tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
    staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
    staging: rtl8723bs: remove a couple of redundant initializations
    staging: comedi: reformat lines to 80 chars or less
    staging: lustre: separate a connection destroy from free struct kib_conn
    Staging: rtl8723bs: Use !x instead of NULL comparison
    Staging: rtl8723bs: Remove dead code
    Staging: rtl8723bs: Change names to conform to the kernel code
    staging: ccree: Fix missing blank line after declaration
    staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
    staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
    staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
    staging: comedi: dt2811: remove redundant initialization of 'ns'
    staging: wilc1000: fix alignments to match open parenthesis
    staging: wilc1000: removed unnecessary defined enums typedef
    staging: wilc1000: remove unnecessary use of parentheses
    staging: rtl8192u: remove redundant initialization of 'timeout'
    staging: sm750fb: fix CamelCase for dispSet var
    staging: lustre: lnet/selftest: fix compile error on UP build
    staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
    staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
    ...

    Linus Torvalds
     

09 Jan, 2018

1 commit


03 Jan, 2018

1 commit


28 Nov, 2017

1 commit


04 Sep, 2017

1 commit


30 Aug, 2017

1 commit

  • Add a new nsh/ directory. It currently holds only GSO functions but more
    will come: in particular, code shared by openvswitch and tc to manipulate
    NSH headers.

    For now, assume there's no hardware support for NSH segmentation. We can
    always introduce netdev->nsh_features later.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

29 Aug, 2017

2 commits

  • It's time to get rid of IRDA. It's long been broken, and no one seems
    to use it anymore. So move it to staging and after a while, we can
    delete it from there.

    To start, move the network irda core from net/irda to
    drivers/staging/irda/net/

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: David S. Miller

    Greg Kroah-Hartman
     
  • SOCKMAP uses strparser code (compiled with Kconfig option
    CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
    config option set sockmap wont be compiled. However, at the moment
    the only way to pull in the strparser code is to enable KCM.

    To resolve this create a BPF specific config option to pull
    only the strparser piece in that sockmap needs. This also
    allows folks who want to use BPF/syscall/maps but don't need
    sockmap to easily opt out.

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jun, 2017

1 commit

  • Software implementation of transport layer security, implemented using ULP
    infrastructure. tcp proto_ops are replaced with tls equivalents of sendmsg and
    sendpage.

    Only symmetric crypto is done in the kernel, keys are passed by setsockopt
    after the handshake is complete. All control messages are supported via CMSG
    data - the actual symmetric encryption is the same, just the message type needs
    to be passed separately.

    For user API, please see Documentation patch.

    Pieces that can be shared between hw and sw implementation
    are in tls_main.c

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Aviad Yehezkel
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file done via RCU list, which holds
    a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_, in order to aide
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2017

1 commit


04 Feb, 2017

1 commit

  • This module is responsible for the ife encapsulation protocol
    encode/decode logics. That module can:
    - ife_encode: encode skb and reserve space for the ife meta header
    - ife_decode: decode skb and extract the meta header size
    - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
    header space.
    - ife_tlv_meta_decode - decodes one tlv entry from the packet
    - ife_tlv_meta_next - advance to the next tlv

    Reviewed-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: Roman Mashak
    Signed-off-by: David S. Miller

    Yotam Gigi
     

25 Jan, 2017

1 commit

  • Add a general way for kernel modules to sample packets, without being tied
    to any specific subsystem. This netlink channel can be used by tc,
    iptables, etc. and allow to standardize packet sampling in the kernel.

    For every sampled packet, the psample module adds the following metadata
    fields:

    PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable

    PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable

    PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
    truncated during sampling

    PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
    user who initiated the sampling. This field allows the user to
    differentiate between several samplers working simultaneously and
    filter packets relevant to him

    PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
    sequence is kept for each group

    PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

    PSAMPLE_ATTR_DATA - the actual packet bits

    The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
    group. In addition, add the GET_GROUPS netlink command which allows the
    user to see the current sample groups, their refcount and sequence number.
    This command currently supports only netlink dump mode.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

12 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
    not right when CONFIG_NET is disabled and there is no socket interface:

    warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

    I don't know what the correct solution for this is, but simply removing
    the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
    'if NET' section avoids the warning and does not produce other build
    errors.

    Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

10 Jan, 2017

1 commit

  • * enable smc module loading and unloading
    * register new socket family
    * basic smc socket creation and deletion
    * use backing TCP socket to run CLC (Connection Layer Control)
    handshake of SMC protocol
    * Setup for infiniband traffic is implemented in follow-on patches.
    For now fallback to TCP socket is always used.

    Signed-off-by: Ursula Braun
    Reviewed-by: Utz Bacher
    Signed-off-by: David S. Miller

    Ursula Braun
     

02 Dec, 2016

1 commit

  • Registers new BPF program types which correspond to the LWT hooks:
    - BPF_PROG_TYPE_LWT_IN => dst_input()
    - BPF_PROG_TYPE_LWT_OUT => dst_output()
    - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

    The separate program types are required to differentiate between the
    capabilities each LWT hook allows:

    * Programs attached to dst_input() or dst_output() are restricted and
    may only read the data of an skb. This prevent modification and
    possible invalidation of already validated packet headers on receive
    and the construction of illegal headers while the IP headers are
    still being assembled.

    * Programs attached to lwtunnel_xmit() are allowed to modify packet
    content as well as prepending an L2 header via a newly introduced
    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
    invoked after the IP header has been assembled completely.

    All BPF programs receive an skb with L3 headers attached and may return
    one of the following error codes:

    BPF_OK - Continue routing as per nexthop
    BPF_DROP - Drop skb and return EPERM
    BPF_REDIRECT - Redirect skb to device as per redirect() helper.
    (Only valid in lwtunnel_xmit() context)

    The return codes are binary compatible with their TC_ACT_
    relatives to ease compatibility.

    Signed-off-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Thomas Graf
     

18 Aug, 2016

1 commit

  • This patch introduces a utility for parsing application layer protocol
    messages in a TCP stream. This is a generalization of the mechanism
    implemented of Kernel Connection Multiplexor.

    The API includes a context structure, a set of callbacks, utility
    functions, and a data ready function.

    A stream parser instance is defined by a strparse structure that
    is bound to a TCP socket. The function to initialize the structure
    is:

    int strp_init(struct strparser *strp, struct sock *csk,
    struct strp_callbacks *cb);

    csk is the TCP socket being bound to and cb are the parser callbacks.

    The upper layer calls strp_tcp_data_ready when data is ready on the lower
    socket for strparser to process. This should be called from a data_ready
    callback that is set on the socket:

    void strp_tcp_data_ready(struct strparser *strp);

    A parser is bound to a TCP socket by setting data_ready function to
    strp_tcp_data_ready so that all receive indications on the socket
    go through the parser. This is assumes that sk_user_data is set to
    the strparser structure.

    There are four callbacks.
    - parse_msg is called to parse the message (returns length or error).
    - rcv_msg is called when a complete message has been received
    - read_sock_done is called when data_ready function exits
    - abort_parser is called to abort the parser

    The input to parse_msg is an skbuff which contains next message under
    construction. The backend processing of parse_msg will parse the
    application layer protocol headers to determine the length of
    the message in the stream. The possible return values are:

    >0 : indicates length of successfully parsed message
    0 : indicates more data must be received to parse the message
    -ESTRPIPE : current message should not be processed by the
    kernel, return control of the socket to userspace which
    can proceed to read the messages itself
    other < 0 : Error is parsing, give control back to userspace
    assuming that synchronzation is lost and the stream
    is unrecoverable (application expected to close TCP socket)

    In the case of error return (< 0) strparse will stop the parser
    and report and error to userspace. The application must deal
    with the error. To handle the error the strparser is unbound
    from the TCP socket. If the error indicates that the stream
    TCP socket is at recoverable point (ESTRPIPE) then the application
    can read the TCP socket to process the stream. Once the application
    has dealt with the exceptions in the stream, it may again bind the
    socket to a strparser to continue data operations.

    Note that ENODATA may be returned to the application. In this case
    parse_msg returned -ESTRPIPE, however strparser was unable to maintain
    synchronization of the stream (i.e. some of the message in question
    was already read by the parser).

    strp_pause and strp_unpause are used to provide flow control. For
    instance, if rcv_msg is called but the upper layer can't immediately
    consume the message it can hold the message and pause strparser.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

20 Jul, 2016

1 commit

  • NCSI spec (DSP0222) defines several objects: package, channel, mode,
    filter, version and statistics etc. This introduces the data structs
    to represent those objects and implement functions to manage them.
    Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
    stack.

    * The user (e.g. netdev driver) dereference NCSI device by
    "struct ncsi_dev", which is embedded to "struct ncsi_dev_priv".
    The later one is used by NCSI stack internally.
    * Every NCSI device can have multiple packages simultaneously, up
    to 8 packages. It's represented by "struct ncsi_package" and
    identified by 3-bits ID.
    * Every NCSI package can have multiple channels, up to 32. It's
    represented by "struct ncsi_channel" and identified by 5-bits ID.
    * Every NCSI channel has version, statistics, various modes and
    filters. They are represented by "struct ncsi_channel_version",
    "struct ncsi_channel_stats", "struct ncsi_channel_mode" and
    "struct ncsi_channel_filter" separately.
    * Apart from AEN (Asynchronous Event Notification), the NCSI stack
    works in terms of command and response. This introduces "struct
    ncsi_req" to represent a complete NCSI transaction made of NCSI
    request and response.

    link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
    Signed-off-by: Gavin Shan
    Acked-by: Joel Stanley
    Signed-off-by: David S. Miller

    Gavin Shan
     

17 May, 2016

2 commits

  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo for quite some time now to further make spraying harder, and
    first implementation since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unpriviledged users (bpf_jit_harden == 1), with trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. There are no
    further e.g. hardening levels of bpf_jit_harden switch intended,
    rationale is to have it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. Just like
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG BPF_REG_AX, so actual operation on the target register
    is translated from BPF_K into BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.

    Current cBPF ones:

    # git grep -n HAVE_CBPF_JIT arch/
    arch/arm/Kconfig:44: select HAVE_CBPF_JIT
    arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
    arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
    arch/sparc/Kconfig:35: select HAVE_CBPF_JIT

    Current eBPF ones:

    # git grep -n HAVE_EBPF_JIT arch/
    arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
    arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
    arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64

    Later code also needs this facility to check for eBPF JITs.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 May, 2016

1 commit

  • Add an implementation of Qualcomm's IPC router protocol, used to
    communicate with service providing remote processors.

    Signed-off-by: Courtney Cavin
    Signed-off-by: Bjorn Andersson
    [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
    Signed-off-by: Bjorn Andersson
    Signed-off-by: David S. Miller

    Courtney Cavin
     

22 Mar, 2016

1 commit

  • commit 911362c70d ("net: add dst_cache support") added a new
    kconfig option that gets selected by other networking options.
    It seems the intent wasn't to offer this as a user-selectable
    option given the lack of help text, so this patch converts it
    to a silent option.

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     

15 Mar, 2016

1 commit

  • This basic implementation allows to share code between driver using
    hardware buffer management. As the code is hardware agnostic, there is
    few helpers, most of the optimization brought by the an HW BM has to be
    done at driver level.

    Tested-by: Sebastian Careba
    Signed-off-by: Gregory CLEMENT
    Signed-off-by: David S. Miller

    Gregory CLEMENT
     

10 Mar, 2016

1 commit

  • This module implements the Kernel Connection Multiplexor.

    Kernel Connection Multiplexor (KCM) is a facility that provides a
    message based interface over TCP for generic application protocols.
    With KCM an application can efficiently send and receive application
    protocol messages over TCP using datagram sockets.

    For more information see the included Documentation/networking/kcm.txt

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

04 Mar, 2016

1 commit

  • The new NET_DEVLINK infrastructure can be a loadable module, but the drivers
    using it might be built-in, which causes link errors like:

    drivers/net/built-in.o: In function `mlx4_load_one':
    :(.text+0x2fbfda): undefined reference to `devlink_port_register'
    :(.text+0x2fc084): undefined reference to `devlink_port_unregister'
    drivers/net/built-in.o: In function `mlxsw_sx_port_remove':
    :(.text+0x33a03a): undefined reference to `devlink_port_type_clear'
    :(.text+0x33a04e): undefined reference to `devlink_port_unregister'

    There are multiple ways to avoid this:

    a) add 'depends on NET_DEVLINK || !NET_DEVLINK' dependencies
    for each user
    b) use 'select NET_DEVLINK' from each driver that uses it
    and hide the symbol in Kconfig.
    c) make NET_DEVLINK a 'bool' option so we don't have to
    list it as a dependency, and rely on the APIs to be
    stubbed out when it is disabled
    d) use IS_REACHABLE() rather than IS_ENABLED() to check for
    NET_DEVLINK in include/net/devlink.h

    This implements a variation of approach a) by adding an
    intermediate symbol that drivers can depend on, and changes
    the three drivers using it.

    Signed-off-by: Arnd Bergmann
    Fixes: 09d4d087cd48 ("mlx4: Implement devlink interface")
    Fixes: c4745500e988 ("mlxsw: Implement devlink interface")
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

02 Mar, 2016

1 commit

  • Introduce devlink infrastructure for drivers to register and expose to
    userspace via generic Netlink interface.

    There are two basic objects defined:
    devlink - one instance for every "parent device", for example switch ASIC
    devlink port - one instance for every physical port of the device.

    This initial portion implements basic get/dump of objects to userspace.
    Also, port splitter and port type setting is implemented.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

17 Feb, 2016

1 commit

  • This patch add a generic, lockless dst cache implementation.
    The need for lock is avoided updating the dst cache fields
    only in per cpu scope, and requiring that the cache manipulation
    functions are invoked with the local bh disabled.

    The refresh_ts and reset_ts fields are used to ensure the cache
    consistency in case of cuncurrent cache update (dst_cache_set*) and
    reset operation (dst_cache_reset).

    Consider the following scenario:

    CPU1: CPU2:



    dst_cache_reset()
    dst_cache_set()

    The dst entry set passed to dst_cache_set() should not be used
    for later dst cache lookup, because it's obtained using old
    configuration values.

    Since the refresh_ts is updated only on dst_cache lookup, the
    cached value in the above scenario will be discarded on the next
    lookup.

    Signed-off-by: Paolo Abeni
    Suggested-and-acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Jan, 2016

1 commit

  • This work adds a generalization of the ingress qdisc as a qdisc holding
    only classifiers. The clsact qdisc works on ingress, but also on egress.
    In both cases, it's execution happens without taking the qdisc lock, and
    the main difference for the egress part compared to prior version of [1]
    is that this can be applied with _any_ underlying real egress qdisc (also
    classless ones).

    Besides solving the use-case of [1], that is, allowing for more programmability
    on assigning skb->priority for the mqprio case that is supported by most
    popular 10G+ NICs, it also opens up a lot more flexibility for other tc
    applications. The main work on classification can already be done at clsact
    egress time if the use-case allows and state stored for later retrieval
    f.e. again in skb->priority with major/minors (which is checked by most
    classful qdiscs before consulting tc_classify()) and/or in other skb fields
    like skb->tc_index for some light-weight post-processing to get to the
    eventual classid in case of a classful qdisc. Another use case is that
    the clsact egress part allows to have a central egress counterpart to
    the ingress classifiers, so that classifiers can easily share state (e.g.
    in cls_bpf via eBPF maps) for ingress and egress.

    Currently, default setups like mq + pfifo_fast would require for this to
    use, for example, prio qdisc instead (to get a tc_classify() run) and to
    duplicate the egress classifier for each queue. With clsact, it allows
    for leaving the setup as is, it can additionally assign skb->priority to
    put the skb in one of pfifo_fast's bands and it can share state with maps.
    Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
    w/o the need to perform a skb_dst_force() to hold on to it any longer. In
    lwt case, we can also use this facility to setup dst metadata via cls_bpf
    (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
    that (case of IFF_NO_QUEUE devices, for example).

    The realization can be done without any changes to the scheduler core
    framework. All it takes is that we have two a-priori defined minors/child
    classes, where we can mux between ingress and egress classifier list
    (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
    dev->_tx to avoid extra cacheline miss for moderate loads). The egress
    part is a bit similar modelled to handle_ing() and patched to a noop in
    case the functionality is not used. Both handlers are now called
    sch_handle_ingress() and sch_handle_egress(), code sharing among the two
    doesn't seem practical as there are various minor differences in both
    paths, so that making them conditional in a single handler would rather
    slow things down.

    Full compatibility to ingress qdisc is provided as well. Since both
    piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
    per netdevice, and thus ingress qdisc specific behaviour can be retained
    for user space. This means, either a user does 'tc qdisc add dev foo ingress'
    and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
    alternative, where both, ingress and egress classifier can be configured
    as in the below example. ingress qdisc supports attaching classifier to any
    minor number whereas clsact has two fixed minors for muxing between the
    lists, therefore to not break user space setups, they are better done as
    two separate qdiscs.

    I decided to extend the sch_ingress module with clsact functionality so
    that commonly used code can be reused, the module is being aliased with
    sch_clsact so that it can be auto-loaded properly. Alternative would have been
    to add a flag when initializing ingress to alter its behaviour plus aliasing
    to a different name (as it's more than just ingress). However, the first would
    end up, based on the flag, choosing the new/old behaviour by calling different
    function implementations to handle each anyway, the latter would require to
    register ingress qdisc once again under different alias. So, this really begs
    to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
    by its own that share callbacks used by both.

    Example, adding qdisc:

    # tc qdisc add dev foo clsact
    # tc qdisc show dev foo
    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1

    Adding filters (deleting, etc works analogous by specifying ingress/egress):

    # tc filter add dev foo ingress bpf da obj bar.o sec ingress
    # tc filter add dev foo egress bpf da obj bar.o sec egress
    # tc filter show dev foo ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
    # tc filter show dev foo egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

    A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
    show an empty list for clsact. Either using the parent names (ingress/egress)
    or specifying the full major/minor will then show the related filter lists.

    Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

    [1] http://patchwork.ozlabs.org/patch/512949/

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Dec, 2015

1 commit

  • Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
    ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
    and its accessors are defined in cgroup-defs.h. This is to prepare
    for overloading the fields with a cgroup pointer.

    This patch mostly performs equivalent conversions but the followings
    are noteworthy.

    * Equality test before updating classid is removed from
    sock_update_classid(). This shouldn't make any noticeable
    difference and a similar test will be implemented on the helper side
    later.

    * sock_update_netprioidx() now takes struct sock_cgroup_data and can
    be moved to netprio_cgroup.h without causing include dependency
    loop. Moved.

    * The dummy version of sock_update_netprioidx() converted to a static
    inline function while at it.

    Signed-off-by: Tejun Heo
    Signed-off-by: David S. Miller

    Tejun Heo
     

30 Sep, 2015

1 commit

  • L3 master devices allow users of the abstraction to influence FIB lookups
    for enslaved devices. Current API provides a means for the master device
    to return a specific FIB table for an enslaved device, to return an
    rtable/custom dst and influence the OIF used for fib lookups.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

22 Jul, 2015

1 commit


14 May, 2015

1 commit

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
    problem since this also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira