04 Sep, 2017

1 commit


30 Aug, 2017

1 commit

  • Add a new nsh/ directory. It currently holds only GSO functions but more
    will come: in particular, code shared by openvswitch and tc to manipulate
    NSH headers.

    For now, assume there's no hardware support for NSH segmentation. We can
    always introduce netdev->nsh_features later.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

29 Aug, 2017

2 commits

  • It's time to get rid of IRDA. It's long been broken, and no one seems
    to use it anymore. So move it to staging and after a while, we can
    delete it from there.

    To start, move the network irda core from net/irda to
    drivers/staging/irda/net/

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: David S. Miller

    Greg Kroah-Hartman
     
  • SOCKMAP uses the strparser code (compiled with the Kconfig option
    CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
    config option set, sockmap won't be compiled. However, at the moment
    the only way to pull in the strparser code is to enable KCM.

    To resolve this, create a BPF-specific config option that pulls in
    only the strparser piece sockmap needs. This also
    allows folks who want to use BPF/syscall/maps but don't need
    sockmap to easily opt out.

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jun, 2017

1 commit

  • Software implementation of transport layer security, built on the ULP
    infrastructure. The TCP proto_ops sendmsg and sendpage are replaced
    with TLS equivalents.

    Only symmetric crypto is done in the kernel; keys are passed via
    setsockopt after the handshake is complete. All control messages are
    supported via CMSG data - the actual symmetric encryption is the same,
    just the message type needs to be passed separately.

    For the user API, please see the Documentation patch.

    Pieces that can be shared between hw and sw implementation
    are in tls_main.c
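
    As a hedged illustration of that user API (a minimal sketch based on
    the description above; struct and constant names follow the uapi
    linux/tls.h header, and the guarded fallback values are assumptions in
    case the toolchain headers lack them), enabling kernel TLS on the
    transmit path of an already-handshaked TCP socket looks roughly like:

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31      /* assumed value of the TCP_ULP sockopt */
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282     /* assumed value of the TLS socket level */
    #endif

    static int enable_ktls_tx(int fd, const unsigned char *key,
                              const unsigned char *iv,
                              const unsigned char *salt,
                              const unsigned char *rec_seq)
    {
            struct tls12_crypto_info_aes_gcm_128 ci;

            memset(&ci, 0, sizeof(ci));
            ci.info.version = TLS_1_2_VERSION;
            ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
            memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
            memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
            memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
            memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

            /* Switch the socket's ULP to tls, then hand the negotiated
             * symmetric key material to the kernel for the TX path. */
            if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
                    return -1;
            return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }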

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Aviad Yehezkel
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

18 Feb, 2017

1 commit

  • A long-standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in the core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same goes for address correlation
    done from user space by reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See the offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in the fast path and that is also shared for symbol/size/offset
    lookup for a specific given address in kallsyms. The slow-path
    iteration through all symbols in the seq file is done via an RCU list,
    which holds a tiny fraction of all exported ksyms, usually below 0.1
    percent. Function symbols are exported as bpf_prog_<tag>, in order to
    aid debugging and attribution. This facility is currently enabled for
    root only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that a lot of
    systems still ship with world read permissions on kallsyms, thus
    addresses should not suddenly get exposed for them. If that situation
    gets much better in the future, we always have the option to change
    the default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such program types relevant in this context are for root only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2017

1 commit


04 Feb, 2017

1 commit

  • This module is responsible for the IFE encapsulation protocol
    encode/decode logic. The module can:
    - ife_encode: encode the skb and reserve space for the IFE meta header
    - ife_decode: decode the skb and extract the meta header size
    - ife_tlv_meta_encode: encode one TLV entry into the reserved IFE
    header space
    - ife_tlv_meta_decode: decode one TLV entry from the packet
    - ife_tlv_meta_next: advance to the next TLV

    Reviewed-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: Roman Mashak
    Signed-off-by: David S. Miller

    Yotam Gigi
     

25 Jan, 2017

1 commit

  • Add a general way for kernel modules to sample packets, without being tied
    to any specific subsystem. This netlink channel can be used by tc,
    iptables, etc. and allows standardizing packet sampling in the kernel.

    For every sampled packet, the psample module adds the following metadata
    fields:

    PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable

    PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable

    PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
    truncated during sampling

    PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
    user who initiated the sampling. This field allows the user to
    differentiate between several samplers working simultaneously and to
    filter the packets relevant to them

    PSAMPLE_ATTR_GROUP_SEQ - a sequence counter of the last sent packet; the
    sequence is kept per group

    PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

    PSAMPLE_ATTR_DATA - the actual packet bits

    The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
    group. In addition, add the GET_GROUPS netlink command which allows the
    user to see the current sample groups, their refcount and sequence number.
    This command currently supports only netlink dump mode.
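
    For an in-kernel user (e.g. a tc action or a driver), a hedged sketch of
    reporting one sampled packet through this module - the function names
    follow the psample API described above, but the exact signatures are an
    assumption here:

    #include <linux/netdevice.h>
    #include <net/psample.h>

    static void report_sample(struct net *net, struct sk_buff *skb,
                              u32 group_num, u32 rate)
    {
            struct psample_group *group;

            group = psample_group_get(net, group_num);
            if (!group)
                    return;

            /* Deliver the packet (truncated here to 128 bytes) to the
             * PSAMPLE_NL_MCGRP_SAMPLE multicast listeners. */
            psample_sample_packet(group, skb, 128,
                                  skb->dev ? skb->dev->ifindex : 0, /* iif */
                                  0,                                /* oif */
                                  rate);
            psample_group_put(group);
    }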

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

12 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
    not right when CONFIG_NET is disabled and there is no socket interface:

    warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

    I don't know what the correct solution for this is, but simply removing
    the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
    'if NET' section avoids the warning and does not produce other build
    errors.

    Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

10 Jan, 2017

1 commit

  • * enable smc module loading and unloading
    * register new socket family
    * basic smc socket creation and deletion
    * use backing TCP socket to run CLC (Connection Layer Control)
    handshake of SMC protocol
    * Setup for InfiniBand traffic is implemented in follow-on patches.
    For now, fallback to the TCP socket is always used.
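
    From user space, a hedged sketch of creating such a socket (the AF_SMC
    numeric value below is an assumption for toolchains whose headers do not
    define it yet; error handling omitted):

    #include <sys/socket.h>

    #ifndef AF_SMC
    #define AF_SMC 43       /* assumed SMC address family number */
    #endif

    /* Looks like a TCP stream socket to the application; while the
     * RDMA/infiniband setup is not yet implemented, it transparently
     * falls back to the backing TCP socket. */
    int fd = socket(AF_SMC, SOCK_STREAM, 0);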

    Signed-off-by: Ursula Braun
    Reviewed-by: Utz Bacher
    Signed-off-by: David S. Miller

    Ursula Braun
     

02 Dec, 2016

1 commit

  • Registers new BPF program types which correspond to the LWT hooks:
    - BPF_PROG_TYPE_LWT_IN => dst_input()
    - BPF_PROG_TYPE_LWT_OUT => dst_output()
    - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

    The separate program types are required to differentiate between the
    capabilities each LWT hook allows:

    * Programs attached to dst_input() or dst_output() are restricted and
    may only read the data of an skb. This prevents modification and
    possible invalidation of already validated packet headers on receive,
    and the construction of illegal headers while the IP headers are
    still being assembled.

    * Programs attached to lwtunnel_xmit() are allowed to modify packet
    content as well as prepending an L2 header via a newly introduced
    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
    invoked after the IP header has been assembled completely.

    All BPF programs receive an skb with L3 headers attached and may return
    one of the following error codes:

    BPF_OK - Continue routing as per nexthop
    BPF_DROP - Drop skb and return EPERM
    BPF_REDIRECT - Redirect skb to device as per redirect() helper.
    (Only valid in lwtunnel_xmit() context)

    The return codes are binary compatible with their TC_ACT_
    relatives to ease compatibility.
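
    As a minimal sketch of such a program (assuming a clang BPF target and a
    libbpf/iproute2-style SEC() annotation, neither of which is part of this
    patch), a pass-through program for the restricted dst_input() hook could
    look like:

    #include <linux/bpf.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif

    /* BPF_PROG_TYPE_LWT_IN: may only read skb data, never modify it. */
    SEC("lwt_in")
    int lwt_in_pass(struct __sk_buff *skb)
    {
            if (skb->len == 0)
                    return BPF_DROP;        /* drop skb, return EPERM    */
            return BPF_OK;                  /* continue routing normally */
    }

    char _license[] SEC("license") = "GPL";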

    Signed-off-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Thomas Graf
     

18 Aug, 2016

1 commit

  • This patch introduces a utility for parsing application layer protocol
    messages in a TCP stream. This is a generalization of the mechanism
    implemented for the Kernel Connection Multiplexor.

    The API includes a context structure, a set of callbacks, utility
    functions, and a data ready function.

    A stream parser instance is defined by a strparser structure that
    is bound to a TCP socket. The function to initialize the structure
    is:

    int strp_init(struct strparser *strp, struct sock *csk,
    struct strp_callbacks *cb);

    csk is the TCP socket being bound to and cb are the parser callbacks.

    The upper layer calls strp_tcp_data_ready when data is ready on the lower
    socket for strparser to process. This should be called from a data_ready
    callback that is set on the socket:

    void strp_tcp_data_ready(struct strparser *strp);

    A parser is bound to a TCP socket by setting the data_ready function to
    strp_tcp_data_ready so that all receive indications on the socket
    go through the parser. This assumes that sk_user_data is set to
    the strparser structure.

    There are four callbacks:
    - parse_msg is called to parse the message (returns length or error)
    - rcv_msg is called when a complete message has been received
    - read_sock_done is called when the data_ready function exits
    - abort_parser is called to abort the parser

    The input to parse_msg is an skbuff which contains the next message under
    construction. The backend processing of parse_msg will parse the
    application layer protocol headers to determine the length of
    the message in the stream. The possible return values are:

    >0 : indicates length of successfully parsed message
    0 : indicates more data must be received to parse the message
    -ESTRPIPE : current message should not be processed by the
    kernel, return control of the socket to userspace which
    can proceed to read the messages itself
    other < 0 : Error in parsing; give control back to userspace,
    assuming that synchronization is lost and the stream
    is unrecoverable (the application is expected to close the TCP socket)

    In the case of an error return (< 0), strparser will stop the parser
    and report an error to userspace. The application must deal
    with the error. To handle the error, the strparser is unbound
    from the TCP socket. If the error indicates that the stream of the
    TCP socket is at a recoverable point (ESTRPIPE), then the application
    can read the TCP socket to process the stream. Once the application
    has dealt with the exceptions in the stream, it may again bind the
    socket to a strparser to continue data operations.

    Note that ENODATA may be returned to the application. In this case
    parse_msg returned -ESTRPIPE, however strparser was unable to maintain
    synchronization of the stream (i.e. some of the message in question
    was already read by the parser).

    strp_pause and strp_unpause are used to provide flow control. For
    instance, if rcv_msg is called but the upper layer can't immediately
    consume the message, it can hold the message and pause the strparser.
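
    Putting the pieces above together, a hedged sketch of a minimal user of
    this API (the callback signatures and framing are inferred from the
    description here and may differ in detail from the in-tree header):

    #include <net/strparser.h>
    #include <net/sock.h>

    /* Example framing: each message carries a 2-byte big-endian length. */
    static int my_parse_msg(struct strparser *strp, struct sk_buff *skb)
    {
            __be16 hdr;

            if (skb_copy_bits(skb, 0, &hdr, sizeof(hdr)) < 0)
                    return 0;                       /* need more data       */
            return sizeof(hdr) + ntohs(hdr);        /* total message length */
    }

    static void my_rcv_msg(struct strparser *strp, struct sk_buff *skb)
    {
            /* A complete message has arrived; consume it here. */
            kfree_skb(skb);
    }

    static struct strp_callbacks my_cb = {
            .parse_msg = my_parse_msg,
            .rcv_msg   = my_rcv_msg,
    };

    /* In the owner's setup path, with csk a connected TCP struct sock:
     *
     *      err = strp_init(&my_strp, csk, &my_cb);
     *
     * and csk's sk_data_ready callback arranged to call
     * strp_tcp_data_ready(&my_strp) as described above. */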

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

20 Jul, 2016

1 commit

  • NCSI spec (DSP0222) defines several objects: package, channel, mode,
    filter, version and statistics etc. This introduces the data structs
    to represent those objects and implements functions to manage them.
    Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
    stack.

    * The user (e.g. a netdev driver) references the NCSI device by
    "struct ncsi_dev", which is embedded in "struct ncsi_dev_priv".
    The latter is used by the NCSI stack internally.
    * Every NCSI device can have multiple packages simultaneously, up
    to 8 packages. It's represented by "struct ncsi_package" and
    identified by a 3-bit ID.
    * Every NCSI package can have multiple channels, up to 32. It's
    represented by "struct ncsi_channel" and identified by a 5-bit ID.
    * Every NCSI channel has version, statistics, various modes and
    filters. They are represented by "struct ncsi_channel_version",
    "struct ncsi_channel_stats", "struct ncsi_channel_mode" and
    "struct ncsi_channel_filter" separately.
    * Apart from AEN (Asynchronous Event Notification), the NCSI stack
    works in terms of command and response. This introduces "struct
    ncsi_req" to represent a complete NCSI transaction made of NCSI
    request and response.

    link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
    Signed-off-by: Gavin Shan
    Acked-by: Joel Stanley
    Signed-off-by: David S. Miller

    Gavin Shan
     

17 May, 2016

2 commits

  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo list for quite some time now to further make spraying harder; this
    is a first implementation of it, discussed since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in cases where larger
    programs that cross a page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), trading off
    performance for them, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. There are no
    further e.g. hardening levels of bpf_jit_harden switch intended,
    rationale is to have it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. Just like
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <op> BPF_REG_AX, so the actual operation on the target register
    is translated from a BPF_K into a BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.
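
    To illustrate the rewrite (a conceptual sketch, not the exact in-kernel
    implementation, using the instruction macros from linux/filter.h), a
    32 bit BPF_K ALU operation such as "r0 += K" conceptually becomes the
    following sequence, with RND freshly drawn per instruction:

    /* original:  BPF_ALU32_IMM(BPF_ADD, BPF_REG_0, K)                      */
    BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, RND ^ K),   /* load blinded constant */
    BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, RND),       /* unblind it in AX      */
    BPF_ALU32_REG(BPF_ADD, BPF_REG_0, BPF_REG_AX), /* same op, now BPF_X    */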

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    The above extract/example uses a single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.

    Current cBPF ones:

    # git grep -n HAVE_CBPF_JIT arch/
    arch/arm/Kconfig:44: select HAVE_CBPF_JIT
    arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
    arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
    arch/sparc/Kconfig:35: select HAVE_CBPF_JIT

    Current eBPF ones:

    # git grep -n HAVE_EBPF_JIT arch/
    arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
    arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
    arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64

    Later code also needs this facility to check for eBPF JITs.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 May, 2016

1 commit

  • Add an implementation of Qualcomm's IPC router protocol, used to
    communicate with service-providing remote processors.

    Signed-off-by: Courtney Cavin
    Signed-off-by: Bjorn Andersson
    [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
    Signed-off-by: Bjorn Andersson
    Signed-off-by: David S. Miller

    Courtney Cavin
     

22 Mar, 2016

1 commit

  • commit 911362c70d ("net: add dst_cache support") added a new
    kconfig option that gets selected by other networking options.
    It seems the intent wasn't to offer this as a user-selectable
    option given the lack of help text, so this patch converts it
    to a silent option.

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     

15 Mar, 2016

1 commit

  • This basic implementation allows sharing code between drivers using
    hardware buffer management. As the code is hardware agnostic, there
    are few helpers; most of the optimization brought by an HW BM has to
    be done at the driver level.

    Tested-by: Sebastian Careba
    Signed-off-by: Gregory CLEMENT
    Signed-off-by: David S. Miller

    Gregory CLEMENT
     

10 Mar, 2016

1 commit

  • This module implements the Kernel Connection Multiplexor.

    Kernel Connection Multiplexor (KCM) is a facility that provides a
    message based interface over TCP for generic application protocols.
    With KCM an application can efficiently send and receive application
    protocol messages over TCP using datagram sockets.

    For more information see the included Documentation/networking/kcm.txt
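
    A hedged user-space sketch of the basic flow (constants and struct
    kcm_attach come from the uapi linux/kcm.h header added with this series;
    the guarded AF_KCM value is an assumption, and the TCP socket and BPF
    message-parser program are assumed to be set up already):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/kcm.h>

    #ifndef AF_KCM
    #define AF_KCM 41       /* assumed KCM address family number */
    #endif

    static int kcm_setup(int tcpfd, int bpf_prog_fd)
    {
            struct kcm_attach attach;
            int kcmfd;

            /* Create the multiplexor's datagram socket... */
            kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
            if (kcmfd < 0)
                    return -1;

            /* ...and attach a connected TCP socket plus the BPF program
             * that delineates application messages in the byte stream. */
            memset(&attach, 0, sizeof(attach));
            attach.fd = tcpfd;
            attach.bpf_fd = bpf_prog_fd;
            if (ioctl(kcmfd, SIOCKCMATTACH, &attach) < 0)
                    return -1;
            return kcmfd;
    }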

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

04 Mar, 2016

1 commit

  • The new NET_DEVLINK infrastructure can be a loadable module, but the drivers
    using it might be built-in, which causes link errors like:

    drivers/net/built-in.o: In function `mlx4_load_one':
    :(.text+0x2fbfda): undefined reference to `devlink_port_register'
    :(.text+0x2fc084): undefined reference to `devlink_port_unregister'
    drivers/net/built-in.o: In function `mlxsw_sx_port_remove':
    :(.text+0x33a03a): undefined reference to `devlink_port_type_clear'
    :(.text+0x33a04e): undefined reference to `devlink_port_unregister'

    There are multiple ways to avoid this:

    a) add 'depends on NET_DEVLINK || !NET_DEVLINK' dependencies
    for each user
    b) use 'select NET_DEVLINK' from each driver that uses it
    and hide the symbol in Kconfig.
    c) make NET_DEVLINK a 'bool' option so we don't have to
    list it as a dependency, and rely on the APIs to be
    stubbed out when it is disabled
    d) use IS_REACHABLE() rather than IS_ENABLED() to check for
    NET_DEVLINK in include/net/devlink.h

    This implements a variation of approach a) by adding an
    intermediate symbol that drivers can depend on, and changes
    the three drivers using it.

    Signed-off-by: Arnd Bergmann
    Fixes: 09d4d087cd48 ("mlx4: Implement devlink interface")
    Fixes: c4745500e988 ("mlxsw: Implement devlink interface")
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

02 Mar, 2016

1 commit

  • Introduce devlink infrastructure for drivers to register and expose to
    userspace via a generic Netlink interface.

    There are two basic objects defined:
    devlink - one instance for every "parent device", for example switch ASIC
    devlink port - one instance for every physical port of the device.

    This initial portion implements basic get/dump of objects to userspace.
    Also, port splitting and port type setting are implemented.
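
    From a driver's perspective, a hedged sketch of the registration flow
    (function names follow the devlink API introduced here, but exact
    signatures are an assumption; my_devlink_ops, my_priv, pdev and port are
    driver-specific context, and error handling is omitted):

    struct devlink *dl;

    dl = devlink_alloc(&my_devlink_ops, sizeof(struct my_priv));

    /* One devlink instance per "parent device", e.g. per switch ASIC. */
    devlink_register(dl, &pdev->dev);

    /* One devlink port instance per physical port of the device. */
    devlink_port_register(dl, &port->dl_port, port->index);
    devlink_port_type_eth_set(&port->dl_port, port->netdev);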

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

17 Feb, 2016

1 commit

  • This patch adds a generic, lockless dst cache implementation.
    The need for a lock is avoided by updating the dst cache fields
    only in per-CPU scope, and by requiring that the cache manipulation
    functions are invoked with local BHs disabled.

    The refresh_ts and reset_ts fields are used to ensure cache
    consistency in case of concurrent cache updates (dst_cache_set*) and
    reset operations (dst_cache_reset).

    Consider the following scenario:

    CPU1:                                   CPU2:

                                            <cache lookup with empty cache>
                                            <get dst via uncached route lookup>
                                            <related configuration changes>
    dst_cache_reset()
                                            dst_cache_set()

    The dst entry set passed to dst_cache_set() should not be used
    for later dst cache lookup, because it's obtained using old
    configuration values.

    Since the refresh_ts is updated only on dst_cache lookup, the
    cached value in the above scenario will be discarded on the next
    lookup.
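
    As a hedged sketch of the intended calling pattern for an IPv4 tunnel
    transmit path (the dst_cache_* helpers are the ones added here, while
    "tunnel", "fl4" and "net" are assumed caller context), with BHs already
    disabled as required above:

    struct rtable *rt;
    __be32 saddr;

    rt = dst_cache_get_ip4(&tunnel->dst_cache, &saddr);
    if (!rt) {
            /* Cache miss: fall back to a regular (uncached) route lookup
             * and remember the result for subsequent packets. */
            rt = ip_route_output_key(net, &fl4);
            if (IS_ERR(rt))
                    return PTR_ERR(rt);
            dst_cache_set_ip4(&tunnel->dst_cache, &rt->dst, fl4.saddr);
    }

    /* On relevant configuration changes, the owner calls
     * dst_cache_reset(&tunnel->dst_cache). */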

    Signed-off-by: Paolo Abeni
    Suggested-and-acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Jan, 2016

1 commit

  • This work adds a generalization of the ingress qdisc as a qdisc holding
    only classifiers. The clsact qdisc works on ingress, but also on egress.
    In both cases, its execution happens without taking the qdisc lock, and
    the main difference for the egress part compared to prior version of [1]
    is that this can be applied with _any_ underlying real egress qdisc (also
    classless ones).

    Besides solving the use-case of [1], that is, allowing for more programmability
    on assigning skb->priority for the mqprio case that is supported by most
    popular 10G+ NICs, it also opens up a lot more flexibility for other tc
    applications. The main work on classification can already be done at clsact
    egress time if the use-case allows and state stored for later retrieval
    f.e. again in skb->priority with major/minors (which is checked by most
    classful qdiscs before consulting tc_classify()) and/or in other skb fields
    like skb->tc_index for some light-weight post-processing to get to the
    eventual classid in case of a classful qdisc. Another use case is that
    the clsact egress part allows to have a central egress counterpart to
    the ingress classifiers, so that classifiers can easily share state (e.g.
    in cls_bpf via eBPF maps) for ingress and egress.

    Currently, default setups like mq + pfifo_fast would require for this to
    use, for example, prio qdisc instead (to get a tc_classify() run) and to
    duplicate the egress classifier for each queue. With clsact, it allows
    for leaving the setup as is, it can additionally assign skb->priority to
    put the skb in one of pfifo_fast's bands and it can share state with maps.
    Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
    w/o the need to perform a skb_dst_force() to hold on to it any longer. In
    lwt case, we can also use this facility to setup dst metadata via cls_bpf
    (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
    that (case of IFF_NO_QUEUE devices, for example).

    The realization can be done without any changes to the scheduler core
    framework. All it takes is that we have two a-priori defined minors/child
    classes, where we can mux between ingress and egress classifier list
    (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
    dev->_tx to avoid extra cacheline miss for moderate loads). The egress
    part is modelled a bit similarly to handle_ing() and patched to a noop in
    case the functionality is not used. Both handlers are now called
    sch_handle_ingress() and sch_handle_egress(), code sharing among the two
    doesn't seem practical as there are various minor differences in both
    paths, so that making them conditional in a single handler would rather
    slow things down.

    Full compatibility to ingress qdisc is provided as well. Since both
    piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
    per netdevice, and thus ingress qdisc specific behaviour can be retained
    for user space. This means, either a user does 'tc qdisc add dev foo ingress'
    and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
    alternative, where both, ingress and egress classifier can be configured
    as in the below example. ingress qdisc supports attaching classifier to any
    minor number whereas clsact has two fixed minors for muxing between the
    lists, therefore to not break user space setups, they are better done as
    two separate qdiscs.

    I decided to extend the sch_ingress module with clsact functionality so
    that commonly used code can be reused; the module is aliased as
    sch_clsact so that it can be auto-loaded properly. An alternative would
    have been to add a flag when initializing ingress to alter its behaviour,
    plus aliasing it to a different name (as it's more than just ingress).
    However, the first would end up, based on the flag, choosing the new/old
    behaviour by calling different function implementations to handle each
    anyway, and the latter would require registering the ingress qdisc once
    again under a different alias. So this really calls for a minimal,
    cleaner approach with its own Qdisc_ops and Qdisc_class_ops that share
    the callbacks used by both.

    Example, adding qdisc:

    # tc qdisc add dev foo clsact
    # tc qdisc show dev foo
    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1

    Adding filters (deleting, etc. works analogously by specifying ingress/egress):

    # tc filter add dev foo ingress bpf da obj bar.o sec ingress
    # tc filter add dev foo egress bpf da obj bar.o sec egress
    # tc filter show dev foo ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
    # tc filter show dev foo egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

    A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
    show an empty list for clsact. Either using the parent names (ingress/egress)
    or specifying the full major/minor will then show the related filter lists.

    Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

    [1] http://patchwork.ozlabs.org/patch/512949/

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Dec, 2015

1 commit

  • Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
    ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
    and its accessors are defined in cgroup-defs.h. This is to prepare
    for overloading the fields with a cgroup pointer.

    This patch mostly performs equivalent conversions but the following
    are noteworthy.

    * Equality test before updating classid is removed from
    sock_update_classid(). This shouldn't make any noticeable
    difference and a similar test will be implemented on the helper side
    later.

    * sock_update_netprioidx() now takes struct sock_cgroup_data and can
    be moved to netprio_cgroup.h without causing include dependency
    loop. Moved.

    * The dummy version of sock_update_netprioidx() is converted to a static
    inline function while at it.

    Signed-off-by: Tejun Heo
    Signed-off-by: David S. Miller

    Tejun Heo
     

30 Sep, 2015

1 commit

  • L3 master devices allow users of the abstraction to influence FIB lookups
    for enslaved devices. Current API provides a means for the master device
    to return a specific FIB table for an enslaved device, to return an
    rtable/custom dst and influence the OIF used for fib lookups.
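
    As a hedged sketch of what a master device driver hooks into (the op
    member names follow the description above and may differ in detail from
    the in-tree header; "my_master" and "tb_id" are hypothetical):

    static u32 my_l3mdev_fib_table(const struct net_device *dev)
    {
            struct my_master *m = netdev_priv(dev);

            /* FIB table that lookups for enslaved devices should use. */
            return m->tb_id;
    }

    static const struct l3mdev_ops my_l3mdev_ops = {
            .l3mdev_fib_table = my_l3mdev_fib_table,
            /* optionally a hook returning an rtable/custom dst for
             * locally generated traffic */
    };

    /* in the driver's setup code: dev->l3mdev_ops = &my_l3mdev_ops; */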

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

22 Jul, 2015

1 commit


14 May, 2015

1 commit

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this, which should be
    no problem since it also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     

07 Jan, 2015

1 commit


03 Dec, 2014

1 commit

  • The goal of this is to provide a possibility to support various switch
    chips. Drivers should implement relevant ndos to do so. For now there is
    only one ndo defined:
    - an ndo for getting the physical switch id is in place.

    Note that the user can use any port netdevice to access the switch.
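
    A hedged sketch of a driver implementing that ndo (the ndo and id-struct
    names follow this description but are assumptions here; "my_port" and
    its members are hypothetical):

    static int my_port_swdev_parent_id_get(struct net_device *dev,
                                           struct netdev_phys_item_id *psid)
    {
            struct my_port *port = netdev_priv(dev);

            /* All ports of one switch chip report the same id, so user
             * space can tell which netdevs sit on the same ASIC. */
            psid->id_len = sizeof(port->chip_id);
            memcpy(psid->id, &port->chip_id, psid->id_len);
            return 0;
    }

    /* hooked up from the driver's net_device_ops, e.g.:
     *      .ndo_switch_parent_id_get = my_port_swdev_parent_id_get,
     */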

    Signed-off-by: Jiri Pirko
    Reviewed-by: Thomas Graf
    Acked-by: Andy Gospodarek
    Signed-off-by: David S. Miller

    Jiri Pirko
     

28 Oct, 2014

1 commit

  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Oct, 2014

1 commit

  • minimal configurations where EPOLL, PERF_EVENTS, etc are disabled,
    but NET is enabled, are failing to build with link error:
    kernel/built-in.o: In function `bpf_prog_load':
    syscall.c:(.text+0x3b728): undefined reference to `anon_inode_getfd'

    fix it by selecting ANON_INODES when NET is enabled

    Reported-by: Michal Sojka
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

01 Oct, 2014

1 commit

  • Eric reports build failure with
    CONFIG_BRIDGE_NETFILTER=n

    We insist on building br_nf_core.o unconditionally, but we must only do
    so if br_netfilter is enabled; otherwise it fails to build because
    functions are defined to empty stubs (and some structure members are
    defined out).

    Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.

    Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core)
    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Florian Westphal
     

27 Sep, 2014

1 commit

  • Jesper reported that br_netfilter always registers the hooks since
    this is part of the bridge core. This harms performance for people that
    don't need this.

    This patch modularizes br_netfilter so it can be rmmod'ed and, thus,
    the hooks can be unregistered. I think the bridge netfilter should have
    been a separate module since the beginning; Patrick agreed on that.

    Note that this is breaking compatibility for users that expect that
    bridge netfilter is going to be available after explicitly 'modprobe
    bridge' or via automatic load through brctl.

    However, the damage can be easily undone by modprobing br_netfilter.
    The bridge core also prints a message to provide a clue to people who
    didn't notice that this has been deprecated.

    On top of that, the plan is that nftables will not rely on this software
    layer, but integrate the connection tracking into the bridge layer to
    enable stateful filtering and NAT, which is what bridge netfilter users
    seem to require.

    This patch still keeps the fake_dst_ops in the bridge core, since this
    is required when the bridge port is initialized. So we can safely
    modprobe/rmmod br_netfilter anytime.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Florian Westphal

    Pablo Neira Ayuso
     

12 Jul, 2014

1 commit

  • This patch moves generic code which is used by bluetooth and ieee802154
    6lowpan to a new net/6lowpan directory. This directory contains generic
    6LoWPAN code which is shared between bluetooth and ieee802154 MAC-Layer.

    At the moment this is the IPHC - "IPv6 Header Compression" - format,
    which is described by RFC 6282 [0]. The BTLE 6LoWPAN draft describes
    that the IPHC format is the same as for IEEE 802.15.4, see [1].

    Furthermore, we can put more code into this directory that is shared
    between BTLE and IEEE 802.15.4 6LoWPAN, like RFC 6775 [2] or the RPL
    routing protocol, RFC 6550 [3].

    To avoid naming conflicts I renamed 6lowpan-y to ieee802154_6lowpan-y
    in net/ieee802154/Makefile.

    [0] http://tools.ietf.org/html/rfc6282
    [1] http://tools.ietf.org/html/draft-ietf-6lowpan-btle-12#section-3.2
    [2] http://tools.ietf.org/html/rfc6775
    [3] http://tools.ietf.org/html/rfc6550

    Signed-off-by: Alexander Aring
    Acked-by: Jukka Rissanen
    Signed-off-by: Marcel Holtmann

    Alexander Aring
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Prepatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • This commit fixes a build error reported by Fengguang, that is
    triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

    ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

    The fix is to introduce its own file for the PTP BPF classifier,
    so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
    it independently from each other. IXP4xx driver on ARM needs to
    select it as well since it does not seem to select PTP_1588_CLOCK
    or similar that would pull it in automatically.

    This also allows for hiding all of the internals of the BPF PTP
    program inside that file, and only exporting relevant API bits
    to drivers.

    This patch also adds a kdoc documentation of ptp_classify_raw()
    API to make it clear that it can return PTP_CLASS_* defines. Also,
    the BPF program has been translated into bpf_asm code, so that it
    can be more easily read and altered (extensively documented in [1]).

    In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
    tools, so the commented program can simply be translated via
    `./bpf_asm -c prog` where prog is a file that contains the
    commented code. This makes it easily readable/verifiable and when
    there's a need to change something, jump offsets etc do not need
    to be replaced manually which can be very error prone. Instead,
    a newly translated version via bpf_asm can simply replace the old
    code. I have checked opcode diffs before/after and it's the very
    same filter.

    [1] Documentation/networking/filter.txt

    Fixes: 164d8c666521 ("net: ptp: do not reimplement PTP/BPF classifier")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

08 Feb, 2014

1 commit

  • net_prio is the only cgroup which is allowed to be built as a module.
    The savings from allowing one controller to be built as a module are
    tiny especially given that cgroup module support itself adds quite a
    bit of complexity.

    Given that none of other controllers has much chance of being made a
    module and that we're unlikely to add new modular controllers, the
    added complexity is simply not justifiable.

    As a first step to drop cgroup module support, this patch changes the
    config option to bool from tristate and drops module related code from
    it.

    Also, while an earlier commit fe1217c4f3f7 ("net: net_cls: move
    cgroupfs classid handling into core") dropped module support from
    net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
    noop for built-in controllers. Drop it along with
    init_netclassid_cgroup().

    v2: Removed modular version of task_netprioidx() in
    include/net/netprio_cgroup.h as suggested by Li Zefan.

    v3: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core"). net_cls cgroup part is mostly
    dropped except for removal of init_netclassid_cgroup().

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Thomas Graf

    Tejun Heo
     

04 Jan, 2014

1 commit

  • While we're at it, having introduced CGROUP_NET_CLASSID, let's also make
    NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
    to CONFIG_CGROUP_NET_PRIO so that for networking, we now have
    CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
    option consistent among networking cgroups, but also among cgroups
    CONFIG conventions in general as the vast majority has a prefix of
    CONFIG_CGROUP_.

    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann