26 Mar, 2020

1 commit

  • net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
    pkt->skb->tc_redirected = 1;
    ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
    pkt->skb->tc_from_ingress = 1;
    ^~

    To avoid a direct dependency with tc actions from netfilter, wrap the
    redirect bits around CONFIG_NET_REDIRECT and move helpers to
    include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
    only existing client of these bits in the tree.

    This patch adds skb_set_redirected(), which sets the redirected bit
    on the skbuff, records whether the packet was redirected from ingress,
    and resets the timestamp (the timestamp reset was originally missing in
    the netfilter bugfix).
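
    As a rough userspace illustration of what such a helper does, the
    sketch below models the three steps named above. The struct toy_skb
    type and the toy_skb_set_redirected() name are hypothetical stand-ins,
    not the kernel definitions, and the unconditional timestamp reset
    simply follows the wording of this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct sk_buff; only the fields discussed above. */
struct toy_skb {
	uint64_t tstamp;             /* packet timestamp */
	unsigned int redirected:1;   /* packet was redirected */
	unsigned int from_ingress:1; /* redirect happened on ingress */
};

/* Mark the packet as redirected, record the direction, reset the timestamp. */
static inline void toy_skb_set_redirected(struct toy_skb *skb, bool from_ingress)
{
	assert(skb);
	skb->redirected = 1;
	skb->from_ingress = from_ingress;
	skb->tstamp = 0;
}
```

    Keeping all three steps in one helper means a caller such as a
    forwarding expression cannot set the bits and forget the timestamp.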

    Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
    Reported-by: noreply@ellerman.id.au
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

21 Feb, 2020

1 commit

  • The description says 'If unsure, say N.' but
    the module is built as M by default (once
    the dependencies are satisfied).

    When the module is selected (Y or M), it enables
    NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
    which alter kernel internal structures.

    We (Android Studio Emulator) currently do not
    use this module and think it is more consistent
    to have it disabled by default, as opposed to
    disabling it explicitly to prevent enabling
    NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.

    Signed-off-by: Roman Kiryanov
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Roman Kiryanov
     

24 Jan, 2020

1 commit

  • Implements the infrastructure for MPTCP sockets.

    MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
    sockets are only managed by the MPTCP socket that owns them and are not
    visible from userspace. This commit allows a userspace program to open
    an MPTCP socket with:

    sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    The resulting socket is simply a wrapper around a single regular TCP
    socket, without any of the MPTCP protocol implemented over the wire.
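
    A userspace caller might wrap the call above as in the following
    sketch. The open_stream_socket() helper and its fallback to plain TCP
    are assumptions made for illustration; the fallback define of
    IPPROTO_MPTCP is only there for C libraries whose headers predate
    the protocol number.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262 /* protocol number used by the Linux UAPI headers */
#endif

/*
 * Try to open an MPTCP socket; if the kernel refuses (for example
 * because MPTCP is not built in), fall back to a regular TCP socket.
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_stream_socket(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd < 0)
		fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	return fd;
}
```

    The returned descriptor is used like any other stream socket and
    should be released with close() when no longer needed.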

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Mat Martineau
     

28 Dec, 2019

1 commit

  • Basic genetlink and init infrastructure for the netlink interface, register
    genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
    make the build optional. Add initial overall interface description into
    Documentation/networking/ethtool-netlink.rst, further patches will add more
    detailed information.

    Signed-off-by: Michal Kubecek
    Reviewed-by: Florian Fainelli
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Michal Kubecek
     

26 Dec, 2019

1 commit

  • While PHY time stamping drivers can simply attach their interface
    directly to the PHY instance, stand alone drivers require support in
    order to manage their services. Non-PHY MII time stamping drivers
    have a control interface over another bus like I2C, SPI, UART, or via
    a memory mapped peripheral. The controller device will be associated
    with one or more time stamping channels, each of which snoops in
    on an MII bus.

    This patch provides a glue layer that will enable time stamping
    channels to find their controlling device.

    Signed-off-by: Richard Cochran
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Richard Cochran
     

22 Nov, 2019

1 commit


18 Aug, 2019

1 commit

  • Add the basic packet trap infrastructure that allows device drivers to
    register their supported packet traps and trap groups with devlink.

    Each driver is expected to provide basic information about each
    supported trap, such as name and ID, but also the supported metadata
    types that will accompany each packet trapped via the trap. The
    currently supported metadata type is just the input port, but more will
    be added in the future. For example, output port and traffic class.

    Trap groups allow users to set the action of all member traps. In
    addition, users can retrieve per-group statistics in case per-trap
    statistics are too narrow. In the future, the trap group object can be
    extended with more attributes, such as policer settings which will limit
    the amount of traffic generated by member traps towards the CPU.

    Beside registering their packet traps with devlink, drivers are also
    expected to report trapped packets to devlink along with relevant
    metadata. devlink will maintain packets and bytes statistics for each
    packet trap and will potentially report the trapped packet with its
    metadata to user space via drop monitor netlink channel.

    The interface towards the drivers is simple and allows devlink to set
    the action of the trap. Currently, only two actions are supported:
    'trap' and 'drop'. When set to 'trap', the device is expected to provide
    the sole copy of the packet to the driver which will pass it to devlink.
    When set to 'drop', the device is expected to drop the packet and not
    send a copy to the driver. In the future, more actions can be added,
    such as 'mirror'.

    Signed-off-by: Ido Schimmel
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

18 Jun, 2019

1 commit

  • Using a bare block cipher in non-crypto code is almost always a bad idea,
    not only for security reasons (and we've seen some examples of this in
    the kernel in the past), but also for performance reasons.

    In the TCP fastopen case, we call into the bare AES block cipher one or
    two times (depending on whether the connection is IPv4 or IPv6). On most
    systems, this results in a call chain such as

    crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
    aesni_encrypt
    kernel_fpu_begin();
    aesni_enc(ctx, dst, src); // asm routine
    kernel_fpu_end();

    It is highly unlikely that the use of special AES instructions has a
    benefit in this case, especially since we are doing the above twice
    for IPv6 connections, instead of using a transform which can process
    the entire input in one go.

    We could switch to the cbcmac(aes) shash, which would at least get
    rid of the duplicated overhead in *some* cases (i.e., today, only
    arm64 has an accelerated implementation of cbcmac(aes), while x86 will
    end up using the generic cbcmac template wrapping the AES-NI cipher,
    which basically ends up doing exactly the above). However, in the given
    context, it makes more sense to use a light-weight MAC algorithm that
    is more suitable for the purpose at hand, such as SipHash.

    Since the output size of SipHash already matches our chosen value for
    TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
    sizes, this greatly simplifies the code as well.
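
    To make the size argument concrete, here is a self-contained
    userspace sketch of SipHash-2-4. It is not the kernel's siphash()
    implementation; the siphash24() and load_le() names are chosen for
    this example. The point is that one call consumes an input of any
    length and yields a 64-bit value, i.e. 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND(v0, v1, v2, v3) do {                                 \
	v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
	v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                      \
	v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                      \
	v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

/* Load n (at most 8) bytes as a little-endian 64-bit word. */
static uint64_t load_le(const uint8_t *p, size_t n)
{
	uint64_t w = 0;

	for (size_t i = 0; i < n; i++)
		w |= (uint64_t)p[i] << (8 * i);
	return w;
}

/* SipHash-2-4 with a 128-bit key; returns a 64-bit tag. */
static uint64_t siphash24(const uint8_t key[16], const uint8_t *in, size_t len)
{
	uint64_t k0 = load_le(key, 8), k1 = load_le(key + 8, 8);
	uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
	uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
	uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
	uint64_t v3 = k1 ^ 0x7465646279746573ULL;
	uint64_t b;
	size_t i;

	/* Compression: two rounds per full 8-byte word. */
	for (i = 0; i + 8 <= len; i += 8) {
		uint64_t m = load_le(in + i, 8);

		v3 ^= m;
		SIPROUND(v0, v1, v2, v3);
		SIPROUND(v0, v1, v2, v3);
		v0 ^= m;
	}

	/* Last word: remaining bytes plus the length in the top byte. */
	b = ((uint64_t)len << 56) | load_le(in + i, len - i);
	v3 ^= b;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= b;

	/* Finalization: four rounds. */
	v2 ^= 0xff;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	return v0 ^ v1 ^ v2 ^ v3;
}
```

    Feeding the concatenated source and destination addresses to such a
    function covers the IPv4 and IPv6 cases with the same single call,
    which is where the simplification comes from.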

    NOTE: Server farms backing a single server IP for load balancing purposes
    and sharing a single fastopen key will be adversely affected by
    this change unless all systems in the pool receive their kernel
    upgrades at the same time.

    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ard Biesheuvel
     

21 May, 2019

1 commit


25 Mar, 2019

1 commit

  • Some drivers are becoming more dependent on NET_DEVLINK being selected
    in configuration. With upcoming compat functions, the behavior would be
    wrong in case devlink was not compiled in. So make the drivers select
    NET_DEVLINK and rely on the functions being there, not just stubs.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

27 Feb, 2019

1 commit

  • Being able to build devlink as a module causes growing pains.
    First all drivers had to add a meta dependency to make sure
    they are not built in when devlink is built as a module. Now
    we are struggling to invoke ethtool compat code reliably.

    Make devlink code built-in. Users can still choose not to build it at
    all, but the dynamically loadable module option is removed.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Florian Fainelli
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

16 Feb, 2019

1 commit

  • Lightweight tunnels are L3 constructs that are used with IP/IP6.

    For example, lwtunnel_xmit is called from ip_output.c and
    ip6_output.c only.

    Make the dependency explicit at least for LWT-BPF, as now they
    call into IP routing.

    V2: added "Reported-by" below.

    Reported-by: Randy Dunlap
    Signed-off-by: Peter Oskolkov
    Acked-by: Randy Dunlap # build-tested
    Signed-off-by: Daniel Borkmann

    Peter Oskolkov
     

20 Dec, 2018

2 commits

  • This converts the bridge netfilter (calling iptables hooks from bridge)
    facility to use the extension infrastructure.

    The bridge_nf specific hooks in skb clone and free paths are removed, they
    have been replaced by the skb_ext hooks that do the same as the bridge nf
    allocations hooks did.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This adds an optional extension infrastructure, with ipsec (xfrm) and
    bridge netfilter as first users.
    objdiff shows no changes if kernel is built without xfrm and br_netfilter
    support.

    The third (planned future) user is Multipath TCP which is still
    out-of-tree.
    MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
    numbers used by individual subflows.

    This DSS mapping is read/written from tcp option space on receive and
    written to tcp option space on transmitted tcp packets that are part of
    an MPTCP connection.

    Extending skb_shared_info or adding a private data field to skb fclones
    doesn't work for incoming skb, so a different DSS propagation method would
    be required for the receive side.

    mptcp has same requirements as secpath/bridge netfilter:

    1. extension memory is released when the sk_buff is free'd.
    2. data is shared after cloning an skb (clone inherits extension)
    3. adding extension to an skb will COW the extension buffer if needed.

    The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
    mapping for tx and rx processing.

    Two new members are added to sk_buff:
    1. 'active_extensions' byte (filling a hole), telling which extensions
    are available for this skb.
    This has two purposes.
    a) avoids the need to initialize the pointer.
    b) allows to "delete" an extension by clearing its bit
    value in ->active_extensions.

    While it would be possible to store the active_extensions byte
    in the extension struct instead of sk_buff, there is one problem
    with this:
    When an extension has to be disabled, we can always clear the
    bit in skb->active_extensions. But in case it would be stored in the
    extension buffer itself, we might have to COW it first, if
    we are dealing with a cloned skb. On kmalloc failure we would
    be unable to turn an extension off.

    2. extension pointer, located at the end of the sk_buff.
    If the active_extensions byte is 0, the pointer is undefined,
    it is not initialized on skb allocation.
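
    The role of the active_extensions byte can be modelled in userspace
    as in the sketch below. All toy_* names and the two extension ids are
    hypothetical; the sketch only shows why keeping the bitmask in the skb
    itself lets an extension be turned off without touching (or
    copying) the shared extension buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension ids, one bit each in active_extensions. */
enum toy_ext_id { TOY_EXT_SEC_PATH, TOY_EXT_BRIDGE_NF, TOY_EXT_NUM };

struct toy_skb {
	uint8_t active_extensions; /* one bit per extension id */
	void *extensions;          /* only meaningful if active_extensions != 0 */
};

static bool toy_ext_exist(const struct toy_skb *skb, enum toy_ext_id id)
{
	return skb->active_extensions & (1u << id);
}

/* Record that extension 'id' is present on this skb. */
static void toy_ext_mark(struct toy_skb *skb, enum toy_ext_id id)
{
	assert(id < TOY_EXT_NUM);
	skb->active_extensions |= (uint8_t)(1u << id);
}

/* "Delete" cannot fail: it clears a bit in the skb, no allocation needed. */
static void toy_ext_del(struct toy_skb *skb, enum toy_ext_id id)
{
	skb->active_extensions &= (uint8_t)~(1u << id);
}
```

    Because toy_ext_del() never allocates, it has no failure path, which
    mirrors the kmalloc argument made above.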

    This adds extra code to skb clone and free paths (to deal with
    refcount/free of extension area) but this replaces similar code that
    manages skb->nf_bridge and skb->sp structs in the followup patches of
    the series.

    It is possible to add support for extensions that are not preserved on
    clones/copies.

    To do this, it would be needed to define a bitmask of all extensions that
    need copy/cow semantics, and change __skb_ext_copy() to check
    ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
    ->active_extensions to 0 on the new clone.

    This isn't done here because all extensions that get added here
    need the copy/cow semantics.

    v2:
    Allocate entire extension space using kmem_cache.
    Upside is that this allows better tracking of used memory,
    downside is that we will allocate more space than strictly needed in
    most cases (it's unlikely that all extensions are active/needed at same
    time for same skb).
    The allocated memory (except the small extension header) is not cleared,
    so no additional overhead aside from memory usage.

    Avoid atomic_dec_and_test operation on skb_ext_put()
    by using similar trick as kfree_skbmem() does with fclone_ref:
    If refcount is 1, there is no concurrent user and we can free right away.
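
    A minimal userspace sketch of this fast path, using C11 atomics in
    place of the kernel's refcount API, could look as follows. The
    toy_ext type and helper names are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_ext {
	atomic_int refcnt;
};

/* Allocate an object holding one reference, or return NULL. */
static struct toy_ext *toy_ext_alloc(void)
{
	struct toy_ext *e = malloc(sizeof(*e));

	if (e)
		atomic_init(&e->refcnt, 1);
	return e;
}

static void toy_ext_get(struct toy_ext *e)
{
	atomic_fetch_add(&e->refcnt, 1);
}

/* Drop one reference; returns true if this call freed the object. */
static bool toy_ext_put(struct toy_ext *e)
{
	/*
	 * If we hold the only reference, nobody else can race with us,
	 * so a plain load is enough and the atomic decrement is skipped.
	 * Otherwise decrement and free only if we dropped the last one.
	 */
	if (atomic_load(&e->refcnt) != 1 &&
	    atomic_fetch_sub(&e->refcnt, 1) != 1)
		return false;

	free(e);
	return true;
}
```

    The common case of an unshared object then costs a load and a
    compare instead of an atomic read-modify-write.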

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

16 Oct, 2018

1 commit

  • Add a generic sk_msg layer, and convert current sockmap and later
    kTLS over to make use of it. While sk_buff handles network packet
    representation from netdevice up to socket, sk_msg handles data
    representation from application to socket layer.

    This means that sk_msg framework spans across ULP users in the
    kernel, and enables features such as introspection or filtering
    of data with the help of BPF programs that operate on this data
    structure.

    Latter becomes in particular useful for kTLS where data encryption
    is deferred into the kernel, and as such enabling the kernel to
    perform L7 introspection and policy based on BPF for TLS connections
    where the record is being encrypted after BPF has run and come to
    a verdict. In order to get there, first step is to transform open
    coding of scatter-gather list handling into a common core framework
    that subsystems can use.

    The code itself has been split and refactored into three bigger
    pieces: i) the generic sk_msg API which deals with managing the
    scatter gather ring, providing helpers for walking and mangling,
    transferring application data from user space into it, and preparing
    it for BPF pre/post-processing, ii) the plain sock map itself
    where sockets can be attached to or detached from; these bits
    are independent of i) which can now be used also without sock
    map, and iii) the integration with plain TCP as one protocol
    to be used for processing L7 application data (later this could
    e.g. also be extended to other protocols like UDP). The semantics
    are the same with the old sock map code and therefore no change
    of user facing behavior or APIs. While pursuing this work it
    also helped finding a number of bugs in the old sockmap code
    that we've fixed already in earlier commits. The test_sockmap
    kselftest suite passes through fine as well.

    Joint work with John.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

25 Jul, 2018

1 commit


29 May, 2018

1 commit

  • The failover module provides a generic interface for paravirtual drivers
    to register a netdev and a set of ops with a failover instance. The ops
    are used as event handlers that get called to handle netdev register/
    unregister/link change/name change events on slave pci ethernet devices
    with the same mac address as the failover netdev.

    This enables paravirtual drivers to use a VF as an accelerated low latency
    datapath. It also allows migration of VMs with direct attached VFs by
    failing over to the paravirtual datapath when the VF is unplugged.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

24 May, 2018

1 commit

  • bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
    and user mode helper code that is embedded into bpfilter.ko

    The steps to build bpfilter.ko are the following:
    - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
    - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
    is converted into bpfilter_umh.o object file
    with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
    Example:
    $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
    0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
    0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
    0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
    - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

    bpfilter_kern.c is a normal kernel module code that calls
    the fork_usermode_blob() helper to execute part of its own data
    as a user mode process.

    Notice that _binary_net_bpfilter_bpfilter_umh_start - end
    is placed into .init.rodata section, so it's freed as soon as __init
    function of bpfilter.ko is finished.
    As part of __init the bpfilter.ko does first request/reply action
    via two unix pipes provided by fork_usermode_blob() helper to
    make sure that umh is healthy. If not it will kill it via pid.

    Later bpfilter_process_sockopt() will be called from bpfilter hooks
    in get/setsockopt() to pass iptable commands into umh via bpfilter.ko

    If admin does 'rmmod bpfilter' the __exit code of bpfilter.ko will
    kill umh as well.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 May, 2018

1 commit


04 May, 2018

1 commit


01 May, 2018

1 commit


17 Apr, 2018

1 commit

  • Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
    pages on DMA-TX completion time, which has good cross CPU
    performance, given DMA-TX completion time can happen on a remote CPU.

    Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
    Adapted page_pool code to not depend on the page allocator and
    integration into struct page. The DMA mapping feature is kept,
    even though it will not be activated/used in this patchset.

    [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

    V2: Adjustments requested by Tariq
    - Changed page_pool_create return codes, don't return NULL, only
    ERR_PTR, as this simplifies err handling in drivers.

    V4: many small improvements and cleanups
    - Add DOC comment section, that can be used by kernel-doc
    - Improve fallback mode, to work better with refcnt based recycling
    e.g. remove a WARN as pointed out by Tariq
    e.g. quicker fallback if ptr_ring is empty.

    V5: Fixed SPDX license as pointed out by Alexei

    V6: Adjustments requested by Eric Dumazet
    - Adjust ____cacheline_aligned_in_smp usage/placement
    - Move rcu_head in struct page_pool
    - Free pages quicker on destroy, minimize resources delayed an RCU period
    - Remove code for forward/backward compat ABI interface

    V8: Issues found by kbuild test robot
    - Address sparse should be static warnings
    - Only compile+link when a driver use/select page_pool,
    mlx5 selects CONFIG_PAGE_POOL, although it's first used in two patches
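
    The V2 note about page_pool_create returning only ERR_PTR values can
    be illustrated with a userspace imitation of the error-pointer
    convention. Everything prefixed toy_ is hypothetical, including the
    toy_pool_create() signature, which is not the real page_pool API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno in the top, never-mapped range of pointers. */
static inline void *toy_err_ptr(intptr_t err) { return (void *)err; }
static inline intptr_t toy_ptr_err(const void *p) { return (intptr_t)p; }
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_pool {
	unsigned int size;
};

/* Never returns NULL: either a valid pool or an encoded errno. */
static struct toy_pool *toy_pool_create(unsigned int size)
{
	struct toy_pool *pool;

	if (size == 0)
		return toy_err_ptr(-EINVAL);

	pool = malloc(sizeof(*pool));
	if (!pool)
		return toy_err_ptr(-ENOMEM);

	pool->size = size;
	return pool;
}
```

    A caller then needs a single toy_is_err() check and can propagate
    toy_ptr_err() directly, rather than handling NULL and error pointers
    as two separate cases.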

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Feb, 2018

1 commit

  • Pull staging/IIO updates from Greg KH:
    "Here is the big Staging and IIO driver patches for 4.16-rc1.

    There is the normal amount of new IIO drivers added, like all
    releases.

    The networking IPX and the ncpfs filesystem are moved into the staging
    tree, as they are on their way out of the kernel due to lack of use
    anymore.

    The visorbus subsystem finally has started moving out of the staging
    tree to the "real" part of the kernel, and the most and fsl-mc
    codebases are almost ready to move out, that will probably happen for
    4.17-rc1 if all goes well.

    Other than that, there is a bunch of license header cleanups in the
    tree, along with the normal amount of coding style churn that we all
    know and love for this codebase. I also got frustrated at the
    Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
    huge chunks of it that were never even being used.

    Full details of everything is in the shortlog.

    All of these patches have been in linux-next for a while with no
    reported issues"

    * tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
    staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
    staging: rtl8723bs: remove a couple of redundant initializations
    staging: comedi: reformat lines to 80 chars or less
    staging: lustre: separate a connection destroy from free struct kib_conn
    Staging: rtl8723bs: Use !x instead of NULL comparison
    Staging: rtl8723bs: Remove dead code
    Staging: rtl8723bs: Change names to conform to the kernel code
    staging: ccree: Fix missing blank line after declaration
    staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
    staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
    staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
    staging: comedi: dt2811: remove redundant initialization of 'ns'
    staging: wilc1000: fix alignments to match open parenthesis
    staging: wilc1000: removed unnecessary defined enums typedef
    staging: wilc1000: remove unnecessary use of parentheses
    staging: rtl8192u: remove redundant initialization of 'timeout'
    staging: sm750fb: fix CamelCase for dispSet var
    staging: lustre: lnet/selftest: fix compile error on UP build
    staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
    staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
    ...

    Linus Torvalds
     

09 Jan, 2018

1 commit


03 Jan, 2018

1 commit


28 Nov, 2017

1 commit


04 Sep, 2017

1 commit


30 Aug, 2017

1 commit

  • Add a new nsh/ directory. It currently holds only GSO functions but more
    will come: in particular, code shared by openvswitch and tc to manipulate
    NSH headers.

    For now, assume there's no hardware support for NSH segmentation. We can
    always introduce netdev->nsh_features later.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

29 Aug, 2017

2 commits

  • It's time to get rid of IRDA. It's long been broken, and no one seems
    to use it anymore. So move it to staging and after a while, we can
    delete it from there.

    To start, move the network irda core from net/irda to
    drivers/staging/irda/net/

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: David S. Miller

    Greg Kroah-Hartman
     
  • SOCKMAP uses strparser code (compiled with Kconfig option
    CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
    config option set, sockmap won't be compiled. However, at the moment
    the only way to pull in the strparser code is to enable KCM.

    To resolve this create a BPF specific config option to pull
    only the strparser piece in that sockmap needs. This also
    allows folks who want to use BPF/syscall/maps but don't need
    sockmap to easily opt out.

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jun, 2017

1 commit

  • Software implementation of transport layer security, implemented using ULP
    infrastructure. tcp proto_ops are replaced with tls equivalents of sendmsg and
    sendpage.

    Only symmetric crypto is done in the kernel, keys are passed by setsockopt
    after the handshake is complete. All control messages are supported via CMSG
    data - the actual symmetric encryption is the same, just the message type needs
    to be passed separately.

    For user API, please see Documentation patch.

    Pieces that can be shared between hw and sw implementation
    are in tls_main.c

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Aviad Yehezkel
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file is done via RCU list, which holds
    a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_<tag>, in order to aid
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2017

1 commit


04 Feb, 2017

1 commit

  • This module is responsible for the ife encapsulation protocol
    encode/decode logics. That module can:
    - ife_encode: encode skb and reserve space for the ife meta header
    - ife_decode: decode skb and extract the meta header size
    - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
    header space.
    - ife_tlv_meta_decode - decodes one tlv entry from the packet
    - ife_tlv_meta_next - advance to the next tlv

    Reviewed-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: Roman Mashak
    Signed-off-by: David S. Miller

    Yotam Gigi
     

25 Jan, 2017

1 commit

  • Add a general way for kernel modules to sample packets, without being tied
    to any specific subsystem. This netlink channel can be used by tc,
    iptables, etc. and allow to standardize packet sampling in the kernel.

    For every sampled packet, the psample module adds the following metadata
    fields:

    PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable

    PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable

    PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
    truncated during sampling

    PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
    user who initiated the sampling. This field allows the user to
    differentiate between several samplers working simultaneously and
    filter packets relevant to him

    PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
    sequence is kept for each group

    PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

    PSAMPLE_ATTR_DATA - the actual packet bits

    The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
    group. In addition, add the GET_GROUPS netlink command which allows the
    user to see the current sample groups, their refcount and sequence number.
    This command currently supports only netlink dump mode.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

12 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
    not right when CONFIG_NET is disabled and there is no socket interface:

    warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

    I don't know what the correct solution for this is, but simply removing
    the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
    'if NET' section avoids the warning and does not produce other build
    errors.

    Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

10 Jan, 2017

1 commit

  • * enable smc module loading and unloading
    * register new socket family
    * basic smc socket creation and deletion
    * use backing TCP socket to run CLC (Connection Layer Control)
    handshake of SMC protocol
    * Setup for infiniband traffic is implemented in follow-on patches.
    For now fallback to TCP socket is always used.

    Signed-off-by: Ursula Braun
    Reviewed-by: Utz Bacher
    Signed-off-by: David S. Miller

    Ursula Braun
     

02 Dec, 2016

1 commit

  • Registers new BPF program types which correspond to the LWT hooks:
    - BPF_PROG_TYPE_LWT_IN => dst_input()
    - BPF_PROG_TYPE_LWT_OUT => dst_output()
    - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

    The separate program types are required to differentiate between the
    capabilities each LWT hook allows:

    * Programs attached to dst_input() or dst_output() are restricted and
    may only read the data of an skb. This prevents modification and
    possible invalidation of already validated packet headers on receive
    and the construction of illegal headers while the IP headers are
    still being assembled.

    * Programs attached to lwtunnel_xmit() are allowed to modify packet
    content as well as prepending an L2 header via a newly introduced
    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
    invoked after the IP header has been assembled completely.

    All BPF programs receive an skb with L3 headers attached and may return
    one of the following return codes:

    BPF_OK - Continue routing as per nexthop
    BPF_DROP - Drop skb and return EPERM
    BPF_REDIRECT - Redirect skb to device as per redirect() helper.
    (Only valid in lwtunnel_xmit() context)

    The return codes are binary compatible with their TC_ACT_
    relatives to ease compatibility.
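
    The return-code handling described above can be modelled in plain C as
    follows. This is a user-space sketch, not the kernel code: the numeric
    values match the kernel UAPI definitions of BPF_OK, BPF_DROP and
    BPF_REDIRECT (and of TC_ACT_OK, TC_ACT_SHOT and TC_ACT_REDIRECT), but the
    lwt_hook and lwt_action enums and the lwt_handle_ret() function are
    hypothetical names. Rejecting a redirect outside the xmit hook as invalid
    is an assumption of this model; the text above only states that redirect
    is valid in lwtunnel_xmit() context.

```c
/* Values as in the kernel UAPI headers; they equal TC_ACT_OK (0),
 * TC_ACT_SHOT (2) and TC_ACT_REDIRECT (7) respectively. */
enum lwt_bpf_ret {
    BPF_OK = 0,
    BPF_DROP = 2,
    BPF_REDIRECT = 7
};

/* Hypothetical: which LWT hook the program was attached to. */
enum lwt_hook { LWT_IN, LWT_OUT, LWT_XMIT };

/* Hypothetical: what the caller does with the skb afterwards. */
enum lwt_action {
    LWT_CONTINUE,     /* continue routing as per nexthop */
    LWT_DROP_EPERM,   /* drop the skb and return EPERM */
    LWT_REDIRECT_DEV, /* redirect to the device chosen by the helper */
    LWT_INVALID       /* return code not acceptable in this context */
};

/* Map a program's return code to an action for the given hook. */
static enum lwt_action lwt_handle_ret(enum lwt_hook hook, int ret)
{
    switch (ret) {
    case BPF_OK:
        return LWT_CONTINUE;
    case BPF_DROP:
        return LWT_DROP_EPERM;
    case BPF_REDIRECT:
        /* Only meaningful in lwtunnel_xmit() context. */
        return hook == LWT_XMIT ? LWT_REDIRECT_DEV : LWT_INVALID;
    default:
        return LWT_INVALID;
    }
}
```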

    Signed-off-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Thomas Graf
     

18 Aug, 2016

1 commit

  • This patch introduces a utility for parsing application layer protocol
    messages in a TCP stream. This is a generalization of the mechanism
    implemented in the Kernel Connection Multiplexor (KCM).

    The API includes a context structure, a set of callbacks, utility
    functions, and a data ready function.

    A stream parser instance is defined by a strparser structure that
    is bound to a TCP socket. The function to initialize the structure
    is:

    int strp_init(struct strparser *strp, struct sock *csk,
    struct strp_callbacks *cb);

    csk is the TCP socket being bound to and cb are the parser callbacks.

    The upper layer calls strp_tcp_data_ready when data is ready on the lower
    socket for strparser to process. This should be called from a data_ready
    callback that is set on the socket:

    void strp_tcp_data_ready(struct strparser *strp);

    A parser is bound to a TCP socket by setting data_ready function to
    strp_tcp_data_ready so that all receive indications on the socket
    go through the parser. This assumes that sk_user_data is set to
    the strparser structure.

    There are four callbacks.
    - parse_msg is called to parse the message (returns length or error).
    - rcv_msg is called when a complete message has been received
    - read_sock_done is called when data_ready function exits
    - abort_parser is called to abort the parser

    The input to parse_msg is an skbuff which contains next message under
    construction. The backend processing of parse_msg will parse the
    application layer protocol headers to determine the length of
    the message in the stream. The possible return values are:

    >0 : indicates length of successfully parsed message
    0 : indicates more data must be received to parse the message
    -ESTRPIPE : current message should not be processed by the
    kernel, return control of the socket to userspace which
    can proceed to read the messages itself
    other < 0 : Error in parsing, give control back to userspace
    assuming that synchronization is lost and the stream
    is unrecoverable (application expected to close TCP socket)
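
    To make the return-value contract concrete, here is a user-space sketch of
    a parse_msg-style function for an invented length-prefixed protocol. It
    takes a byte buffer instead of an skbuff, and everything about the wire
    format (the 4-byte header, the two magic values, the demo_parse_msg name)
    is hypothetical. Only the meaning of the return values follows the list
    above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ESTRPIPE
#define ESTRPIPE 86 /* Linux value, defined here only for non-Linux hosts */
#endif

/* Hypothetical wire format: 2-byte magic, then 2-byte big-endian body length. */
#define DEMO_HDR_LEN      4u
#define DEMO_MAGIC_KERNEL 0xA55Au /* message handled by the parser */
#define DEMO_MAGIC_USER   0x5AA5u /* message userspace must read itself */

/*
 * Returns:
 *   > 0        total message length (header plus body)
 *   0          header not complete yet, more data is needed
 *   -ESTRPIPE  hand the socket back to userspace at a recoverable point
 *   other < 0  parse error, stream considered unrecoverable
 */
static int demo_parse_msg(const uint8_t *buf, size_t len)
{
    unsigned int magic, body;

    if (len < DEMO_HDR_LEN)
        return 0;

    magic = ((unsigned int)buf[0] << 8) | buf[1];
    body = ((unsigned int)buf[2] << 8) | buf[3];

    if (magic == DEMO_MAGIC_USER)
        return -ESTRPIPE;
    if (magic != DEMO_MAGIC_KERNEL)
        return -EPROTO;

    /* The body may not have arrived yet; only the length is reported. */
    return (int)(DEMO_HDR_LEN + body);
}
```

    Note that the function reports the full message length as soon as the
    header is readable, even if the body is still in flight; collecting the
    remaining bytes is left to the caller.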

    In the case of error return (< 0) strparser will stop the parser
    and report an error to userspace. The application must deal
    with the error. To handle the error the strparser is unbound
    from the TCP socket. If the error indicates that the stream
    TCP socket is at a recoverable point (ESTRPIPE) then the application
    can read the TCP socket to process the stream. Once the application
    has dealt with the exceptions in the stream, it may again bind the
    socket to a strparser to continue data operations.

    Note that ENODATA may be returned to the application. In this case
    parse_msg returned -ESTRPIPE, however strparser was unable to maintain
    synchronization of the stream (i.e. some of the message in question
    was already read by the parser).

    strp_pause and strp_unpause are used to provide flow control. For
    instance, if rcv_msg is called but the upper layer can't immediately
    consume the message it can hold the message and pause strparser.
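
    The flow-control pattern can be sketched with a small user-space model.
    None of the names below are the strparser API: demo_parser stands in for
    the parser state, demo_pause() and demo_unpause() for strp_pause and
    strp_unpause, and the fixed queue capacity is an assumption chosen to keep
    the example short.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_CAP 2 /* hypothetical capacity of the upper layer */

/* Stand-in for the parser: only the paused flag is modelled. */
struct demo_parser {
    bool paused;
};

/* Stand-in for the upper layer that owns the parser. */
struct demo_upper {
    struct demo_parser parser;
    size_t queued; /* messages held but not yet consumed */
};

static void demo_pause(struct demo_parser *p)   { p->paused = true; }
static void demo_unpause(struct demo_parser *p) { p->paused = false; }

/* Models rcv_msg: hold the message and pause once the queue is full. */
static void demo_rcv_msg(struct demo_upper *u)
{
    u->queued++;
    if (u->queued >= DEMO_QUEUE_CAP)
        demo_pause(&u->parser);
}

/* Deliver up to n complete messages, stopping while the parser is paused.
 * Returns the number of messages actually delivered. */
static size_t demo_deliver(struct demo_upper *u, size_t n)
{
    size_t done = 0;

    while (done < n && !u->parser.paused) {
        demo_rcv_msg(u);
        done++;
    }
    return done;
}

/* The upper layer consumes one message and resumes the parser if there
 * is room again. */
static void demo_consume(struct demo_upper *u)
{
    if (u->queued > 0)
        u->queued--;
    if (u->queued < DEMO_QUEUE_CAP)
        demo_unpause(&u->parser);
}
```

    In this model no message is delivered while the parser is paused, so the
    upper layer never holds more than DEMO_QUEUE_CAP messages at once.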

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert