19 Jun, 2020

1 commit

  • * tag 'v5.4.47': (2193 commits)
    Linux 5.4.47
    KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
    KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
    ...

    Conflicts:
    arch/arm/boot/dts/imx6qdl.dtsi
    arch/arm/mach-imx/Kconfig
    arch/arm/mach-imx/common.h
    arch/arm/mach-imx/suspend-imx6.S
    arch/arm64/boot/dts/freescale/imx8qxp-mek.dts
    arch/powerpc/include/asm/cacheflush.h
    drivers/cpufreq/imx6q-cpufreq.c
    drivers/dma/imx-sdma.c
    drivers/edac/synopsys_edac.c
    drivers/firmware/imx/imx-scu.c
    drivers/net/ethernet/freescale/fec.h
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/phy_device.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/usb/cdns3/gadget.c
    drivers/usb/dwc3/gadget.c
    include/uapi/linux/dma-buf.h

    Signed-off-by: Jason Liu

    Jason Liu
     

01 Apr, 2020

1 commit

  • commit 2c64605b590edadb3fb46d1ec6badb49e940b479 upstream.

    net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
    pkt->skb->tc_redirected = 1;
    ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
    pkt->skb->tc_from_ingress = 1;
    ^~

    To avoid a direct dependency with tc actions from netfilter, wrap the
    redirect bits around CONFIG_NET_REDIRECT and move helpers to
    include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
    only existing client of these bits in the tree.

    This patch adds skb_set_redirected(), which sets the redirected bit
    on the skbuff, specifies whether the packet was redirected from ingress,
    and resets the timestamp (the timestamp reset was originally missing in
    the netfilter bugfix).
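
    The helper described above can be pictured with a small user-space
    sketch. The struct toy_skb type and the toy_skb_set_redirected() name are
    hypothetical stand-ins, not the kernel's definitions, and the sketch
    resets the timestamp unconditionally, whereas the real helper may apply
    further conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct sk_buff: only the fields
 * this sketch needs. The real structure has many more members. */
struct toy_skb {
	uint64_t tstamp;       /* packet timestamp */
	bool redirected;       /* packet has been redirected */
	bool from_ingress;     /* redirect happened on the ingress path */
};

/* Mark the buffer as redirected, record where the redirect came from,
 * and reset the timestamp (assumption: reset is unconditional here). */
static inline void toy_skb_set_redirected(struct toy_skb *skb,
					  bool from_ingress)
{
	skb->redirected = true;
	skb->from_ingress = from_ingress;
	skb->tstamp = 0;
}
```

    Keeping all three updates in one helper means a caller cannot set the
    redirect bits and forget the timestamp reset.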

    Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
    Reported-by: noreply@ellerman.id.au
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pablo Neira Ayuso
     

02 Dec, 2019

1 commit

  • This patch provides a netlink method to configure TSN protocol hardware.
    TSN guarantees packet transport with bounded low latency, low packet delay
    variation, and low packet loss by hardware and software methods.

    The three basic components of TSN are:

    1. Time synchronization: This is implemented by 802.1AS, which is based on
    the IEEE 1588 Precision Time Protocol. It is configured by other means
    in the kernel.
    802.1AS is not included in this patch.

    2. Scheduling, traffic shaping, and per-stream filtering and policing:
    This patch supports Qbv/Qci/Qbu/802.1CB/Qav etc.

    3. Selection of communication paths:
    This patch does not support the pure software-only TSN protocols (like
    Qcc), only hardware-related configuration.

    TSN protocols supported by this patch: Qbv/Qci/Qbu/Credit-based Shaper (Qav).
    This patch was verified on the NXP ls1028ardb board.

    Signed-off-by: Po Liu

    Po Liu
     

18 Aug, 2019

1 commit

  • Add the basic packet trap infrastructure that allows device drivers to
    register their supported packet traps and trap groups with devlink.

    Each driver is expected to provide basic information about each
    supported trap, such as name and ID, but also the supported metadata
    types that will accompany each packet trapped via the trap. The
    currently supported metadata type is just the input port, but more will
    be added in the future. For example, output port and traffic class.

    Trap groups allow users to set the action of all member traps. In
    addition, users can retrieve per-group statistics in case per-trap
    statistics are too narrow. In the future, the trap group object can be
    extended with more attributes, such as policer settings which will limit
    the amount of traffic generated by member traps towards the CPU.

    Beside registering their packet traps with devlink, drivers are also
    expected to report trapped packets to devlink along with relevant
    metadata. devlink will maintain packets and bytes statistics for each
    packet trap and will potentially report the trapped packet with its
    metadata to user space via drop monitor netlink channel.

    The interface towards the drivers is simple and allows devlink to set
    the action of the trap. Currently, only two actions are supported:
    'trap' and 'drop'. When set to 'trap', the device is expected to provide
    the sole copy of the packet to the driver which will pass it to devlink.
    When set to 'drop', the device is expected to drop the packet and not
    send a copy to the driver. In the future, more actions can be added,
    such as 'mirror'.
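
    The two actions described above can be sketched in user-space C as
    follows. All names here (toy_trap, toy_trap_handle, the enum values) are
    hypothetical and are not the devlink API; the sketch only models the
    rule that 'trap' hands the sole copy to the driver and updates the
    statistics, while 'drop' hands over nothing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical action values; the source names only 'trap' and 'drop'. */
enum toy_trap_action {
	TOY_TRAP_ACTION_DROP,
	TOY_TRAP_ACTION_TRAP,
};

/* Hypothetical per-trap state: identity, current action, statistics. */
struct toy_trap {
	const char *name;
	enum toy_trap_action action;
	uint64_t rx_packets;
	uint64_t rx_bytes;
};

/* Returns true if the packet reaches the driver (and is accounted),
 * false if the device drops it without sending a copy. */
static bool toy_trap_handle(struct toy_trap *trap, size_t pkt_len)
{
	if (trap->action == TOY_TRAP_ACTION_DROP)
		return false;

	trap->rx_packets++;
	trap->rx_bytes += pkt_len;
	return true;
}
```

    A future action such as 'mirror' would fit this shape as a third enum
    value that both forwards the packet and reports a copy.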

    Signed-off-by: Ido Schimmel
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

18 Jun, 2019

1 commit

  • Using a bare block cipher in non-crypto code is almost always a bad idea,
    not only for security reasons (and we've seen some examples of this in
    the kernel in the past), but also for performance reasons.

    In the TCP fastopen case, we call into the bare AES block cipher one or
    two times (depending on whether the connection is IPv4 or IPv6). On most
    systems, this results in a call chain such as

    crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
    aesni_encrypt
    kernel_fpu_begin();
    aesni_enc(ctx, dst, src); // asm routine
    kernel_fpu_end();

    It is highly unlikely that the use of special AES instructions has a
    benefit in this case, especially since we are doing the above twice
    for IPv6 connections, instead of using a transform which can process
    the entire input in one go.

    We could switch to the cbcmac(aes) shash, which would at least get
    rid of the duplicated overhead in *some* cases (i.e., today, only
    arm64 has an accelerated implementation of cbcmac(aes), while x86 will
    end up using the generic cbcmac template wrapping the AES-NI cipher,
    which basically ends up doing exactly the above). However, in the given
    context, it makes more sense to use a light-weight MAC algorithm that
    is more suitable for the purpose at hand, such as SipHash.

    Since the output size of SipHash already matches our chosen value for
    TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
    sizes, this greatly simplifies the code as well.
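
    To illustrate why a single keyed hash over the whole input is simpler
    than one or two block-cipher calls, here is a self-contained user-space
    sketch of SipHash-2-4. The function names are hypothetical and this is
    not the kernel's siphash() implementation; how the kernel lays out the
    addresses it hashes is not shown in this message, so the input is
    treated as an opaque byte buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND(v0, v1, v2, v3) do {                                 \
	v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
	v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                      \
	v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                      \
	v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

/* Load up to 8 bytes as a little-endian 64-bit word. */
static uint64_t toy_load_le(const uint8_t *p, size_t n)
{
	uint64_t w = 0;
	size_t i;

	for (i = 0; i < n; i++)
		w |= (uint64_t)p[i] << (8 * i);
	return w;
}

/* SipHash-2-4 over an arbitrary-length input with a 128-bit key (k0, k1).
 * The 64-bit result has the same size as an 8-byte cookie. */
static uint64_t toy_siphash(const uint8_t *in, size_t len,
			    uint64_t k0, uint64_t k1)
{
	uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
	uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
	uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
	uint64_t v3 = k1 ^ 0x7465646279746573ULL;
	uint64_t m;
	size_t i;

	/* Compression: two rounds per full 8-byte word. */
	for (i = 0; i + 8 <= len; i += 8) {
		m = toy_load_le(in + i, 8);
		v3 ^= m;
		SIPROUND(v0, v1, v2, v3);
		SIPROUND(v0, v1, v2, v3);
		v0 ^= m;
	}

	/* Final word: leftover bytes plus the input length in the top byte. */
	m = toy_load_le(in + i, len - i) | ((uint64_t)len << 56);
	v3 ^= m;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= m;

	/* Finalization: four rounds. */
	v2 ^= 0xff;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	return v0 ^ v1 ^ v2 ^ v3;
}
```

    Because the function accepts any input length, an IPv4 and an IPv6
    address pair can both be hashed with one call, which is the
    simplification the paragraph above refers to.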

    NOTE: Server farms backing a single server IP for load balancing purposes
    and sharing a single fastopen key will be adversely affected by
    this change unless all systems in the pool receive their kernel
    upgrades at the same time.

    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ard Biesheuvel
     

21 May, 2019

1 commit


25 Mar, 2019

1 commit

  • Some drivers are becoming more dependent on NET_DEVLINK being selected
    in configuration. With upcoming compat functions, the behavior would be
    wrong in case devlink was not compiled in. So make the drivers select
    NET_DEVLINK and rely on the functions being there, not just stubs.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

27 Feb, 2019

1 commit

  • Being able to build devlink as a module causes growing pains.
    First all drivers had to add a meta dependency to make sure
    they are not built in when devlink is built as a module. Now
    we are struggling to invoke ethtool compat code reliably.

    Make devlink code built-in; users can still choose not to build it at
    all, but the dynamically loadable module option is removed.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Florian Fainelli
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

16 Feb, 2019

1 commit

  • Lightweight tunnels are L3 constructs that are used with IP/IP6.

    For example, lwtunnel_xmit is called from ip_output.c and
    ip6_output.c only.

    Make the dependency explicit at least for LWT-BPF, as now they
    call into IP routing.

    V2: added "Reported-by" below.

    Reported-by: Randy Dunlap
    Signed-off-by: Peter Oskolkov
    Acked-by: Randy Dunlap # build-tested
    Signed-off-by: Daniel Borkmann

    Peter Oskolkov
     

20 Dec, 2018

2 commits

  • This converts the bridge netfilter (calling iptables hooks from bridge)
    facility to use the extension infrastructure.

    The bridge_nf specific hooks in skb clone and free paths are removed, they
    have been replaced by the skb_ext hooks that do the same as the bridge nf
    allocations hooks did.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This adds an optional extension infrastructure, with ipsec (xfrm) and
    bridge netfilter as first users.
    objdiff shows no changes if kernel is built without xfrm and br_netfilter
    support.

    The third (planned future) user is Multipath TCP which is still
    out-of-tree.
    MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
    numbers used by individual subflows.

    This DSS mapping is read/written from tcp option space on receive and
    written to tcp option space on transmitted tcp packets that are part of
    an MPTCP connection.

    Extending skb_shared_info or adding a private data field to skb fclones
    doesn't work for incoming skb, so a different DSS propagation method would
    be required for the receive side.

    mptcp has same requirements as secpath/bridge netfilter:

    1. extension memory is released when the sk_buff is free'd.
    2. data is shared after cloning an skb (clone inherits extension)
    3. adding extension to an skb will COW the extension buffer if needed.

    The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
    mapping for tx and rx processing.

    Two new members are added to sk_buff:
    1. 'active_extensions' byte (filling a hole), telling which extensions
    are available for this skb.
    This has two purposes.
    a) avoids the need to initialize the pointer.
    b) allows to "delete" an extension by clearing its bit
    value in ->active_extensions.

    While it would be possible to store the active_extensions byte
    in the extension struct instead of sk_buff, there is one problem
    with this:
    When an extension has to be disabled, we can always clear the
    bit in skb->active_extensions. But in case it would be stored in the
    extension buffer itself, we might have to COW it first, if
    we are dealing with a cloned skb. On kmalloc failure we would
    be unable to turn an extension off.

    2. extension pointer, located at the end of the sk_buff.
    If the active_extensions byte is 0, the pointer is undefined,
    it is not initialized on skb allocation.
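
    The interplay of the two members can be sketched in user-space C. The
    names below (toy_skb, toy_ext_add, toy_ext_del, the extension ids) are
    hypothetical and the sketch leaves out allocation, COW and refcounting;
    it only shows that an extension can be switched off by clearing its bit,
    without touching the extension buffer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension ids, one bit each in active_extensions. */
enum toy_ext_id {
	TOY_EXT_BRIDGE_NF,
	TOY_EXT_SEC_PATH,
	TOY_EXT_NUM,
};

/* Hypothetical, cut-down stand-in for the two new sk_buff members. */
struct toy_skb {
	uint8_t active_extensions; /* bitmask of enabled extensions */
	void *extensions;          /* undefined while the mask is 0 */
};

static bool toy_ext_exist(const struct toy_skb *skb, enum toy_ext_id id)
{
	return (skb->active_extensions & (1u << id)) != 0;
}

/* Attach a buffer and mark the extension as active. */
static void toy_ext_add(struct toy_skb *skb, enum toy_ext_id id, void *buf)
{
	skb->extensions = buf;
	skb->active_extensions |= (uint8_t)(1u << id);
}

/* "Delete" an extension: only the bit in the skb changes, so this cannot
 * fail even if the buffer is shared with a clone. */
static void toy_ext_del(struct toy_skb *skb, enum toy_ext_id id)
{
	skb->active_extensions &= (uint8_t)~(1u << id);
}
```

    Since toy_ext_del() never writes to the shared buffer, it needs no
    allocation, which mirrors the kmalloc-failure argument made above.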

    This adds extra code to skb clone and free paths (to deal with
    refcount/free of extension area) but this replaces similar code that
    manages skb->nf_bridge and skb->sp structs in the followup patches of
    the series.

    It is possible to add support for extensions that are not preserved on
    clones/copies.

    To do this, it would be needed to define a bitmask of all extensions that
    need copy/cow semantics, and change __skb_ext_copy() to check
    ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
    ->active_extensions to 0 on the new clone.

    This isn't done here because all extensions that get added here
    need the copy/cow semantics.

    v2:
    Allocate entire extension space using kmem_cache.
    Upside is that this allows better tracking of used memory,
    downside is that we will allocate more space than strictly needed in
    most cases (it's unlikely that all extensions are active/needed at same
    time for same skb).
    The allocated memory (except the small extension header) is not cleared,
    so no additional overhead aside from memory usage.

    Avoid atomic_dec_and_test operation on skb_ext_put()
    by using similar trick as kfree_skbmem() does with fclone_ref:
    If refcount is 1, there is no concurrent user and we can free right away.
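
    The refcount shortcut mentioned above can be sketched with C11 atomics.
    The toy_ext type and toy_ext_put() name are hypothetical; this is a
    user-space illustration of the idea, not the kernel's skb_ext_put().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical extension header carrying only a reference count. */
struct toy_ext {
	atomic_int refcnt;
};

/* Drop one reference. Returns true when the caller held the last
 * reference and should free the extension area. */
static bool toy_ext_put(struct toy_ext *ext)
{
	/* Sole owner: nobody else can take or drop a reference, so a
	 * plain read is enough and the atomic decrement is skipped. */
	if (atomic_load(&ext->refcnt) == 1)
		return true;

	/* Shared: fall back to the atomic decrement-and-test. */
	return atomic_fetch_sub(&ext->refcnt, 1) == 1;
}
```

    The fast path only pays for a load; the read-modify-write is needed
    solely when the extension is actually shared.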

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

16 Oct, 2018

1 commit

  • Add a generic sk_msg layer, and convert current sockmap and later
    kTLS over to make use of it. While sk_buff handles network packet
    representation from netdevice up to socket, sk_msg handles data
    representation from application to socket layer.

    This means that sk_msg framework spans across ULP users in the
    kernel, and enables features such as introspection or filtering
    of data with the help of BPF programs that operate on this data
    structure.

    Latter becomes in particular useful for kTLS where data encryption
    is deferred into the kernel, and as such enabling the kernel to
    perform L7 introspection and policy based on BPF for TLS connections
    where the record is being encrypted after BPF has run and came to
    a verdict. In order to get there, first step is to transform open
    coding of scatter-gather list handling into a common core framework
    that subsystems can use.

    The code itself has been split and refactored into three bigger
    pieces: i) the generic sk_msg API which deals with managing the
    scatter gather ring, providing helpers for walking and mangling,
    transferring application data from user space into it, and preparing
    it for BPF pre/post-processing, ii) the plain sock map itself
    where sockets can be attached to or detached from; these bits
    are independent of i) which can now be used also without sock
    map, and iii) the integration with plain TCP as one protocol
    to be used for processing L7 application data (later this could
    e.g. also be extended to other protocols like UDP). The semantics
    are the same with the old sock map code and therefore no change
    of user facing behavior or APIs. While pursuing this work it
    also helped finding a number of bugs in the old sockmap code
    that we've fixed already in earlier commits. The test_sockmap
    kselftest suite passes through fine as well.

    Joint work with John.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

25 Jul, 2018

1 commit


29 May, 2018

1 commit

  • The failover module provides a generic interface for paravirtual drivers
    to register a netdev and a set of ops with a failover instance. The ops
    are used as event handlers that get called to handle netdev register/
    unregister/link change/name change events on slave pci ethernet devices
    with the same mac address as the failover netdev.

    This enables paravirtual drivers to use a VF as an accelerated low latency
    datapath. It also allows migration of VMs with direct attached VFs by
    failing over to the paravirtual datapath when the VF is unplugged.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

24 May, 2018

1 commit

  • bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
    and user mode helper code that is embedded into bpfilter.ko

    The steps to build bpfilter.ko are the following:
    - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
    - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
    is converted into bpfilter_umh.o object file
    with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
    Example:
    $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
    0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
    0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
    0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
    - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

    bpfilter_kern.c is a normal kernel module code that calls
    the fork_usermode_blob() helper to execute part of its own data
    as a user mode process.

    Notice that _binary_net_bpfilter_bpfilter_umh_start - end
    is placed into .init.rodata section, so it's freed as soon as __init
    function of bpfilter.ko is finished.
    As part of __init, bpfilter.ko does a first request/reply action
    via the two unix pipes provided by the fork_usermode_blob() helper to
    make sure that umh is healthy. If not, it will kill it via pid.

    Later bpfilter_process_sockopt() will be called from bpfilter hooks
    in get/setsockopt() to pass iptable commands into umh via bpfilter.ko

    If admin does 'rmmod bpfilter', the __exit code of bpfilter.ko will
    kill umh as well.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 May, 2018

1 commit


04 May, 2018

1 commit


01 May, 2018

1 commit


17 Apr, 2018

1 commit

  • Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
    pages on DMA-TX completion time, which have good cross CPU
    performance, given DMA-TX completion time can happen on a remote CPU.

    Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
    Adapted page_pool code to not depend on the page allocator or on
    integration into struct page. The DMA mapping feature is kept,
    even though it will not be activated/used in this patchset.

    [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

    V2: Adjustments requested by Tariq
    - Changed page_pool_create return codes, don't return NULL, only
    ERR_PTR, as this simplifies err handling in drivers.

    V4: many small improvements and cleanups
    - Add DOC comment section, that can be used by kernel-doc
    - Improve fallback mode, to work better with refcnt based recycling
    e.g. remove a WARN as pointed out by Tariq
    e.g. quicker fallback if ptr_ring is empty.

    V5: Fixed SPDX license as pointed out by Alexei

    V6: Adjustments requested by Eric Dumazet
    - Adjust ____cacheline_aligned_in_smp usage/placement
    - Move rcu_head in struct page_pool
    - Free pages quicker on destroy, minimize resources delayed an RCU period
    - Remove code for forward/backward compat ABI interface

    V8: Issues found by kbuild test robot
    - Address sparse should be static warnings
    - Only compile+link when a driver uses/selects page_pool;
    mlx5 selects CONFIG_PAGE_POOL, although it's first used in two patches
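
    The V2 note about returning only ERR_PTR, never NULL, can be pictured
    with a user-space sketch of the error-pointer idiom. Every name here is
    hypothetical (toy_err_ptr, toy_is_err, toy_pool_create); the helpers
    imitate the idiom rather than reproduce the kernel's macros or the
    page_pool API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the top 4095 address values are reserved for error codes. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical pool object, reduced to one field. */
struct toy_pool {
	unsigned int size;
};

/* Never returns NULL: either a valid pool or an encoded negative errno. */
static struct toy_pool *toy_pool_create(unsigned int size)
{
	struct toy_pool *pool;

	if (size == 0)
		return toy_err_ptr(-EINVAL);

	pool = malloc(sizeof(*pool));
	if (!pool)
		return toy_err_ptr(-ENOMEM);

	pool->size = size;
	return pool;
}
```

    With this contract a driver needs a single toy_is_err() check after
    creation and can forward the decoded errno, instead of handling both a
    NULL case and an error-pointer case.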

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Feb, 2018

1 commit

  • Pull staging/IIO updates from Greg KH:
    "Here is the big Staging and IIO driver patches for 4.16-rc1.

    There is the normal amount of new IIO drivers added, like all
    releases.

    The networking IPX and the ncpfs filesystem are moved into the staging
    tree, as they are on their way out of the kernel due to lack of use
    anymore.

    The visorbus subsystem finally has started moving out of the staging
    tree to the "real" part of the kernel, and the most and fsl-mc
    codebases are almost ready to move out, that will probably happen for
    4.17-rc1 if all goes well.

    Other than that, there is a bunch of license header cleanups in the
    tree, along with the normal amount of coding style churn that we all
    know and love for this codebase. I also got frustrated at the
    Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
    huge chunks of it that were never even being used.

    Full details of everything is in the shortlog.

    All of these patches have been in linux-next for a while with no
    reported issues"

    * tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
    staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
    staging: rtl8723bs: remove a couple of redundant initializations
    staging: comedi: reformat lines to 80 chars or less
    staging: lustre: separate a connection destroy from free struct kib_conn
    Staging: rtl8723bs: Use !x instead of NULL comparison
    Staging: rtl8723bs: Remove dead code
    Staging: rtl8723bs: Change names to conform to the kernel code
    staging: ccree: Fix missing blank line after declaration
    staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
    staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
    staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
    staging: comedi: dt2811: remove redundant initialization of 'ns'
    staging: wilc1000: fix alignments to match open parenthesis
    staging: wilc1000: removed unnecessary defined enums typedef
    staging: wilc1000: remove unnecessary use of parentheses
    staging: rtl8192u: remove redundant initialization of 'timeout'
    staging: sm750fb: fix CamelCase for dispSet var
    staging: lustre: lnet/selftest: fix compile error on UP build
    staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
    staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
    ...

    Linus Torvalds
     

09 Jan, 2018

1 commit


03 Jan, 2018

1 commit


28 Nov, 2017

1 commit


04 Sep, 2017

1 commit


30 Aug, 2017

1 commit

  • Add a new nsh/ directory. It currently holds only GSO functions but more
    will come: in particular, code shared by openvswitch and tc to manipulate
    NSH headers.

    For now, assume there's no hardware support for NSH segmentation. We can
    always introduce netdev->nsh_features later.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

29 Aug, 2017

2 commits

  • It's time to get rid of IRDA. It's long been broken, and no one seems
    to use it anymore. So move it to staging and after a while, we can
    delete it from there.

    To start, move the network irda core from net/irda to
    drivers/staging/irda/net/

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: David S. Miller

    Greg Kroah-Hartman
     
  • SOCKMAP uses strparser code (compiled with Kconfig option
    CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
    config option set, sockmap won't be compiled. However, at the moment
    the only way to pull in the strparser code is to enable KCM.

    To resolve this create a BPF specific config option to pull
    only the strparser piece in that sockmap needs. This also
    allows folks who want to use BPF/syscall/maps but don't need
    sockmap to easily opt out.

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jun, 2017

1 commit

  • Software implementation of transport layer security, implemented using ULP
    infrastructure. tcp proto_ops are replaced with tls equivalents of sendmsg and
    sendpage.

    Only symmetric crypto is done in the kernel, keys are passed by setsockopt
    after the handshake is complete. All control messages are supported via CMSG
    data - the actual symmetric encryption is the same, just the message type needs
    to be passed separately.

    For user API, please see Documentation patch.

    Pieces that can be shared between hw and sw implementation
    are in tls_main.c

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Aviad Yehezkel
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file is done via RCU list, which holds
    a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_<tag>, in order to aid
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2017

1 commit


04 Feb, 2017

1 commit

  • This module is responsible for the ife encapsulation protocol
    encode/decode logics. That module can:
    - ife_encode: encode skb and reserve space for the ife meta header
    - ife_decode: decode skb and extract the meta header size
    - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
    header space.
    - ife_tlv_meta_decode - decodes one tlv entry from the packet
    - ife_tlv_meta_next - advance to the next tlv
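
    As an illustration of what encoding and decoding one TLV entry involves,
    here is a user-space sketch. The layout is an assumption made for the
    example (16-bit big-endian type, 16-bit big-endian length that includes
    the 4-byte header, value padded to a 4-byte boundary), and the toy_*
    names are hypothetical; this message does not specify the ife wire
    format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TLV_HDR 4u /* assumed header size: 2-byte type + 2-byte length */

/* Encode one TLV into buf. Returns the padded number of bytes written,
 * or 0 if the entry does not fit into the reserved space. */
static size_t toy_tlv_encode(uint8_t *buf, size_t room, uint16_t type,
			     const void *val, uint16_t vlen)
{
	size_t len = TOY_TLV_HDR + vlen;
	size_t padded = (len + 3u) & ~(size_t)3u;

	if (len > 0xffffu || padded > room)
		return 0;

	memset(buf, 0, padded);
	buf[0] = (uint8_t)(type >> 8);
	buf[1] = (uint8_t)type;
	buf[2] = (uint8_t)(len >> 8);
	buf[3] = (uint8_t)len;
	memcpy(buf + TOY_TLV_HDR, val, vlen);
	return padded;
}

/* Decode one TLV. Returns a pointer to the value and fills in type and
 * value length, or NULL if the entry is truncated or malformed. */
static const uint8_t *toy_tlv_decode(const uint8_t *buf, size_t avail,
				     uint16_t *type, uint16_t *vlen)
{
	size_t len;

	if (avail < TOY_TLV_HDR)
		return NULL;

	len = ((size_t)buf[2] << 8) | buf[3];
	if (len < TOY_TLV_HDR || len > avail)
		return NULL;

	*type = (uint16_t)((buf[0] << 8) | buf[1]);
	*vlen = (uint16_t)(len - TOY_TLV_HDR);
	return buf + TOY_TLV_HDR;
}
```

    Advancing to the next entry then amounts to stepping forward by the
    padded length that the encoder returned.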

    Reviewed-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: Roman Mashak
    Signed-off-by: David S. Miller

    Yotam Gigi
     

25 Jan, 2017

1 commit

  • Add a general way for kernel modules to sample packets, without being tied
    to any specific subsystem. This netlink channel can be used by tc,
    iptables, etc. and allows packet sampling in the kernel to be standardized.

    For every sampled packet, the psample module adds the following metadata
    fields:

    PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable

    PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable

    PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
    truncated during sampling

    PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
    user who initiated the sampling. This field allows the user to
    differentiate between several samplers working simultaneously and
    filter packets relevant to him

    PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
    sequence is kept for each group

    PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

    PSAMPLE_ATTR_DATA - the actual packet bits

    The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
    group. In addition, add the GET_GROUPS netlink command which allows the
    user to see the current sample groups, their refcount and sequence number.
    This command currently supports only netlink dump mode.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

12 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
    not right when CONFIG_NET is disabled and there is no socket interface:

    warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

    I don't know what the correct solution for this is, but simply removing
    the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
    'if NET' section avoids the warning and does not produce other build
    errors.

    Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

10 Jan, 2017

1 commit

  • * enable smc module loading and unloading
    * register new socket family
    * basic smc socket creation and deletion
    * use backing TCP socket to run CLC (Connection Layer Control)
    handshake of SMC protocol
    * Setup for infiniband traffic is implemented in follow-on patches.
    For now fallback to TCP socket is always used.

    Signed-off-by: Ursula Braun
    Reviewed-by: Utz Bacher
    Signed-off-by: David S. Miller

    Ursula Braun
     

02 Dec, 2016

1 commit

  • Registers new BPF program types which correspond to the LWT hooks:
    - BPF_PROG_TYPE_LWT_IN => dst_input()
    - BPF_PROG_TYPE_LWT_OUT => dst_output()
    - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

    The separate program types are required to differentiate between the
    capabilities each LWT hook allows:

    * Programs attached to dst_input() or dst_output() are restricted and
    may only read the data of an skb. This prevents modification and
    possible invalidation of already validated packet headers on receive
    and the construction of illegal headers while the IP headers are
    still being assembled.

    * Programs attached to lwtunnel_xmit() are allowed to modify packet
    content as well as prepending an L2 header via a newly introduced
    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
    invoked after the IP header has been assembled completely.

    All BPF programs receive an skb with L3 headers attached and may return
    one of the following error codes:

    BPF_OK - Continue routing as per nexthop
    BPF_DROP - Drop skb and return EPERM
    BPF_REDIRECT - Redirect skb to device as per redirect() helper.
    (Only valid in lwtunnel_xmit() context)

    The return codes are binary compatible with their TC_ACT_*
    relatives to ease compatibility.
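
    As a rough illustration of the rules above, the sketch below checks
    whether a verdict is acceptable for a given hook. The enum and
    function names are hypothetical and not kernel API; the numeric
    values are an assumption chosen to mirror TC_ACT_OK (0),
    TC_ACT_SHOT (2) and TC_ACT_REDIRECT (7).

```c
#include <stdbool.h>

/* Illustrative verdict values (assumed to mirror the TC_ACT_* numbers). */
enum lwt_verdict {
	LWT_OK = 0,       /* continue routing as per nexthop */
	LWT_DROP = 2,     /* drop the skb */
	LWT_REDIRECT = 7, /* redirect the skb to another device */
};

/* The three LWT attachment points named in the commit message. */
enum lwt_hook {
	LWT_HOOK_IN,   /* dst_input() */
	LWT_HOOK_OUT,  /* dst_output() */
	LWT_HOOK_XMIT, /* lwtunnel_xmit() */
};

/* Return true if a program attached to 'hook' may return 'verdict'. */
bool lwt_verdict_valid(enum lwt_hook hook, int verdict)
{
	switch (verdict) {
	case LWT_OK:
	case LWT_DROP:
		return true;
	case LWT_REDIRECT:
		/* Redirect is only valid in lwtunnel_xmit() context. */
		return hook == LWT_HOOK_XMIT;
	default:
		return false;
	}
}
```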

    Signed-off-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Thomas Graf
     

18 Aug, 2016

1 commit

  • This patch introduces a utility for parsing application layer protocol
    messages in a TCP stream. This is a generalization of the mechanism
    implemented in the Kernel Connection Multiplexor.

    The API includes a context structure, a set of callbacks, utility
    functions, and a data ready function.

    A stream parser instance is defined by a strparser structure that
    is bound to a TCP socket. The function to initialize the structure
    is:

    int strp_init(struct strparser *strp, struct sock *csk,
    struct strp_callbacks *cb);

    csk is the TCP socket being bound to and cb are the parser callbacks.

    The upper layer calls strp_tcp_data_ready when data is ready on the lower
    socket for strparser to process. This should be called from a data_ready
    callback that is set on the socket:

    void strp_tcp_data_ready(struct strparser *strp);

    A parser is bound to a TCP socket by setting data_ready function to
    strp_tcp_data_ready so that all receive indications on the socket
    go through the parser. This assumes that sk_user_data is set to
    the strparser structure.

    There are four callbacks.
    - parse_msg is called to parse the message (returns length or error).
    - rcv_msg is called when a complete message has been received
    - read_sock_done is called when data_ready function exits
    - abort_parser is called to abort the parser

    The input to parse_msg is an skbuff which contains the next message under
    construction. The backend processing of parse_msg will parse the
    application layer protocol headers to determine the length of
    the message in the stream. The possible return values are:

    >0 : indicates length of successfully parsed message
    0 : indicates more data must be received to parse the message
    -ESTRPIPE : current message should not be processed by the
    kernel, return control of the socket to userspace which
    can proceed to read the messages itself
    other < 0 : Error in parsing, give control back to userspace
    assuming that synchronization is lost and the stream
    is unrecoverable (application expected to close TCP socket)
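
    The return convention can be modelled in user space on a flat byte
    buffer instead of an skbuff. The sketch below is not the kernel
    callback: the function name, the 4-byte big-endian length prefix
    (counting the header itself), the size limit and the choice of
    EBADMSG are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 4u          /* hypothetical: 4-byte big-endian length prefix */
#define MAX_MSG_LEN 65536u  /* hypothetical upper bound on one message */

/*
 * Models the parse_msg return values:
 *   > 0  total length of the message (header included)
 *   0    header incomplete, more data must be received
 *   < 0  framing error, the stream is treated as unrecoverable
 */
int parse_msg_len(const uint8_t *buf, size_t avail)
{
	uint32_t total;

	if (avail < HDR_LEN)
		return 0;

	total = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	/* A length shorter than the header or above the limit is invalid. */
	if (total < HDR_LEN || total > MAX_MSG_LEN)
		return -EBADMSG;

	return (int)total;
}
```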

    In the case of error return (< 0) strparser will stop the parser
    and report an error to userspace. The application must deal
    with the error. To handle the error the strparser is unbound
    from the TCP socket. If the error indicates that the TCP stream
    is at a recoverable point (ESTRPIPE) then the application
    can read the TCP socket to process the stream. Once the application
    has dealt with the exceptions in the stream, it may again bind the
    socket to a strparser to continue data operations.

    Note that ENODATA may be returned to the application. In this case
    parse_msg returned -ESTRPIPE, however strparser was unable to maintain
    synchronization of the stream (i.e. some of the message in question
    was already read by the parser).

    strp_pause and strp_unpause are used to provide flow control. For
    instance, if rcv_msg is called but the upper layer can't immediately
    consume the message it can hold the message and pause strparser.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

20 Jul, 2016

1 commit

  • NCSI spec (DSP0222) defines several objects: package, channel, mode,
    filter, version and statistics etc. This introduces the data structs
    to represent those objects and implements functions to manage them.
    Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
    stack.

    * The user (e.g. netdev driver) references the NCSI device through
    "struct ncsi_dev", which is embedded in "struct ncsi_dev_priv".
    The latter is used by the NCSI stack internally.
    * Every NCSI device can have multiple packages simultaneously, up
    to 8 packages. Each is represented by "struct ncsi_package" and
    identified by a 3-bit ID.
    * Every NCSI package can have multiple channels, up to 32. Each is
    represented by "struct ncsi_channel" and identified by a 5-bit ID.
    * Every NCSI channel has version, statistics, various modes and
    filters. They are represented by "struct ncsi_channel_version",
    "struct ncsi_channel_stats", "struct ncsi_channel_mode" and
    "struct ncsi_channel_filter" respectively.
    * Apart from AEN (Asynchronous Event Notification), the NCSI stack
    works in terms of command and response. This introduces "struct
    ncsi_req" to represent a complete NCSI transaction made of NCSI
    request and response.
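
    A 3-bit package ID and a 5-bit channel ID fit together in a single
    byte. The sketch below shows one way to pack and unpack such an
    identifier. The helper names are hypothetical, and placing the
    package ID in the upper bits is an assumption, not something stated
    in this commit message.

```c
#include <stdint.h>

#define PKG_BITS 3u   /* up to 8 packages per device */
#define CHAN_BITS 5u  /* up to 32 channels per package */
#define PKG_MASK ((1u << PKG_BITS) - 1u)
#define CHAN_MASK ((1u << CHAN_BITS) - 1u)

/* Pack a package ID (upper 3 bits) and channel ID (lower 5 bits). */
uint8_t ncsi_make_id(unsigned int package, unsigned int channel)
{
	return (uint8_t)(((package & PKG_MASK) << CHAN_BITS) |
			 (channel & CHAN_MASK));
}

/* Extract the 3-bit package ID from a packed identifier. */
unsigned int ncsi_id_package(uint8_t id)
{
	return (id >> CHAN_BITS) & PKG_MASK;
}

/* Extract the 5-bit channel ID from a packed identifier. */
unsigned int ncsi_id_channel(uint8_t id)
{
	return id & CHAN_MASK;
}
```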

    link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
    Signed-off-by: Gavin Shan
    Acked-by: Joel Stanley
    Signed-off-by: David S. Miller

    Gavin Shan
     

17 May, 2016

2 commits

  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo for quite some time now to further make spraying harder, with a
    first implementation existing since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), with trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. No further
    hardening levels are intended for the bpf_jit_harden switch; the
    rationale is to have it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.
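
    The policy described above can be summarized as a small decision
    function. This is a user-space sketch with a hypothetical name and
    parameters, not the kernel's own check; it only encodes the three
    bpf_jit_harden levels as described in this message.

```c
#include <stdbool.h>

/*
 * Decide whether constants should be blinded for a program.
 *   jit_enable  models the bpf_jit_enable sysctl (0 = JIT off)
 *   jit_harden  models the bpf_jit_harden sysctl (0, 1 or 2)
 *   privileged  whether the loading user is privileged
 */
bool jit_should_blind(int jit_enable, int jit_harden, bool privileged)
{
	if (!jit_enable)
		return false;       /* nothing is JITed, nothing to blind */
	if (jit_harden == 0)
		return false;       /* hardening disabled */
	if (jit_harden == 1)
		return !privileged; /* blind for unprivileged users only */
	return true;                /* 2: blind for all users */
}
```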

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, so that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. This is
    similar to how JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. a manually written
    program can issue loads to all registers of the eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <OP> BPF_REG_AX, so actual operation on the target register
    is translated from BPF_K into BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. The only basic thing required of JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.
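
    The three-step rewrite can be emulated in plain C for a single add
    with an immediate operand. The function below is only a sketch with
    a hypothetical name; 'ax' stands in for BPF_REG_AX and 'rnd' for the
    per-instruction random value.

```c
#include <stdint.h>

/* Emulates "REG += K_VAL" (a BPF_K add) after constant blinding. */
uint32_t blinded_add(uint32_t reg, uint32_t k_val, uint32_t rnd)
{
	/* Computed at rewrite time; only this value and rnd end up in the image. */
	uint32_t blinded_imm = rnd ^ k_val;
	uint32_t ax;

	ax = blinded_imm; /* BPF_REG_AX := (RND ^ K_VAL) */
	ax ^= rnd;        /* BPF_REG_AX ^= RND, ax now holds K_VAL again */
	return reg + ax;  /* REG += BPF_REG_AX, the BPF_X form of the add */
}
```

    Because XOR with the same value twice is the identity, the result
    equals reg + k_val for any rnd, while k_val itself never appears as
    an immediate. The same relation is visible in the hardened x86
    extract that follows: 0xe1192563 ^ 0x4989b5f3 == 0xa8909090.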

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.

    Current cBPF ones:

    # git grep -n HAVE_CBPF_JIT arch/
    arch/arm/Kconfig:44: select HAVE_CBPF_JIT
    arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
    arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
    arch/sparc/Kconfig:35: select HAVE_CBPF_JIT

    Current eBPF ones:

    # git grep -n HAVE_EBPF_JIT arch/
    arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
    arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
    arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64

    Later code also needs this facility to check for eBPF JITs.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann