25 Mar, 2015

1 commit

  • If vlan offloading takes place then the vlan header is removed from the
    frame and its contents, both vlan_tci and vlan_proto, are available to
    user space via the TPACKET interface. However, only vlan_tci can be used
    in BPF filters.

    This commit introduces a new BPF extension. It makes it possible to load
    the value of vlan_proto (vlan TPID) into register A. Support for classic
    BPF and eBPF is being added, analogous to skb->protocol.

    Cc: Daniel Borkmann
    Cc: Alexei Starovoitov
    Cc: Jiri Pirko

    Signed-off-by: Michal Sekletar
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Michal Sekletar
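
As an illustration of the extension described in the commit above, the following is a minimal sketch of a classic BPF filter that loads the VLAN TPID into register A and accepts only 802.1Q frames. It assumes the UAPI header exposes the extension as SKF_AD_VLAN_TPID (with a fallback value of 60 for older headers) and that the loaded value is in host byte order, as with skb->protocol. The names vlan_tpid_filter and vlan_tpid_prog are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>    /* struct sock_filter, BPF_STMT, BPF_JUMP, SKF_AD_* */
#include <linux/if_ether.h>  /* ETH_P_8021Q */

/* Assumption: older headers may predate the extension; 60 is its offset. */
#ifndef SKF_AD_VLAN_TPID
#define SKF_AD_VLAN_TPID 60
#endif

/* Accept frames whose offloaded VLAN TPID is 802.1Q (0x8100), drop the rest. */
static struct sock_filter vlan_tpid_filter[] = {
    /* A = vlan_proto of the packet (ancillary load) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TPID),
    /* if (A == ETH_P_8021Q) continue with accept, else skip one insn to drop */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_8021Q, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),  /* accept the whole packet */
    BPF_STMT(BPF_RET | BPF_K, 0),           /* drop */
};

static const size_t vlan_tpid_filter_len =
    sizeof(vlan_tpid_filter) / sizeof(vlan_tpid_filter[0]);

/* Wrap the filter in the structure expected by SO_ATTACH_FILTER. */
static struct sock_fprog vlan_tpid_prog(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)vlan_tpid_filter_len,
        .filter = vlan_tpid_filter,
    };

    assert(vlan_tpid_filter_len <= BPF_MAXINSNS);
    return prog;
}
```

The resulting struct sock_fprog could then be passed to setsockopt() with SO_ATTACH_FILTER on a packet socket; that step is omitted here because it needs privileges.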
     

14 Jan, 2015

1 commit


11 Oct, 2014

1 commit


09 Oct, 2014

1 commit

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the driver to start processing the new TX queue entries.

    skb->xmit_more means that the generic networking is guaranteed to
    call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling of mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so expect more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF now can load programs via a system call and has an extensive
    testsuite. Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, from
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
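
Item 1 of the merge message above describes how a driver may defer its doorbell while skb->xmit_more is set. The following is only a user-space model of that batching idea, not driver code: struct toy_tx_ring, toy_start_xmit() and toy_send_burst() are hypothetical names, and a real driver would also have to ring the doorbell when its queue is stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a NIC TX ring; not a kernel structure. */
struct toy_tx_ring {
    unsigned int queued;     /* descriptors written but not yet announced */
    unsigned int doorbells;  /* number of (expensive) doorbell writes */
    unsigned int sent;       /* descriptors the device has been told about */
};

/* Models what a transmit hook does for one packet. */
static void toy_start_xmit(struct toy_tx_ring *ring, bool xmit_more)
{
    assert(ring != NULL);
    ring->queued++;              /* place the packet in the TX ring */
    if (xmit_more)
        return;                  /* another packet follows: defer the doorbell */
    ring->sent += ring->queued;  /* one doorbell covers the whole batch */
    ring->queued = 0;
    ring->doorbells++;
}

/* Send a burst of n packets, flagging all but the last with xmit_more. */
static void toy_send_burst(struct toy_tx_ring *ring, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_start_xmit(ring, i + 1 < n);
}
```

In this model a burst of eight packets costs one doorbell write instead of eight, which is the saving the multi-send work is after.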
     

08 Oct, 2014

1 commit

  • Pull arm64 updates from Catalin Marinas:
    - eBPF JIT compiler for arm64
    - CPU suspend backend for PSCI (firmware interface) with standard idle
    states defined in DT (generic idle driver to be merged via a
    different tree)
    - Support for CONFIG_DEBUG_SET_MODULE_RONX
    - Support for unmapped cpu-release-addr (outside kernel linear mapping)
    - set_arch_dma_coherent_ops() implemented and bus notifiers removed
    - EFI_STUB improvements when base of DRAM is occupied
    - Typos in KGDB macros
    - Clean-up to (partially) allow kernel building with LLVM
    - Other clean-ups (extern keyword, phys_addr_t usage)

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (51 commits)
    arm64: Remove unneeded extern keyword
    ARM64: make of_device_ids const
    arm64: Use phys_addr_t type for physical address
    aarch64: filter $x from kallsyms
    arm64: Use DMA_ERROR_CODE to denote failed allocation
    arm64: Fix typos in KGDB macros
    arm64: insn: Add return statements after BUG_ON()
    arm64: debug: don't re-enable debug exceptions on return from el1_dbg
    Revert "arm64: dmi: Add SMBIOS/DMI support"
    arm64: Implement set_arch_dma_coherent_ops() to replace bus notifiers
    of: amba: use of_dma_configure for AMBA devices
    arm64: dmi: Add SMBIOS/DMI support
    arm64: Correct ftrace calls to aarch64_insn_gen_branch_imm()
    arm64:mm: initialize max_mapnr using function set_max_mapnr
    setup: Move unmask of async interrupts after possible earlycon setup
    arm64: LLVMLinux: Fix inline arm64 assembly for use with clang
    arm64: pageattr: Correctly adjust unaligned start addresses
    net: bpf: arm64: fix module memory leak when JIT image build fails
    arm64: add PSCI CPU_SUSPEND based cpu_suspend support
    arm64: kernel: introduce cpu_init_idle CPU operation
    ...

    Linus Torvalds
     

27 Sep, 2014

2 commits

  • this patch adds all of the eBPF verifier documentation and an empty bpf_check()

    The end goal for the verifier is to statically check safety of the program.

    Verifier will catch:
    - loops
    - out of range jumps
    - unreachable instructions
    - invalid instructions
    - uninitialized register access
    - uninitialized stack access
    - misaligned stack access
    - out of range stack access
    - invalid calling convention

    More details in Documentation/networking/filter.txt

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
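
To make two of the checks listed in the commit above concrete (loops and out-of-range jumps), here is a toy sketch over a deliberately simplified instruction type. It is not the kernel verifier: struct toy_insn and toy_check_jumps() are hypothetical, and treating every backward jump as a loop is a conservative simplification chosen for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified instruction: not the eBPF encoding. */
struct toy_insn {
    bool is_jump;  /* true for a jump instruction */
    int off;       /* jump offset, relative to the next instruction */
};

enum toy_verdict { TOY_OK, TOY_JUMP_OUT_OF_RANGE, TOY_LOOP };

/* Reject jumps that leave the program or that go backwards. */
static enum toy_verdict toy_check_jumps(const struct toy_insn *prog, size_t len)
{
    assert(prog != NULL);
    for (size_t i = 0; i < len; i++) {
        if (!prog[i].is_jump)
            continue;

        long target = (long)i + 1 + prog[i].off;

        if (target < 0 || target >= (long)len)
            return TOY_JUMP_OUT_OF_RANGE;
        if (target <= (long)i)
            return TOY_LOOP;  /* backward (or self) jump could loop forever */
    }
    return TOY_OK;
}
```

Because every accepted jump moves strictly forward, any program that passes this check terminates after at most len instructions.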
     
  • BPF syscall is a multiplexor for a range of different operations on eBPF.
    This patch introduces syscall with single command to create a map.
    Next patch adds commands to access maps.

    'maps' is a generic storage of different types for sharing data between kernel
    and userspace.

    Userspace example:
    /* this syscall wrapper creates a map with given type and attributes
     * and returns map_fd on success.
     * use close(map_fd) to delete the map
     */
    int bpf_create_map(enum bpf_map_type map_type, int key_size,
                       int value_size, int max_entries)
    {
            union bpf_attr attr = {
                    .map_type = map_type,
                    .key_size = key_size,
                    .value_size = value_size,
                    .max_entries = max_entries
            };

            return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    }

    'union bpf_attr' is backwards compatible with future extensions.

    More details in Documentation/networking/filter.txt and in manpage

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
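
The userspace example in the commit above calls a bpf() function without defining it. The following sketch fills that gap under the assumption that the C library provides no bpf() wrapper, so the call goes through syscall(2) with __NR_bpf. The helper map_create_demo() is hypothetical and only shows the create/close life cycle; on a system without the required privileges the call is expected to fail with an error such as EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper around the multiplexing bpf system call. */
static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    assert(attr != NULL);
    return (int)syscall(__NR_bpf, cmd, attr, size);
}

static int bpf_create_map(enum bpf_map_type map_type, unsigned int key_size,
                          unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    /* Zero the whole union so that fields unused by this command stay zero. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = map_type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Create a small hash map and delete it again; returns 0 or -errno. */
static int map_create_demo(void)
{
    int fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(long), 16);

    if (fd < 0)
        return -errno;
    close(fd);  /* closing the last reference deletes the map */
    return 0;
}
```

Zeroing the union with memset() rather than relying on a partial initializer is a defensive choice, since padding and fields belonging to other commands are then guaranteed to be zero.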
     

24 Sep, 2014

1 commit


11 Sep, 2014

1 commit


10 Sep, 2014

1 commit

  • add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
    All previous instructions were 8-byte. This is the first 16-byte instruction.
    Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
      insn[0].code = BPF_LD | BPF_DW | BPF_IMM
      insn[0].dst_reg = destination register
      insn[0].imm = lower 32-bit
      insn[1].code = 0
      insn[1].imm = upper 32-bit
    All unused fields must be zero.

    Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
    which loads 32-bit immediate value into a register.

    x64 JITs it as a single 'movabsq %rax, imm64'
    arm64 may JIT it as a sequence of four 'movk x0, #imm16, lsl #shift' insns

    Note that old eBPF programs are binary compatible with new interpreter.

    It helps eBPF programs load 64-bit constant into a register with one
    instruction instead of using two registers and 4 instructions:
    BPF_MOV32_IMM(R1, imm32)
    BPF_ALU64_IMM(BPF_LSH, R1, 32)
    BPF_MOV32_IMM(R2, imm32)
    BPF_ALU64_REG(BPF_OR, R1, R2)

    User space generated programs will use this instruction to load constants only.

    To tell kernel that user space needs a pointer the _pseudo_ variant of
    this instruction may be added later, which will use extra bits of encoding
    to indicate what type of pointer user space is asking kernel to provide.
    For example 'off' or 'src_reg' fields can be used for such purpose.
    src_reg = 1 could mean that user space is asking kernel to validate and
    load in-kernel map pointer.
    src_reg = 2 could mean that user space needs readonly data section pointer
    src_reg = 3 could mean that user space needs a pointer to per-cpu local data
    All such future pseudo instructions will not be carrying the actual pointer
    as part of the instruction, but rather will be treated as a request to kernel
    to provide one. The kernel will verify the request_for_a_pointer, then
    will drop _pseudo_ marking and will store actual internal pointer inside
    the instruction, so the end result is the interpreter and JITs never
    see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
    loads 64-bit immediate into a register. User space never operates on direct
    pointers and verifier can easily recognize request_for_pointer vs other
    instructions.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
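
The two-slot encoding described in the commit above can be sketched in user space with the struct bpf_insn definition from the UAPI header. The helper names emit_ld_imm64() and decode_ld_imm64() are hypothetical; they only show how the 64-bit constant is split across the two imm fields and put back together.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/bpf.h>  /* struct bpf_insn, BPF_LD, BPF_DW, BPF_IMM */

/* Encode "dst_reg = value" as one 16-byte BPF_LD_IMM64 instruction. */
static void emit_ld_imm64(struct bpf_insn insn[2], uint8_t dst_reg,
                          uint64_t value)
{
    assert(dst_reg < 16);                  /* dst_reg is a 4-bit field */
    memset(insn, 0, 2 * sizeof(insn[0]));  /* all unused fields must be zero */
    insn[0].code = BPF_LD | BPF_DW | BPF_IMM;
    insn[0].dst_reg = dst_reg;
    insn[0].imm = (int32_t)(uint32_t)value;          /* lower 32 bits */
    insn[1].imm = (int32_t)(uint32_t)(value >> 32);  /* upper 32 bits */
}

/* Reassemble the 64-bit constant from the two instruction slots. */
static uint64_t decode_ld_imm64(const struct bpf_insn insn[2])
{
    return (uint64_t)(uint32_t)insn[0].imm |
           ((uint64_t)(uint32_t)insn[1].imm << 32);
}
```

The casts through uint32_t matter when decoding: imm is a signed field, so widening it directly would sign-extend the lower half into the upper 32 bits.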
     

08 Sep, 2014

1 commit

  • The JIT compiler emits A64 instructions. It supports eBPF only.
    Legacy BPF is supported thanks to conversion by BPF core.

    JIT is enabled in the same way as for other architectures:

    echo 1 > /proc/sys/net/core/bpf_jit_enable

    Or for additional compiler output:

    echo 2 > /proc/sys/net/core/bpf_jit_enable

    See Documentation/networking/filter.txt for more information.

    The implementation passes all 57 tests in lib/test_bpf.c
    on ARMv8 Foundation Model :) Also tested by Will on Juno platform.

    Signed-off-by: Zi Shen Lim
    Acked-by: Alexei Starovoitov
    Acked-by: Will Deacon
    Signed-off-by: Will Deacon

    Zi Shen Lim
     

03 Aug, 2014

2 commits

  • clean up names related to socket filtering and bpf in the following way:
    - everything that deals with sockets keeps 'sk_*' prefix
    - everything that is pure BPF is changed to 'bpf_*' prefix

    split 'struct sk_filter' into
    struct sk_filter {
            atomic_t                refcnt;
            struct rcu_head         rcu;
            struct bpf_prog         *prog;
    };
    and
    struct bpf_prog {
            u32                     jited:1,
                                    len:31;
            struct sock_fprog_kern  *orig_prog;
            unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                const struct bpf_insn *filter);
            union {
                    struct sock_filter      insns[0];
                    struct bpf_insn         insnsi[0];
                    struct work_struct      work;
            };
    };
    so that 'struct bpf_prog' can be used independent of sockets and cleans up
    'unattached' bpf use cases

    split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

    __sk_filter_release(struct sk_filter *) gains
    __bpf_prog_release(struct bpf_prog *) helper function

    also perform related renames for the functions that work
    with 'struct bpf_prog *', since they're on the same lines:

    sk_filter_size -> bpf_prog_size
    sk_filter_select_runtime -> bpf_prog_select_runtime
    sk_filter_free -> bpf_prog_free
    sk_unattached_filter_create -> bpf_prog_create
    sk_unattached_filter_destroy -> bpf_prog_destroy
    sk_store_orig_filter -> bpf_prog_store_orig_filter
    sk_release_orig_filter -> bpf_release_orig_filter
    __sk_migrate_filter -> bpf_migrate_filter
    __sk_prepare_filter -> bpf_prepare_filter

    API for attaching classic BPF to a socket stays the same:
    sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
    and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
    which is used by sockets, tun, af_packet

    API for 'unattached' BPF programs becomes:
    bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
    and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
    which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • trivial rename to indicate that this function performs classic BPF checking

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

12 Jun, 2014

2 commits


11 Jun, 2014

1 commit

  • The macro 'A' used in internal BPF interpreter:
    #define A regs[insn->a_reg]
    was easily confused with the name of classic BPF register 'A', since
    'A' would mean two different things depending on context.

    This patch is trying to clean up the naming and clarify its usage in the
    following way:

    - A and X are names of two classic BPF registers

    - BPF_REG_A denotes internal BPF register R0 used to map classic register A
    in internal BPF programs generated from classic

    - BPF_REG_X denotes internal BPF register R7 used to map classic register X
    in internal BPF programs generated from classic

    - internal BPF instruction format:
    struct sock_filter_int {
            __u8    code;           /* opcode */
            __u8    dst_reg:4;      /* dest register */
            __u8    src_reg:4;      /* source register */
            __s16   off;            /* signed offset */
            __s32   imm;            /* signed immediate constant */
    };

    - BPF_X/BPF_K is 1 bit used to encode source operand of instruction
    In classic:
    BPF_X - means use register X as source operand
    BPF_K - means use 32-bit immediate as source operand
    In internal:
    BPF_X - means use 'src_reg' register as source operand
    BPF_K - means use 32-bit immediate as source operand

    Suggested-by: Chema Gonzalez
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Chema Gonzalez
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

24 May, 2014

2 commits

  • Conflicts:
    drivers/net/bonding/bond_alb.c
    drivers/net/ethernet/altera/altera_msgdma.c
    drivers/net/ethernet/altera/altera_sgdma.c
    net/ipv6/xfrm6_output.c

    Several cases of overlapping changes.

    The xfrm6_output.c has a bug fix which overlaps the renaming
    of skb->local_df to skb->ignore_df.

    In the Altera TSE driver cases, the register access cleanups
    in net-next overlapped with bug fixes done in net.

    Similarly a bug fix to send ALB packets in the bonding driver using
    the right source address overlaps with cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Mention the recently added test suite in the documentation file.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 May, 2014

1 commit


05 May, 2014

1 commit


23 Apr, 2014

1 commit

  • Added a new ancillary load (bpf call in eBPF parlance) that produces
    a 32-bit random number. We are implementing it as an ancillary load
    (instead of an ISA opcode) because (a) it is simpler, (b) allows easy
    JITing, and (c) seems more in line with generic ISAs that do not have
    "get a random number" as an instruction, but as an OS call.

    The main use for this ancillary load is to perform random packet sampling.

    Signed-off-by: Chema Gonzalez
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Chema Gonzalez
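
The random packet sampling use case mentioned in the commit above could look roughly like the following classic BPF filter, which keeps about one packet in four. It assumes the UAPI header exposes the ancillary load as SKF_AD_RANDOM (with a fallback value of 56 for older headers). SAMPLE_ONE_IN, sample_filter and sample_verdict() are hypothetical names; the last one is a user-space model of the decision, not something the kernel runs.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>  /* struct sock_filter, BPF_STMT, BPF_JUMP, SKF_AD_* */

/* Assumption: older headers may predate the extension; 56 is its offset. */
#ifndef SKF_AD_RANDOM
#define SKF_AD_RANDOM 56
#endif

#define SAMPLE_ONE_IN 4  /* hypothetical rate: keep roughly 1 packet in 4 */

static struct sock_filter sample_filter[] = {
    /* A = 32-bit random number (ancillary load) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_RANDOM),
    /* A = A % SAMPLE_ONE_IN */
    BPF_STMT(BPF_ALU | BPF_MOD | BPF_K, SAMPLE_ONE_IN),
    /* if (A == 0) continue with accept, else skip one insn to drop */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),  /* accept the whole packet */
    BPF_STMT(BPF_RET | BPF_K, 0),           /* drop */
};

/* User-space model of the filter's verdict for a given random value. */
static unsigned int sample_verdict(unsigned int rnd)
{
    assert(SAMPLE_ONE_IN != 0);  /* a modulus of zero would be invalid */
    return (rnd % SAMPLE_ONE_IN == 0) ? 0xffffffffu : 0u;
}
```

Changing SAMPLE_ONE_IN adjusts the sampling rate without touching the structure of the program.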
     

31 Mar, 2014

1 commit


12 Dec, 2013

1 commit

  • This patch significantly updates the BPF documentation and describes
    its internal architecture, Linux extensions, and handling of the
    kernel's BPF and JIT engine, plus documents how development can be
    facilitated with the help of bpf_dbg, bpf_asm, bpf_jit_disasm.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

17 Jan, 2013

1 commit

  • While a privileged program can open a raw socket, attach some
    restrictive filter and drop its privileges (or send the socket to an
    unprivileged program through some Unix socket), the filter can still
    be removed or modified by the unprivileged program. This commit adds a
    socket option to lock the filter (SO_LOCK_FILTER) preventing any
    modification of a socket filter program.

    This is similar to the OpenBSD BIOCLOCK ioctl on bpf sockets, except that
    even root is not allowed to change/drop the filter.

    The state of the lock can be read with getsockopt(). No error is
    triggered if the state is not changed. -EPERM is returned when a user
    tries to remove the lock or to change/remove the filter while the lock
    is active. The check is done directly in sk_attach_filter() and
    sk_detach_filter(), so it is not limited to the setsockopt() syscall.

    Signed-off-by: Vincent Bernat
    Signed-off-by: David S. Miller

    Vincent Bernat