28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPv4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

20 Jul, 2016

3 commits

  • For forwarding to be effective, XDP programs should be allowed to
    rewrite packet data.

    This requires that drivers supporting XDP map the packet memory as
    DMA_TO_DEVICE or DMA_BIDIRECTIONAL before invoking the program (a
    minimal program sketch follows below).

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
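
    A minimal, hedged sketch of an XDP program that makes use of this write
    access (struct xdp_md and the XDP_* return codes come from the following
    entry); the section attribute and the MAC-swap use case are illustrative
    conventions, not part of the commit itself:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>

    __attribute__((section("xdp"), used))
    int xdp_mac_swap(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        unsigned char tmp[ETH_ALEN];

        /* The verifier requires an explicit bounds check before any access. */
        if (data + sizeof(*eth) > data_end)
            return XDP_DROP;

        /* Rewrite packet data in place: swap source and destination MACs. */
        __builtin_memcpy(tmp, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, tmp, ETH_ALEN);

        /* Mirror the modified frame back out the same interface. */
        return XDP_TX;
    }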
     
  • Add a new bpf prog type that is intended to run in early stages of the
    packet rx path. Only minimal packet metadata will be available, hence a
    new context type, struct xdp_md, is exposed to userspace. So far it only
    exposes the packet start and end pointers, and only in read mode (the
    new user-visible pieces are sketched below).

    An XDP program must return one of the well-known enum values; all other
    return codes are reserved for future use. Unfortunately, this
    restriction is hard to enforce at verification time, so take the
    approach of warning at runtime when such programs are encountered. Out
    of bounds return codes should alias to XDP_ABORTED.

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
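
    The shape of the new user-visible pieces as described above; later
    kernels extend struct xdp_md with further fields, so treat this as a
    sketch of the initial layout rather than the current UAPI:

    struct xdp_md {
        __u32 data;           /* offset of the first byte of packet data */
        __u32 data_end;       /* offset one past the last valid byte */
    };

    enum xdp_action {
        XDP_ABORTED = 0,      /* error; out-of-range return codes alias here */
        XDP_DROP,
        XDP_PASS,
        XDP_TX,               /* bounce the frame out the same interface */
    };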
     
  • A subsystem may need to store many copies of a bpf program, each
    deserving its own reference. Rather than requiring the caller to loop
    one by one (with possible mid-loop failure), add a bulk bpf_prog_add
    API (a driver-side usage sketch follows below).

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
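
    A hedged sketch of how a driver might use the bulk API, taking one
    reference per RX queue in a single call instead of looping; the driver
    structure and field names are hypothetical, only bpf_prog_add() itself
    comes from this change:

    static int drv_setup_xdp(struct drv_priv *priv, struct bpf_prog *prog)
    {
        int i;

        if (prog) {
            /* One reference per RX queue, taken in one shot; returns an
             * ERR_PTR() if the refcount would overflow. */
            prog = bpf_prog_add(prog, priv->num_rx_queues);
            if (IS_ERR(prog))
                return PTR_ERR(prog);
        }

        for (i = 0; i < priv->num_rx_queues; i++)
            rcu_assign_pointer(priv->rx_queue[i].xdp_prog, prog);

        return 0;
    }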
     

17 Jul, 2016

1 commit

  • Should have been obvious: this is only called from the bpf() syscall via
    map_update_elem(), which calls bpf_fd_array_map_update_elem() under the
    RCU read lock, and thus the allocation must also be GFP_ATOMIC, of
    course (a short illustration follows below).

    Fixes: 3b1efb196eee ("bpf, maps: flush own entries on perf map release")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
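
    A short illustration of the constraint above; this is a hypothetical
    fragment with made-up names, not the actual kernel code:

    /* The bpf(2) map_update_elem path enters here under rcu_read_lock(),
     * so any allocation on this path must not sleep. */
    static int fd_array_update_locked(struct my_map *map, u32 index, int fd)
    {
        struct my_entry *ee;

        ee = kzalloc(sizeof(*ee), GFP_ATOMIC);  /* GFP_KERNEL could sleep */
        if (!ee)
            return -ENOMEM;

        /* ... resolve fd and publish ee into the array slot ... */
        return 0;
    }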
     

16 Jul, 2016

1 commit

  • This work addresses a couple of issues bpf_skb_event_output()
    helper currently has: i) We need two copies instead of just a
    single one for the skb data when it should be part of a sample.
    The data can be non-linear and thus needs to be extracted via
    bpf_skb_load_bytes() helper first, and then copied once again
    into the ring buffer slot. ii) Since bpf_skb_load_bytes()
    currently needs to be used first, the helper needs to see a
    constant size on the passed stack buffer to make sure BPF
    verifier can do sanity checks on it during verification time.
    Thus, just passing skb->len (or any other non-constant value)
    wouldn't work, but changing bpf_skb_load_bytes() is also not
    the proper solution, since the two copies are generally still
    needed. iii) bpf_skb_load_bytes() is just for rather small
    buffers like headers, since they need to sit on the limited
    BPF stack anyway. Instead of working around this in bpf_skb_load_bytes(),
    this work improves the bpf_skb_event_output() helper to address
    all three issues at once.

    We can make use of the passed in skb context that we have in
    the helper anyway, and use some of the reserved flag bits as
    a length argument. The helper will use the new __output_custom()
    facility from perf side with bpf_skb_copy() as callback helper
    to walk and extract the data. It will pass the data for setup
    to bpf_event_output(), which generates and pushes the raw record
    with an additional frag part. The linear data used in the first
    frag of the record serves as programmatically defined metadata
    passed along with the appended sample (a program-side usage sketch
    follows below).

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
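
    A hedged usage sketch of the result from a cls_bpf program: the upper
    32 bits of the flags argument request how many skb bytes to append to
    the sample, so the payload no longer has to be staged on the BPF stack.
    The 'events' perf event array map, the meta struct, and the helper
    declarations are illustrative and assumed to be provided elsewhere:

    struct event_meta {
        __u32 ifindex;
        __u32 pkt_len;
    };

    int handle_skb(struct __sk_buff *skb)
    {
        struct event_meta meta = {
            .ifindex = skb->ifindex,
            .pkt_len = skb->len,
        };
        /* lower bits: CPU selection; upper 32 bits: skb bytes to append */
        __u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);

        bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
        return 0;
    }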
     

12 Jul, 2016

1 commit

  • The Kconfig currently controlling compilation of this code is:

    init/Kconfig:config BPF_SYSCALL
    init/Kconfig: bool "Enable bpf() system call"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the couple of traces of modular infrastructure use, so that
    when reading the code there is no doubt it is builtin-only.

    Note that MODULE_ALIAS is a no-op for non-modular code.

    We replace module.h with init.h since the file does use __init.

    Cc: Alexei Starovoitov
    Cc: netdev@vger.kernel.org
    Signed-off-by: Paul Gortmaker
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Paul Gortmaker
     

02 Jul, 2016

4 commits

  • Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
    belongs to a descendant of a cgroup2. It is similar to the
    feature added in netfilter:
    commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")

    The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY,
    which will then be used by bpf_skb_in_cgroup() (a usage sketch
    follows below).

    The bpf verifier is modified to ensure that BPF_MAP_TYPE_CGROUP_ARRAY
    and bpf_skb_in_cgroup() are always used together.

    Signed-off-by: Martin KaFai Lau
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
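
    A usage sketch: a socket filter checking whether the skb's socket lives
    under the cgroup2 stored at index 0 of a CGROUP_ARRAY. The map
    definition style, SEC() annotation and helper declarations follow the
    bpf samples of the time and are illustrative:

    struct bpf_map_def SEC("maps") cg_map = {
        .type        = BPF_MAP_TYPE_CGROUP_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 1,
    };

    SEC("socket")
    int cg_filter(struct __sk_buff *skb)
    {
        /* 1 if skb->sk belongs to a descendant of the cgroup2 at slot 0 */
        if (bpf_skb_in_cgroup(skb, &cg_map, 0))
            return skb->len;        /* accept */
        return 0;                   /* drop */
    }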
     
  • Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops implementations.
    To update an element, the caller is expected to obtain a cgroup2-backed
    fd by open(cgroup2_dir) and then update the array with that fd (a
    user-space sketch follows below).

    Signed-off-by: Martin KaFai Lau
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
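
    A user-space sketch of the update described above, using the raw bpf(2)
    syscall; the wrapper function and error handling are illustrative:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    static int cgroup_array_set(int map_fd, __u32 index, const char *cg2_path)
    {
        union bpf_attr attr;
        int ret, cg_fd = open(cg2_path, O_RDONLY);

        if (cg_fd < 0)
            return -1;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (__u64)(unsigned long)&index;
        attr.value  = (__u64)(unsigned long)&cg_fd;  /* value is the cgroup2 fd */

        ret = syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
        close(cg_fd);   /* the map takes its own reference on the cgroup */
        return ret;
    }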
     
  • Since bpf_prog_get() and the program type check are used in a couple of
    places, refactor this into a small helper function that we can make use
    of. The non-RO prog->aux part is not used in performance-critical paths
    and a program destruction via RCU is rather unlikely when doing the put,
    so we shouldn't have an issue just doing the bpf_prog_get() +
    prog->type != type check; but not taking the ref at all (since we are
    inside the fdget() / fdput() section of the bpf fd) is even cleaner and
    makes the diff smaller as well, so just go for that (a rough sketch of
    the helper follows below). Callsites are changed to make use of the new
    helper where possible.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
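
    A rough sketch of the helper's shape as described above (names
    approximate): the type is checked while the bpf fd is still held via
    fdget(), so no program reference needs to be taken just to discover a
    mismatch:

    static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *type)
    {
        struct fd f = fdget(ufd);
        struct bpf_prog *prog = ____bpf_prog_get(f);   /* fd -> prog, no ref yet */

        if (IS_ERR(prog))
            goto out;
        if (type && prog->type != *type) {
            prog = ERR_PTR(-EINVAL);
            goto out;
        }
        prog = bpf_prog_inc(prog);      /* take the reference only on success */
    out:
        fdput(f);
        return prog;
    }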
     
  • Jann Horn reported following analysis that could potentially result
    in a very hard to trigger (if not impossible) UAF race, to quote his
    event timeline:

    - Set up a process with threads T1, T2 and T3
    - Let T1 set up a socket filter F1 that invokes another filter F2
    through a BPF map [tail call]
    - Let T1 trigger the socket filter via a unix domain socket write,
    don't wait for completion
    - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
    - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
    - Let T3 close the file descriptor for F2, dropping the reference
    count of F2 to 2
    - At this point, T1 should have looked up F2 from the map, but not
    finished executing it
    - Let T3 remove F2 from the BPF map, dropping the reference count of
    F2 to 1
    - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
    the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
    via schedule_work()
    - At this point, the BPF program could be freed
    - BPF execution is still running in a freed BPF program

    While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
    event fd we're doing the syscall on doesn't disappear from underneath us
    for the whole syscall time, that may not be the case for the bpf fd used
    as an argument only after we did the put. It needs to be a valid fd
    pointing to a BPF program at the time of the call to make the
    bpf_prog_get(), and while T2 gets preempted, F2 must have dropped its
    reference to 1 on the other CPU. The fput() from the close() in T3
    should also add additional delay to the reference drop via
    exit_task_work() when bpf_prog_release() gets called, as well as
    scheduling bpf_prog_free_deferred().

    That said, it nevertheless makes sense to move BPF prog destruction
    generally after an RCU grace period, to guarantee that such a scenario
    as above, but also others as recently fixed in ceb56070359b ("bpf, perf:
    delay release of BPF prog after grace period") with regards to tail
    calls, won't happen. Integrating bpf_prog_free_deferred() directly into
    the RCU callback is not allowed since the invocation might happen from
    either softirq or process context, so we're not permitted to block.
    Reviewing all bpf_prog_put() invocations from the eBPF side (note,
    cBPF -> eBPF progs don't use this for their destruction) with
    call_rcu() looks fine to me (a rough sketch of the resulting put path
    follows below).

    Since we don't know whether at the time of attaching the program we're
    already part of a tail call map, we need to use the RCU variant.
    However, this won't put significantly more stress on the RCU callback
    queue: situations with the above bpf_prog_get() and bpf_prog_put()
    combo in practice normally won't lead to releases, but even if they
    would, enough effort/cycles have to be put into loading a BPF program
    into the kernel already.

    Reported-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
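
    A rough sketch of the resulting put path (simplified; the real callback
    also releases maps and other auxiliary state):

    static void __bpf_prog_put_rcu(struct rcu_head *rcu)
    {
        struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);

        /* An RCU callback may run in softirq context and must not block,
         * so the actual teardown is deferred once more to a workqueue. */
        bpf_prog_free(aux->prog);
    }

    void bpf_prog_put(struct bpf_prog *prog)
    {
        if (atomic_dec_and_test(&prog->aux->refcnt))
            call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
    }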
     

30 Jun, 2016

3 commits

  • Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
    instead of the raw variant. This allows for preemption checks when we
    have DEBUG_PREEMPT, and otherwise resolves to the raw variant anyway. We
    only need to keep the raw variant for socket filters, but we can reuse
    the helper that is already there from the cBPF side (a sketch of the
    two variants follows below).

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
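
    A sketch of the distinction (2016-era helper signatures with five u64
    arguments; the raw variant's name is approximate):

    /* Generic helper: preemption-checked under CONFIG_DEBUG_PREEMPT. */
    static u64 bpf_get_smp_processor_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
    {
        return smp_processor_id();
    }

    /* Socket filters keep the raw variant shared with the cBPF side. */
    static u64 __get_raw_cpu_id(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
    {
        return raw_smp_processor_id();
    }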
     
  • Some minor cleanups: i) Remove the unlikely() from fd array map lookups
    and let the CPU branch predictor do its job; scenarios where a map entry
    is not always present are perfectly valid. ii) Move the attribute type
    check in the bpf_perf_event_read() helper a bit earlier so it's
    consistent wrt checks with the bpf_perf_event_output() helper as well.
    iii) Remove some comments that are self-documenting in
    kprobe_prog_is_valid_access() and thereby make it consistent with
    tp_prog_is_valid_access() as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Several cases of overlapping changes, except the packet scheduler
    conflicts which deal with the addition of the free list parameter
    to qdisc_enqueue().

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Jun, 2016

4 commits

  • The behavior of perf event arrays is quite different from all
    others, as they are tightly coupled to perf event fds, f.e. shown
    recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
    to use struct file") to make refcounting on the perf event more robust.
    A remaining issue in the current code is that additions to the perf
    event array take a reference on the struct file via perf_event_get(),
    and that reference is only released via fput() (which eventually cleans
    up the perf event via perf_event_release_kernel()) when the element is
    either manually removed from the map from user space or automatically
    when the last reference on the perf event map is dropped. This leads to
    dangling struct files when the map gets pinned and the application
    owning the perf event descriptor exits; since the struct file reference
    will in such a case only be dropped manually or via pinned file removal,
    the perf event lives longer than necessary, needlessly consuming
    resources for that time.

    Relations between perf event fds and bpf perf event map fds can be
    rather complex. F.e. maps can act as demuxers among different perf
    event fds that can possibly be owned by different threads and based
    on the index selection from the program, events get dispatched to
    one of the per-cpu fd endpoints. One perf event fd (or, rather a
    per-cpu set of them) can also live in multiple perf event maps at
    the same time, listening for events. Also, another requirement is
    that perf event fds can get closed from the application side after they
    have been attached to the perf event map, so that on exit the perf event
    map will take care of dropping their references eventually. Likewise,
    when such maps are pinned, the intended behavior is that a user
    application does bpf_obj_get(), puts its fds in there and on exit,
    when an fd is released, they are dropped from the map again, so the map
    acts rather as a connector endpoint. This also makes perf event maps
    inherently different from program arrays as described in more detail
    in commit c9da161c6517 ("bpf: fix clearing on persistent program
    array maps").

    To tackle this, map entries are marked with the map struct file that
    added the element to the map. And when the last reference to that map
    struct file is released from user space, the tracked entries
    are purged from the map. This is okay, because new map struct file
    instances, i.e. frontends to the anon inode, are provided via
    bpf_map_new_fd(), which is called when we invoke bpf_obj_get_user()
    for retrieving a pinned map, but also when an initial instance is
    created via map_create(). The rest is resolved by the vfs layer
    automatically for us by keeping a reference count on the map's struct
    file. Any concurrent updates on the map slot are fine as well; it
    just means that perf_event_fd_array_release() needs to delete fewer
    of its own entries (a rough sketch of this follows below).

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
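
    A rough sketch of the tracking described above (simplified; names
    follow the theme of the change, details omitted):

    struct bpf_event_entry {
        struct perf_event *event;
        struct file *perf_file;   /* pins the perf event */
        struct file *map_file;    /* which map frontend added this slot */
    };

    static void perf_event_fd_array_release(struct bpf_map *map,
                                            struct file *map_file)
    {
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        struct bpf_event_entry *ee;
        u32 i;

        rcu_read_lock();
        for (i = 0; i < array->map.max_entries; i++) {
            ee = READ_ONCE(array->ptrs[i]);
            /* Purge only the entries that this map frontend added. */
            if (ee && ee->map_file == map_file)
                fd_array_map_delete_elem(map, &i);
        }
        rcu_read_unlock();
    }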
     
  • This patch extends the map_fd_get_ptr() callback that is used by fd array
    maps, so that the struct file pointer of the related map can be passed
    in. It's safe to remove the map_update_elem() callback for the two maps,
    since updates are only allowed from the syscall side, not from eBPF
    programs, for these two map types. Like in the per-cpu map case,
    bpf_fd_array_map_update_elem() needs to be called directly here due to
    the extra argument.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Add a release callback for maps that is invoked when the last
    reference to its struct file is gone and the struct file is about
    to be released by the vfs. The handler will be used by fd array maps.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The ctx structure passed into bpf programs differs depending on the bpf
    program type. The verifier incorrectly marked ctx->data and ctx->data_end
    accesses based on the ctx offset only. That caused loads in tracing
    programs such as
    int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
    to be incorrectly marked as PTR_TO_PACKET, which later caused the
    verifier to reject a program that was actually valid in the tracing
    context. Fix this by doing program-type-specific matching of ctx offsets
    (an illustrative tracing fragment follows below).

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Reported-by: Sasha Goldshtein
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
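
    The kind of tracing program the fix un-breaks; the SEC() annotation and
    the x86-64 pt_regs field name follow the bpf samples conventions and
    are illustrative:

    #include <linux/ptrace.h>

    SEC("kprobe/sys_write")
    int trace_write(struct pt_regs *ctx)
    {
        /* For BPF_PROG_TYPE_KPROBE the context is struct pt_regs, so this
         * load must be treated as a plain value; pt_regs fields can land
         * at the same ctx offsets as __sk_buff's data/data_end, which is
         * what confused the verifier before type-specific matching. */
        long ax = ctx->ax;

        return ax > 0;
    }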
     

01 Jun, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi.

    2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized
    properly. From Ezequiel Garcia.

    3) Missing spinlock init in mvneta driver, from Gregory CLEMENT.

    4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT.

    5) Fix deadlock on team->lock when propagating features, from Ivan
    Vecera.

    6) Work around buffer offset hw bug in alx chips, from Feng Tang.

    7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin
    Long.

    8) Various statistics bug fixes in mlx4 from Eric Dumazet.

    9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann.

    10) All of l2tp was namespace aware, but the ipv6 support code was not.
    From Shmulik Ladkani.

    11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck.

    12) Propagate MAC changes properly through VLAN devices, from Mike
    Manning.

    13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
    sfc: Track RPS flow IDs per channel instead of per function
    usbnet: smsc95xx: fix link detection for disabled autonegotiation
    virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv
    bnx2x: avoid leaking memory on bnx2x_init_one() failures
    fou: fix IPv6 Kconfig options
    openvswitch: update checksum in {push,pop}_mpls
    sctp: sctp_diag should dump sctp socket type
    net: fec: update dirty_tx even if no skb
    vlan: Propagate MAC address to VLANs
    atm: iphase: off by one in rx_pkt()
    atm: firestream: add more reserved strings
    vxlan: Accept user specified MTU value when create new vxlan link
    net: pktgen: Call destroy_hrtimer_on_stack()
    timer: Export destroy_hrtimer_on_stack()
    net: l2tp: Make l2tp_ip6 namespace aware
    Documentation: ip-sysctl.txt: clarify secure_redirects
    sfc: use flow dissector helpers for aRFS
    ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr
    net: nps_enet: Disable interrupts before napi reschedule
    net/lapb: tuse %*ph to dump buffers
    ...

    Linus Torvalds
     

30 May, 2016

1 commit

  • Additionally to being able to control the system wide maximum depth via
    /proc/sys/kernel/perf_event_max_stack, now we are able to ask for
    different depths per event, using perf_event_attr.sample_max_stack for
    that.

    This uses a u16 hole at the end of perf_event_attr: when
    perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set, a
    sample_max_stack of zero means use perf_event_max_stack; otherwise
    the value is bounds checked under callchain_mutex (a user-space
    sketch follows below).

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
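
    A user-space sketch of the new per-event control; the event type and
    values are illustrative, and uapi headers new enough to carry the field
    are assumed:

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_callchain_event(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size             = sizeof(attr);
        attr.type             = PERF_TYPE_SOFTWARE;
        attr.config           = PERF_COUNT_SW_CPU_CLOCK;
        attr.sample_period    = 100000;
        attr.sample_type      = PERF_SAMPLE_CALLCHAIN;
        attr.sample_max_stack = 32;  /* 0: fall back to perf_event_max_stack */

        /* this task, any CPU, no group, no flags */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }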
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

24 May, 2016

1 commit

  • Follow-up to commit e27f4a942a0e ("bpf: Use mount_nodev not mount_ns
    to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.

    The original idea was to have a per mountns instance instead of a
    single global fs instance, but that didn't work out and we had to
    switch to mount_nodev() model. The intent of that middle ground was
    that we avoid users who don't play nice to create endless instances
    of bpf fs which are difficult to control and discover from an admin
    point of view, but at the same time it would have allowed us to be
    more flexible with regard to namespaces.

    Therefore, since we now did the switch to mount_nodev() as a fix
    where individual instances are created, we also need to remove userns
    mount flag along with it to avoid running into mentioned situation.
    I don't expect any breakage at this early point in time with removing
    the flag and we can revisit this later should the requirement for
    this come up with future users. This and commit e27f4a942a0e have
    been split to facilitate tracking should any of them run into the
    unlikely case of causing a regression.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 May, 2016

4 commits

  • Humans don't write C code like:
    u8 *ptr = skb->data;
    int imm = 4;
    imm += ptr;
    but from the llvm backend's point of view 'imm' and 'ptr' are registers,
    and imm += ptr may be preferred vs ptr += imm depending on which register
    value will be used further in the code, while the verifier could only
    recognize ptr += imm.
    That caused small unrelated changes in the C code of the bpf program to
    trigger rejection by the verifier. Therefore teach the verifier to recognize
    both ptr += imm and imm += ptr.
    For example:
    when R6=pkt(id=0,off=0,r=62) R7=imm22
    after r7 += r6 instruction
    will be R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62)

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • when packet headers are accessed in 'decreasing' order (like TCP port
    may be fetched before the program reads IP src) the llvm may generate
    the following code:
    [...] // R7=pkt(id=0,off=22,r=70)
    r2 = *(u32 *)(r7 +0) // good access
    [...]
    r7 += 40 // R7=pkt(id=0,off=62,r=70)
    r8 = *(u32 *)(r7 +0) // good access
    [...]
    r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
    // it's doing *(u32*)(skb->data + 42)
    Fix the verifier to recognize such a code pattern.

    It also turned out that the 'off > range' condition is not a verifier bug.
    It's a buggy program that may do something like:
    if (ptr + 50 > data_end)
        return 0;
    ptr += 60;
    *(u32*)ptr;
    In such a case, emit the
    "invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error message,
    so all information is available for the program author to fix the program.

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
    bpf filesystem. Looking at the code I saw a broken usage of mount_ns
    with current->nsproxy->mnt_ns. As the code does not acquire a
    reference to the mount namespace it can not possibly be correct to
    store the mount namespace on the superblock as it does.

    Replace mount_ns with mount_nodev so that each mount of the bpf
    filesystem returns a distinct instance, and the code is not buggy.

    In discussion with Hannes Frederic Sowa it was reported that the use
    of mount_ns was an attempt to have one bpf instance per mount
    namespace, in an attempt to keep resources that pin resources from
    hiding. That intent simply does not work, the vfs is not built to
    allow that kind of behavior. Which means that the bpf filesystem
    really is buggy both semantically and in its implementation, as it does
    not, nor can it, implement the original intent.

    This change is userspace visible, but my experience with similar
    filesystems leads me to believe nothing will break with a model where
    each mount of the bpf filesystem is distinct from all others.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Cc: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: "Eric W. Biederman"
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Start address randomization and blinding in BPF currently use
    prandom_u32(). prandom_u32() values are not exposed to unprivileged
    user space to my knowledge, but given that other kernel facilities such
    as ASLR, stack canaries, etc. make use of the stronger get_random_int(),
    we better make use of it here as well, since blinding successively
    requests new random values. get_random_int() has minimal entropy pool
    depletion and is not cryptographically secure, but doesn't need to be
    for our use cases here.

    Suggested-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 May, 2016

1 commit

  • Pull misc vfs cleanups from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coredump: only charge written data against RLIMIT_CORE
    coredump: get rid of coredump_params->written
    ecryptfs_lookup(): try either only encrypted or plaintext name
    ecryptfs: avoid multiple aliases for directories
    bpf: reject invalid names right in ->lookup()
    __d_alloc(): treat NULL name as QSTR("/", 1)
    mtd: switch ubi_open_volume_path() to vfs_stat()
    mtd: switch open_mtd_by_chdev() to use of vfs_stat()

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets; unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

5 commits

  • This makes perf_callchain_{user,kernel}() receive the max stack
    as context for the perf_callchain_entry, instead of accessing
    the global sysctl_perf_event_max_stack.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo list for quite some time now to further make spraying harder; a
    first implementation has existed since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. There are no
    further hardening levels of the bpf_jit_harden switch intended; the
    rationale is to have it dead simple to use as an on/off switch. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon in the ppc64 JIT, so that native eBPF
    can be blinded as well as cBPF to eBPF migrations, and both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. Just like
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <op> BPF_REG_AX, so the actual operation on the target register
    is translated from a BPF_K into a BPF_X one that operates on
    BPF_REG_AX's content. During the rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction (a
    condensed sketch of this rewrite follows below).
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + <x>:
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + <x>:
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted the interpreter and
    redirected the blinded eBPF image to the interpreter; here as well, all
    tests pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
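
    A condensed sketch of the per-instruction rewrite for a 32-bit ALU op
    with an immediate; the real bpf_jit_blind_insn() covers many more
    instruction classes, including the 64-bit immediate load split, and the
    function name here is illustrative:

    static int blind_alu32_imm(const struct bpf_insn *insn, struct bpf_insn *out)
    {
        u32 rnd = prandom_u32();   /* fresh value per rewritten instruction */

        /* BPF_REG_AX := RND ^ K  -- the constant never appears verbatim */
        out[0] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, rnd ^ insn->imm);
        /* BPF_REG_AX ^= RND      -- recovers K at run time */
        out[1] = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, rnd);
        /* original BPF_K op becomes a BPF_X op on BPF_REG_AX */
        out[2] = BPF_ALU32_REG(BPF_OP(insn->code), insn->dst_reg, BPF_REG_AX);

        return 3;   /* number of instructions emitted */
    }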
     
  • Since the blinding is strictly only called from inside eBPF JITs,
    we need to change the signatures for bpf_int_jit_compile() and
    bpf_prog_select_runtime() first, in order to prepare for the fact
    that the eBPF program we're dealing with can change underneath us.
    Hence, call sites need to use the returned, latest prog (a call-site
    sketch follows below). No functional change in this patch.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
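
    The shape of a call site after this change; the wrapper function is
    hypothetical, the point being that only the returned program pointer
    may be used afterwards, since a later blinding step can replace the
    program underneath:

    static struct bpf_prog *finalize_prog(struct bpf_prog *fp)
    {
        int err;

        /* May return a different bpf_prog than was passed in, so only
         * use the returned pointer from here on. */
        fp = bpf_prog_select_runtime(fp, &err);
        if (err) {
            bpf_prog_put(fp);
            return ERR_PTR(err);
        }
        return fp;
    }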
     
  • Move the functionality to patch instructions out of the verifier
    code and into the core as the new bpf_patch_insn_single() helper
    will be needed later on for blinding as well. No changes in
    functionality.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Besides others, remove redundant comments where the code is self
    documenting enough, and properly indent various bpf_verifier_ops
    and bpf_prog_type_list declarations. Moreover, remove two exports
    that actually have no module user.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

07 May, 2016

3 commits

  • Since the UNKNOWN_VALUE type is weaker than CONST_IMM, we can un-teach
    the verifier its recognition of constants in conditional branches
    without affecting safety.
    Ex:
    if (reg == 123) {
        .. here the verifier was marking reg->type as CONST_IMM;
        instead, keep reg as UNKNOWN_VALUE
    }

    Two verifier states with UNKNOWN_VALUE are equivalent, whereas
    CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
    verification and other cases.
    So help search pruning by marking registers as UNKNOWN_VALUE
    where possible instead of CONST_IMM.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Extended BPF carried over two instructions from classic to access
    packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
    but due to their design they have to do length check for every access.
    When BPF is processing 20M packets per second single LD_ABS after JIT
    is consuming 3% cpu. Hence the need to optimize it further by amortizing
    the cost of 'off < skb_headlen' over multiple packet accesses.
    One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
    with similar usage as skb_header_pointer().
    The kernel part for interpreter and x64 JIT was implemented in [1], but such
    new insns behave like old ld_abs and abort the program with 'return 0' if
    access is beyond linear data. Such hidden control flow is hard to work
    around, plus changing JITs and rolling out a new llvm is inconvenient.

    Therefore allow cls_bpf/act_bpf programs to access skb->data directly:

    int bpf_prog(struct __sk_buff *skb)
    {
        struct iphdr *ip;

        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;

        ip = skb->data + ETH_HLEN;

        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
    }

    This solution avoids introduction of new instructions. llvm stays
    the same and all JITs stay the same, but verifier has to work extra hard
    to prove safety of the above program.

    For XDP the direct store instructions can be allowed as well.

    The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
    the alignment. Complex packet parsers where the packet pointer is adjusted
    incrementally cannot be tracked for alignment, so allow byte access in such
    cases and misaligned access on architectures that define
    efficient_unaligned_access.

    [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Clean up the verifier code and prepare it for the addition of the
    "pointer to packet" logic.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

05 May, 2016

1 commit