03 Apr, 2014

3 commits

  • Pull networking updates from David Miller:
    "Here is my initial pull request for the networking subsystem during
    this merge window:

    1) Support for ESN in AH (RFC 4302) from Fan Du.

    2) Add full kernel doc for ethtool command structures, from Ben
    Hutchings.

    3) Add BCM7xxx PHY driver, from Florian Fainelli.

    4) Export computed TCP rate information in netlink socket dumps, from
    Eric Dumazet.

    5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
    Dichtel.

    6) Convert many drivers to pci_enable_msix_range(), from Alexander
    Gordeev.

    7) Record SKB timestamps more efficiently, from Eric Dumazet.

    8) Switch to microsecond resolution for TCP round trip times, also
    from Eric Dumazet.

    9) Clean up and fix 6lowpan fragmentation handling by making use of
    the existing inet_frag API for its implementation.

    10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.

    11) Auto size SKB lengths when composing netlink messages based upon
    past message sizes used, from Eric Dumazet.

    12) qdisc dumps can take a long time, add a cond_resched(), from Eric
    Dumazet.

    13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
    Get rid of never-used-in-tree netpoll RX handling. From Eric W
    Biederman.

    14) Support inter-address-family and namespace changing in VTI tunnel
    driver(s). From Steffen Klassert.

    15) Add Altera TSE driver, from Vince Bridgers.

    16) Optimizing csum_replace2() so that it doesn't adjust the checksum
    by checksumming the entire header, from Eric Dumazet.

    17) Expand BPF internal implementation for faster interpreting, more
    direct translations into JIT'd code, and much cleaner uses of BPF
    filtering in non-socket contexts. From Daniel Borkmann and Alexei
    Starovoitov"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
    netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
    net: Add a test to see if a skb is freeable in irq context
    qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
    net: ptp: move PTP classifier in its own file
    net: sxgbe: make "core_ops" static
    net: sxgbe: fix logical vs bitwise operation
    net: sxgbe: sxgbe_mdio_register() frees the bus
    Call efx_set_channels() before efx->type->dimension_resources()
    xen-netback: disable rogue vif in kthread context
    net/mlx4: Set proper build dependancy with vxlan
    be2net: fix build dependency on VxLAN
    mac802154: make csma/cca parameters per-wpan
    mac802154: allow only one WPAN to be up at any given time
    net: filter: minor: fix kdoc in __sk_run_filter
    netlink: don't compare the nul-termination in nla_strcmp
    can: c_can: Avoid led toggling for every packet.
    can: c_can: Simplify TX interrupt cleanup
    can: c_can: Store dlc private
    can: c_can: Reduce register access
    can: c_can: Make the code readable
    ...

    Linus Torvalds
     
  • Pull HID updates from Jiri Kosina:
    - substantial cleanup of the generic and transport layers, in the
    direction of an ultimate goal of making struct hid_device completely
    transport independent, by Benjamin Tissoires
    - cp2112 driver from David Barksdale
    - a lot of fixes and new hardware support (Dualshock 4) to hid-sony
    driver, by Frank Praznik
    - support for Win 8.1 multitouch protocol by Andrew Duggan
    - other smaller fixes / device ID additions

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (75 commits)
    HID: sony: fix force feedback mismerge
    HID: sony: Set the quriks flag for Bluetooth controllers
    HID: sony: Fix Sixaxis cable state detection
    HID: uhid: Add UHID_CREATE2 + UHID_INPUT2
    HID: hyperv: fix _raw_request() prototype
    HID: hyperv: Implement a stub raw_request() entry point
    HID: hid-sensor-hub: fix sleeping function called from invalid context
    HID: multitouch: add support for Win 8.1 multitouch touchpads
    HID: remove hid_output_raw_report transport implementations
    HID: sony: do not rely on hid_output_raw_report
    HID: cp2112: remove the last hid_output_raw_report() call
    HID: cp2112: remove various hid_out_raw_report calls
    HID: multitouch: add support of other generic collections in hid-mt
    HID: multitouch: remove pen special handling
    HID: multitouch: remove registered devices with default behavior
    HID: hidp: Add a comment that some devices depend on the current behavior of uniq
    HID: sony: Prevent duplicate controller connections.
    HID: sony: Perform a boundry check on the sixaxis battery level index.
    HID: sony: Fix work queue issues
    HID: sony: Fix multi-line comment styling
    ...

    Linus Torvalds
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual rocket science -- mostly documentation and comment updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    sparse: fix comment
    doc: fix double words
    isdn: capi: fix "CAPI_VERSION" comment
    doc: DocBook: Fix typos in xml and template file
    Bluetooth: add module name for btwilink
    driver core: unexport static function create_syslog_header
    mmc: core: typo fix in printk specifier
    ARM: spear: clean up editing mistake
    net-sysfs: fix comment typo 'CONFIG_SYFS'
    doc: Insert MODULE_ in module-signing macros
    Documentation: update URL to hfsplus Technote 1150
    gpio: update path to documentation
    ixgbe: Fix format string in ixgbe_fcoe.
    Kconfig: Remove useless "default N" lines
    user_namespace.c: Remove duplicated word in comment
    CREDITS: fix formatting
    treewide: Fix typo in Documentation/DocBook
    mm: Fix warning on make htmldocs caused by slab.c
    ata: ata-samsung_cf: cleanup in header file
    idr: remove unused prototype of idr_free()

    Linus Torvalds
     

02 Apr, 2014

8 commits

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     
  • Replace the test in zap_completion_queue to test when it is safe to
    free skbs in hard irq context with skb_irq_freeable ensuring we only
    free skbs when it is safe, and removing the possibility of subtle
    problems.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This commit fixes a build error reported by Fengguang, that is
    triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

    ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

    The fix is to introduce its own file for the PTP BPF classifier,
    so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
    it independently from each other. IXP4xx driver on ARM needs to
    select it as well since it does not seem to select PTP_1588_CLOCK
    or similar that would pull it in automatically.

    This also allows for hiding all of the internals of the BPF PTP
    program inside that file, and only exporting relevant API bits
    to drivers.

    This patch also adds a kdoc documentation of ptp_classify_raw()
    API to make it clear that it can return PTP_CLASS_* defines. Also,
    the BPF program has been translated into bpf_asm code, so that it
    can be more easily read and altered (extensively documented in [1]).

    In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
    tools, so the commented program can simply be translated via
    `./bpf_asm -c prog` where prog is a file that contains the
    commented code. This makes it easily readable/verifiable and when
    there's a need to change something, jump offsets etc do not need
    to be replaced manually which can be very error prone. Instead,
    a newly translated version via bpf_asm can simply replace the old
    code. I have checked opcode diffs before/after and it's the very
    same filter.

    [1] Documentation/networking/filter.txt

    Fixes: 164d8c666521 ("net: ptp: do not reimplement PTP/BPF classifier")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit 9b2777d6089bcd (ieee802154: add TX power control to wpan_phy)
    and following erroneously added CSMA and CCA parameters for 802.15.4
    devices as PHY parameters, while they are actually MAC parameters and
    can differ for any two WPAN instances. Since it is now sensible to have
    multiple WPAN devices with differing CSMA/CCA parameters, make these
    parameters MAC parameters instead.

    Signed-off-by: Phoebe Buckheister
    Signed-off-by: David S. Miller

    Phoebe Buckheister
     
  • All 802.15.4 PHY devices with drivers in tree can support only one WPAN
    at any given time, yet the stack allows arbitrarily many WPAN devices to
    be created and up at the same time. This cannot work with what the
    hardware provides, and in the current implementation, provides an easy
    DoS vector to any process on the system that may call socket() and
    sendmsg().

    Thus, allow only one WPAN per PHY to be up at once, just like mac80211
    does for managed devices.

    Signed-off-by: Phoebe Buckheister
    Signed-off-by: David S. Miller

    Phoebe Buckheister
     
  • This minor patch fixes the following warning when doing
    a `make htmldocs`:

    DOCPROC Documentation/DocBook/networking.xml
    Warning(.../net/core/filter.c:135): No description found for parameter 'insn'
    Warning(.../net/core/filter.c:135): Excess function parameter 'fentry' description in '__sk_run_filter'
    HTML Documentation/DocBook/networking.html

    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Conflicts:
    drivers/hid/hid-ids.h
    drivers/hid/hid-sony.c
    drivers/hid/i2c-hid/i2c-hid.c

    Jiri Kosina
     
  • Jiri Kosina
     

01 Apr, 2014

14 commits

  • Pull workqueue changes from Tejun Heo:
    "PREPARE_[DELAYED_]WORK() were used to change the work function of work
    items without fully reinitializing it; however, this makes workqueue
    consider the work item as a different one from before and allows the
    work item to start executing before the previous instance is finished
    which can lead to extremely subtle issues which are painful to debug.

    The interface has never been popular. This pull request contains
    patches to remove existing usages and kill the interface. As one of
    the changes was routed during the last devel cycle and another
    depended on a pending change in nvme, for-3.15 contains a couple merge
    commits.

    In addition, interfaces which were deprecated quite a while ago -
    __cancel_delayed_work() and WQ_NON_REENTRANT - are removed too"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: remove deprecated WQ_NON_REENTRANT
    workqueue: Spelling s/instensive/intensive/
    workqueue: remove PREPARE_[DELAYED_]WORK()
    staging/fwserial: don't use PREPARE_WORK
    afs: don't use PREPARE_WORK
    nvme: don't use PREPARE_WORK
    usb: don't use PREPARE_DELAYED_WORK
    floppy: don't use PREPARE_[DELAYED_]WORK
    ps3-vuart: don't use PREPARE_WORK
    wireless/rt2x00: don't use PREPARE_WORK in rt2800usb.c
    workqueue: Remove deprecated __cancel_delayed_work()

    Linus Torvalds
     
  • Pull s390 compat wrapper rework from Heiko Carstens:
    "S390 compat system call wrapper simplification work.

    The intention of this work is to get rid of all hand written assembly
    compat system call wrappers on s390, which perform proper sign or zero
    extension, or pointer conversion of compat system call parameters.
    Instead all of this should be done with C code eg by using Al's
    COMPAT_SYSCALL_DEFINEx() macro.

    Therefore all common code and s390 specific compat system calls have
    been converted to the COMPAT_SYSCALL_DEFINEx() macro.

    In order to generate correct code all compat system calls may only
    have eg compat_ulong_t parameters, but no unsigned long parameters.
    Those patches which change parameter types from unsigned long to
    compat_ulong_t parameters are separate in this series, but shouldn't
    cause any harm.

    The only compat system calls which intentionally have 64 bit
    parameters (preadv64 and pwritev64) in support of the x86/32 ABI
    haven't been changed, but are now only available if an architecture
    defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

    System calls which do not have a compat variant but still need proper
    zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
    get a proper wrapper function with the new s390 specific
    COMPAT_SYSCALL_WRAPx() macro:

    COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

    which generates the following code (simplified):

    asmlinkage long sys_brk(unsigned long brk);
    asmlinkage long compat_sys_brk(long brk)
    {
            return sys_brk((u32)brk);
    }

    Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
    includes both linux/syscall.h and linux/compat.h, it will generate
    build errors, if the declaration of sys_brk() doesn't match, or if
    there exists a non-matching compat_sys_brk() declaration.

    In addition this will intentionally result in a link error if
    somewhere else a compat_sys_brk() function exists, which probably
    should have been used instead. Two more BUILD_BUG_ONs make sure the
    size and type of each compat syscall parameter can be handled
    correctly with the s390 specific macros.

    I converted the compat system calls step by step to verify the
    generated code is correct and matches the previous code. In fact it
    did not always match, however that was always a bug in the hand
    written asm code.

    As a result we get less code, fewer bugs, and much more sanity checking"
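
    The zero extension performed by the generated wrapper can be tried out in
    user space. The following is a minimal sketch, not kernel code: the names
    fake_sys_brk, fake_compat_sys_brk and compat_wrap_demo are hypothetical
    stand-ins, and fixed-width stdint types replace the kernel's u32 and long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the native syscall: returns its argument
 * so the caller can inspect the value that actually arrived. */
uint64_t fake_sys_brk(uint64_t brk)
{
	return brk;
}

/* Same shape as the simplified wrapper in the message: the argument
 * arrives in a 64-bit register, and the cast to a 32-bit unsigned type
 * discards the upper half, so the native call sees a zero-extended value. */
uint64_t fake_compat_sys_brk(int64_t brk)
{
	return fake_sys_brk((uint32_t)brk);
}

/* Small self-check: stale upper bits in the register do not leak through. */
void compat_wrap_demo(void)
{
	assert(fake_compat_sys_brk(-1) == 0xffffffffULL);
}
```

    Passing -1 models a register whose upper 32 bits are all set; after the
    cast only the lower 32 bits reach the native function.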

    * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
    s390/compat: add copyright statement
    compat: include linux/unistd.h within linux/compat.h
    s390/compat: get rid of compat wrapper assembly code
    s390/compat: build error for large compat syscall args
    mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: convert to COMPAT_SYSCALL_DEFINE
    security/compat: convert to COMPAT_SYSCALL_DEFINE
    mm/compat: convert to COMPAT_SYSCALL_DEFINE
    net/compat: convert to COMPAT_SYSCALL_DEFINE
    kernel/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: optional preadv64/pwrite64 compat system calls
    ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
    s390/compat: partial parameter conversion within syscall wrappers
    s390/compat: automatic zero, sign and pointer conversion of syscalls
    s390/compat: add sync_file_range and fallocate compat syscalls
    ...

    Linus Torvalds
     
  • John W. Linville says:

    ====================
    pull request: wireless-next 2014-03-31

    Please accept this one last round of general wireless updates for
    the 3.15 merge window!

    For the Bluetooth bits, Gustavo says:

    "Here follow another set of patches to 3.15. This is mostly a bug fix
    pull request with the exception of one commit from Marcel which adds
    tracking to the current configured LE scan type parameter."

    Beyond that, notable bits include some final refactoring of rtl8180
    and the addition of the rtl8187se driver, fixes for a number of
    problems identified by Dan Carpenter and his static analysis tools,
    and a handful of other bits here and there.

    Please let me know if there are problems!
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • After commit c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify
    processing to workqueue") some counters are now updated in process context
    and thus need to disable bh before doing so, otherwise deadlocks can
    happen on 32-bit archs. Fabio Estevam noticed this while mounting
    an NFS volume on an ARM board.

    As a compensation for missing this I looked after the other *_STATS_BH
    and found three other calls which need updating:

    1) icmp6_send: ip6_fragment -> icmpv6_send -> icmp6_send (error handling)
    2) ip6_push_pending_frames: rawv6_sendmsg -> rawv6_push_pending_frames -> ...
    (only in case of icmp protocol with raw sockets in error handling)
    3) ping6_v6_sendmsg (error handling)

    Fixes: c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue")
    Reported-by: Fabio Estevam
    Tested-by: Fabio Estevam
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • First off, we don't need to check for non-NULL rt any more, as we are
    guaranteed to always get a valid rt6_info. Drop the check.

    In case we couldn't allocate an inet_peer for fragmentation information
    we currently generate strictly incrementing fragmentation ids for all
    destinations. This is done to maximize the cycle and avoid collisions.

    Those fragmentation ids are very predictable. At least we should try to
    mix in the destination address.

    While it should make no difference to simply use a PRNG at this point,
    secure_ipv6_id ensures that we don't leak information from prandom,
    which could otherwise make its internal state recoverable.

    This fallback function should normally not get used thus this should
    not affect performance at all. It is just meant as a safety net.

    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • The main difference between napi_frags_skb() and napi_gro_receive() is that
    the latter is called after the Ethernet header has already been pulled by
    the NIC driver (eth_type_trans() was called before napi_gro_receive()).

    Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the
    upcoming tunneling support") tried to remove this difference by calling
    eth_type_trans() from napi_frags_skb() instead of doing this later from
    napi_frags_finish()

    Goal was that napi_gro_complete() could call
    ptype->callbacks.gro_complete(skb, 0) (offset of first network header =
    0)

    Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to
    point to their own header, for the current skb and ones held in gro_list

    Problem is this cleanup work defeated the frag0 optimization:
    It turns out the consecutive pskb_may_pull() calls are too expensive.

    This patch brings back the frag0 stuff in napi_frags_skb().

    As all skb have their mac header in skb head, we no longer need
    skb_gro_mac_header()

    Reported-by: Michal Schmidt
    Fixes: 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support")
    Signed-off-by: Eric Dumazet
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Binding might result in a NULL device which is later dereferenced
    without checking.

    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     
  • This allows monitoring carrier on/off transitions and detecting link
    flapping issues:
    - new /sys/class/net/X/carrier_changes
    - new rtnetlink IFLA_CARRIER_CHANGES (getlink)

    Tested:
    - grep . /sys/class/net/*/carrier_changes
    + ip link set dev X down/up
    + plug/unplug cable
    - updated iproute2: prints IFLA_CARRIER_CHANGES
    - iproute2 20121211-2 (debian): unchanged behavior
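
    A user-space reader for the new sysfs attribute might look like the
    sketch below. It assumes the file holds a single decimal counter
    followed by a newline; the function names parse_carrier_changes and
    read_carrier_changes are hypothetical and not part of any kernel or
    iproute2 API.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses the text assumed to be found in the sysfs file: an unsigned
 * decimal number, optionally followed by a newline.
 * Returns 0 on success and -1 on malformed input. */
int parse_carrier_changes(const char *text, unsigned long *count)
{
	char *end;
	unsigned long value;

	assert(text != NULL && count != NULL);
	if (text[0] < '0' || text[0] > '9')
		return -1;              /* rejects empty input, signs, spaces */
	errno = 0;
	value = strtoul(text, &end, 10);
	if (errno != 0)
		return -1;              /* out of range */
	if (*end != '\0' && *end != '\n')
		return -1;              /* trailing garbage */
	*count = value;
	return 0;
}

/* Reads /sys/class/net/<ifname>/carrier_changes.
 * Returns 0 on success and -1 if the file is missing or unreadable. */
int read_carrier_changes(const char *ifname, unsigned long *count)
{
	char path[128];
	char buf[32];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/class/net/%s/carrier_changes", ifname);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_carrier_changes(buf, count);
}
```

    Sampling the counter twice and comparing the two values is one simple
    way to notice a flapping link between the samples.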

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    david decotigny
     
  • The issue arises when adding a policy route that specifies a particular
    NIC as oif: the policy route does not take effect. The reason is
    that fl6.oif is not set, so the route map fails. In the
    tcp_v6_send_response function, fl6.oif is set if the bound address is
    link-local, but not if it is a global address.

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Wang Yufen
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • Move the whole rt6_need_strict as static inline into ip6_route.h,
    so that it can be reused

    Signed-off-by: Wang Yufen
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • Signed-off-by: Wang Yufen
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • Use existing function instead of trying to use our own.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • …wireless-next into for-davem

    John W. Linville
     

31 Mar, 2014

6 commits

  • This patch replaces/reworks the kernel-internal BPF interpreter with
    an optimized BPF instruction set format that is modelled closer to
    mimic native instruction sets and is designed to be JITed with one to
    one mapping. Thus, the new interpreter is noticeably faster than the
    current implementation of sk_run_filter(); mainly for two reasons:

    1. Fall-through jumps:

    BPF jump instructions are forced to go either 'true' or 'false'
    branch which causes branch-miss penalty. The new BPF jump
    instructions have only one branch and fall-through otherwise,
    which fits the CPU branch predictor logic better. `perf stat`
    shows drastic difference for branch-misses between the old and
    new code.

    2. Jump-threaded implementation of interpreter vs switch
    statement:

    Instead of single table-jump at the top of 'switch' statement,
    gcc will now generate multiple table-jump instructions, which
    helps CPU branch predictor logic.
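
    The jump-threaded dispatch described in point 2 can be sketched with the
    labels-as-values extension supported by GCC and Clang. The toy opcodes,
    struct toy_insn and toy_run() below are hypothetical and unrelated to
    the kernel's actual instruction set; the sketch only shows the dispatch
    pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are not BPF opcodes. */
enum toy_op { TOY_ADD_IMM, TOY_MUL_IMM, TOY_RET, TOY_OP_MAX };

struct toy_insn {
	uint8_t op;
	int32_t imm;
};

/* Runs a toy program on a single accumulator. Every handler ends with
 * its own indirect jump, so the compiler emits one table jump per
 * handler rather than a single shared one at the top of a switch. */
int64_t toy_run(const struct toy_insn *insn, int64_t acc)
{
	static const void *jump_table[TOY_OP_MAX] = {
		[TOY_ADD_IMM] = &&do_add,
		[TOY_MUL_IMM] = &&do_mul,
		[TOY_RET]     = &&do_ret,
	};

/* Look up the handler for the current instruction and jump to it. */
#define DISPATCH()                              \
	do {                                    \
		assert(insn->op < TOY_OP_MAX);  \
		goto *jump_table[insn->op];     \
	} while (0)

	DISPATCH();
do_add:
	acc += insn->imm;
	insn++;
	DISPATCH();
do_mul:
	acc *= insn->imm;
	insn++;
	DISPATCH();
do_ret:
	return acc;
#undef DISPATCH
}
```

    Because each handler has a separate indirect branch, the branch
    predictor can learn which opcode tends to follow which, instead of
    sharing one prediction slot for every dispatch.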

    Note that the verification of filters is still being done through
    sk_chk_filter() in classical BPF format, so filters from user- or
    kernel space are verified in the same way as we do now, and same
    restrictions/constraints hold as well.

    We reuse current BPF JIT compilers in a way that this upgrade would
    even be fine as is, but nevertheless allows for a successive upgrade
    of BPF JIT compilers to the new format.

    The internal instruction set migration is being done after the
    probing for JIT compilation, so in case JIT compilers are able to
    create a native opcode image, we're going to use that, and in all
    other cases we're doing a follow-up migration of the BPF program's
    instruction set, so that it can be transparently run in the new
    interpreter.

    In short, the *internal* format extends BPF in the following way (more
    details can be taken from the appended documentation):

    - Number of registers increases from 2 to 10
    - Register width increases from 32-bit to 64-bit
    - Conditional jt/jf targets replaced with jt/fall-through
    - Adds signed > and >= insns
    - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
    - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
    - Adds arithmetic right shift and endianness conversion insns
    - Adds atomic_add insn
    - Old tax/txa insns are replaced with 'mov dst,src' insn
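
    The replacement of jt/jf targets with jt/fall-through can be illustrated
    with a toy rewrite of one conditional jump. This is a sketch under
    simplifying assumptions, not the kernel's conversion routine: all struct
    and function names are hypothetical, only an equality test is handled,
    and the offsets assume that no other instruction in the program changes
    size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Classic-style conditional: two relative targets, one per outcome. */
struct classic_jeq {
	uint32_t k;    /* constant compared against the accumulator */
	uint8_t jt;    /* instructions to skip when equal */
	uint8_t jf;    /* instructions to skip when not equal */
};

/* Hypothetical single-target form: skip 'off' instructions when the
 * condition holds, otherwise fall through to the next instruction. */
enum toy_cond { TOY_JEQ, TOY_JNE, TOY_JA };

struct toy_jmp {
	enum toy_cond cond;
	uint32_t imm;
	int16_t off;
};

/* Rewrites one classic jump into one or two single-target jumps and
 * returns how many entries of out[] were written. */
int toy_convert_jeq(const struct classic_jeq *in, struct toy_jmp out[2])
{
	assert(in != NULL && out != NULL);
	if (in->jf == 0) {
		/* The false branch already falls through. */
		out[0] = (struct toy_jmp){ TOY_JEQ, in->k, in->jt };
		return 1;
	}
	if (in->jt == 0) {
		/* Invert the test so that the true branch falls through. */
		out[0] = (struct toy_jmp){ TOY_JNE, in->k, in->jf };
		return 1;
	}
	/* Both targets are non-zero: append an unconditional jump for the
	 * false case and let the conditional jump skip over it. */
	out[0] = (struct toy_jmp){ TOY_JEQ, in->k, (int16_t)(in->jt + 1) };
	out[1] = (struct toy_jmp){ TOY_JA, 0, in->jf };
	return 2;
}

/* Number of instructions skipped after executing 'j' with accumulator 'a'. */
int toy_skip(const struct toy_jmp *j, uint32_t a)
{
	bool taken = j->cond == TOY_JA ||
		     (j->cond == TOY_JEQ && a == j->imm) ||
		     (j->cond == TOY_JNE && a != j->imm);

	return taken ? j->off : 0;
}
```

    In the two-instruction case the true path skips jt + 1 instructions,
    one more than before, because it has to step over the added
    unconditional jump.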

    Performance of two BPF filters generated by libpcap resp. bpf_asm
    was measured on x86_64, i386 and arm32 (other libpcap programs
    have similar performance differences):

    fprog #1 is taken from Documentation/networking/filter.txt:
    tcpdump -i eth0 port 22 -dd

    fprog #2 is taken from 'man tcpdump':
    tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd

    Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
    same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
    smaller is better:

    --x86_64--
                 fprog #1    fprog #1    fprog #2    fprog #2
                 cache-hit   cache-miss  cache-hit   cache-miss
    old BPF      90          101         192         202
    new BPF      31          71          47          97
    old BPF jit  12          34          17          44
    new BPF jit  TBD

    --i386--
                 fprog #1    fprog #1    fprog #2    fprog #2
                 cache-hit   cache-miss  cache-hit   cache-miss
    old BPF      107         136         227         252
    new BPF      40          119         69          172

    --arm32--
                 fprog #1    fprog #1    fprog #2    fprog #2
                 cache-hit   cache-miss  cache-hit   cache-miss
    old BPF      202         300         475         540
    new BPF      180         270         330         470
    old BPF jit  26          182         37          202
    new BPF jit  TBD

    Thus, without changing any userland BPF filters, applications on
    top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
    classifier, netfilter's xt_bpf, team driver's load-balancing mode,
    and many more will have better interpreter filtering performance.

    While we are replacing the internal BPF interpreter, we also need
    to convert seccomp BPF in the same step to make use of the new
    internal structure since it makes use of lower-level API details
    without being further decoupled through higher-level calls like
    sk_unattached_filter_{create,destroy}(), for example.

    Just as for normal socket filtering, also seccomp BPF experiences
    a time-to-verdict speedup:

    05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

    seccomp_rule_add_exact(ctx,...
    seccomp_rule_add_exact(ctx,...

    rc = seccomp_load(ctx);

    for (i = 0; i < 10000000; i++)
            syscall(199, 100);

    'short filter' has 2 rules
    'large filter' has 200 rules

    'short filter' performance is slightly better on x86_64/i386/arm32
    'large filter' is much faster on x86_64 and i386 and shows no
    difference on arm32

    --x86_64-- short filter
    old BPF: 2.7 sec
    39.12% bench libc-2.15.so [.] syscall
    8.10% bench [kernel.kallsyms] [k] sk_run_filter
    6.31% bench [kernel.kallsyms] [k] system_call
    5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
    4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
    3.70% bench [kernel.kallsyms] [k] __secure_computing
    3.67% bench [kernel.kallsyms] [k] lock_is_held
    3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
    new BPF: 2.58 sec
    42.05% bench libc-2.15.so [.] syscall
    6.91% bench [kernel.kallsyms] [k] system_call
    6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
    6.07% bench [kernel.kallsyms] [k] __secure_computing
    5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp

    --arm32-- short filter
    old BPF: 4.0 sec
    39.92% bench [kernel.kallsyms] [k] vector_swi
    16.60% bench [kernel.kallsyms] [k] sk_run_filter
    14.66% bench libc-2.17.so [.] syscall
    5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
    5.10% bench [kernel.kallsyms] [k] __secure_computing
    new BPF: 3.7 sec
    35.93% bench [kernel.kallsyms] [k] vector_swi
    21.89% bench libc-2.17.so [.] syscall
    13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    6.25% bench [kernel.kallsyms] [k] __secure_computing
    3.96% bench [kernel.kallsyms] [k] syscall_trace_exit

    --x86_64-- large filter
    old BPF: 8.6 seconds
    73.38% bench [kernel.kallsyms] [k] sk_run_filter
    10.70% bench libc-2.15.so [.] syscall
    5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
    1.97% bench [kernel.kallsyms] [k] system_call
    new BPF: 5.7 seconds
    66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    16.75% bench libc-2.15.so [.] syscall
    3.31% bench [kernel.kallsyms] [k] system_call
    2.88% bench [kernel.kallsyms] [k] __secure_computing

    --i386-- large filter
    old BPF: 5.4 sec
    new BPF: 3.8 sec

    --arm32-- large filter
    old BPF: 13.5 sec
    73.88% bench [kernel.kallsyms] [k] sk_run_filter
    10.29% bench [kernel.kallsyms] [k] vector_swi
    6.46% bench libc-2.17.so [.] syscall
    2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
    1.19% bench [kernel.kallsyms] [k] __secure_computing
    0.87% bench [kernel.kallsyms] [k] sys_getuid
    new BPF: 13.5 sec
    76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    10.98% bench [kernel.kallsyms] [k] vector_swi
    5.87% bench libc-2.17.so [.] syscall
    1.77% bench [kernel.kallsyms] [k] __secure_computing
    0.93% bench [kernel.kallsyms] [k] sys_getuid

    BPF filters generated by seccomp are very branchy, so the new
    internal BPF performance is better than the old one. Performance
    gains will be even higher when BPF JIT is committed for the
    new structure, which is planned in future work (as successive
    JIT migrations).

    BPF has also been stress-tested with trinity's BPF fuzzer.

    Joint work with Daniel Borkmann.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Cc: Hagen Paul Pfeifer
    Cc: Kees Cook
    Cc: Paul Moore
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: linux-kernel@vger.kernel.org
    Acked-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • There are currently pch_gbe, cpts, and ixp4xx_eth drivers that open-code
    and reimplement a BPF classifier for the PTP protocol. Since all of them
    effectively do the very same thing and load the very same PTP/BPF filter,
    we can just consolidate that code by introducing ptp_classify_raw() in
    the time-stamping core framework which can be used in drivers.

    As drivers get initialized after bootstrapping the core networking
    subsystem, they can make use of ptp_insns wrapped through
    ptp_classify_raw(), which allows simplifying and removing PTP classifier
    setup code in drivers.

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch replaces an open-coded sk_run_filter() implementation with
    proper use of the BPF API, that is, sk_unattached_filter_create(). This
    migration is needed, as we will be internally transforming the filter
    to a different representation, and the caller therefore needs to be
    decoupled from it.

    It is okay to do so as skb_timestamping_init() is called during
    initialization of the network stack in core initcall via sock_init().
    This would effectively also allow for PTP filters to be JIT compiled if
    bpf_jit_enable is set.

    For better readability, some newlines are introduced as well; note
    that ptp_classify.h is only in kernel space.

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch basically does two things, i) removes the extern keyword
    from the include/linux/filter.h file to be more consistent with the
    rest of Joe's changes, and ii) moves filter accounting into the filter
    core framework.

    Filter accounting, mainly done through sk_filter_{un,}charge(), takes
    care of the case when sockets are being cloned through sk_clone_lock(),
    so that removal of the filter on one socket won't result in eviction
    as it's still referenced by the other.
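
A minimal sketch of the reference-counting idea behind the charge/uncharge pair, under simplifying assumptions: a plain int stands in for an atomic counter, there is no RCU or socket memory accounting, and all `sketch_` names are hypothetical rather than kernel APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many filters were actually freed, so the sketch can be checked. */
static int sketch_filters_freed;

/* Hypothetical stand-ins for a filter and a socket. */
struct sketch_filter {
	int refcnt; /* plain int here; a real implementation needs an atomic */
};

struct sketch_sock {
	struct sketch_filter *filter;
};

static struct sketch_filter *sketch_filter_create(void)
{
	struct sketch_filter *fp = malloc(sizeof(*fp));

	if (fp)
		fp->refcnt = 0;
	return fp;
}

/* Attach fp to sk and take a reference on behalf of that socket. */
static void sketch_filter_charge(struct sketch_sock *sk, struct sketch_filter *fp)
{
	fp->refcnt++;
	sk->filter = fp;
}

/* Drop the socket's reference; the filter is freed only with the last one. */
static void sketch_filter_uncharge(struct sketch_sock *sk)
{
	struct sketch_filter *fp = sk->filter;

	if (!fp)
		return;
	sk->filter = NULL;
	assert(fp->refcnt > 0);
	if (--fp->refcnt == 0) {
		free(fp);
		sketch_filters_freed++;
	}
}

/* Cloning a socket shares the filter and charges it to the clone too. */
static void sketch_sock_clone(struct sketch_sock *dst, const struct sketch_sock *src)
{
	dst->filter = NULL;
	if (src->filter)
		sketch_filter_charge(dst, src->filter);
}
```

With this shape, removing the filter from the original socket leaves the clone's filter intact until the clone uncharges it as well.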

    These functions actually belong to net/core/filter.c and not
    include/net/sock.h as we want to keep all that in a central place.
    It's also not in fast-path so uninlining them is fine and even allows
    us to get rid of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
    declaration.

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In order to open up the possibility to internally transform a BPF program
    into an alternative and possibly not trivially reversible representation,
    we need to keep the original BPF program around, so that it can be passed
    back to user space w/o the need of a complex decoder.

    The reason for that use case resides in commit a8fc92778080 ("sk-filter:
    Add ability to get socket filter program (v2)"), that is, the ability
    to retrieve the currently attached BPF filter from a given socket used
    mainly by the checkpoint-restore project, for example.

    Therefore, we add two helpers sk_{store,release}_orig_filter for taking
    care of that. In the sk_unattached_filter_create() case, there's no such
    possibility/requirement to retrieve a loaded BPF program. Therefore, we
    can spare us the work in that case.

    This approach will simplify and slightly speed up both sk_get_filter()
    and sock_diag_put_filterinfo() handlers as we won't need to successively
    decode filters anymore through sk_decode_filter(). As we still need
    sk_decode_filter() later on, we're keeping it around.
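
The idea of keeping the original program next to its transformed form can be sketched as follows. This is an illustration only: the `sketch_` helpers are hypothetical userspace analogues of the store/release helpers named above, the instruction struct merely mirrors the classic BPF instruction layout, and the transformed representation itself is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same field layout as a classic BPF instruction, redeclared for the sketch. */
struct sketch_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

struct sketch_filter {
	struct sketch_insn *orig; /* untouched copy of what user space loaded */
	size_t orig_len;          /* number of instructions in orig */
	/* the internally transformed program would live here */
};

/* Keep a private copy of the user's program before transforming it. */
static int sketch_store_orig_filter(struct sketch_filter *fp,
				    const struct sketch_insn *prog, size_t len)
{
	struct sketch_insn *copy;

	if (len == 0)
		return -1;
	copy = malloc(len * sizeof(*copy));
	if (!copy)
		return -1;
	memcpy(copy, prog, len * sizeof(*copy));
	fp->orig = copy;
	fp->orig_len = len;
	return 0;
}

static void sketch_release_orig_filter(struct sketch_filter *fp)
{
	free(fp->orig);
	fp->orig = NULL;
	fp->orig_len = 0;
}

/*
 * Hand the program back: a straight copy of the stored original, with no
 * decoding step. Returns the instruction count; copies only if out is
 * large enough.
 */
static size_t sketch_get_filter(const struct sketch_filter *fp,
				struct sketch_insn *out, size_t cap)
{
	if (out && fp->orig_len && cap >= fp->orig_len)
		memcpy(out, fp->orig, fp->orig_len * sizeof(*out));
	return fp->orig_len;
}
```

Because the getter is a plain copy, it does not need to know anything about how the program was transformed internally.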

    Joint work with Alexei Starovoitov.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch adds a jited flag into sk_filter struct in order to indicate
    whether a filter is currently jited or not. The size of sk_filter is
    not being expanded as the 32 bit 'len' member allows upper bits to be
    reused since a filter can currently only grow as large as BPF_MAXINSNS.

    Therefore, there's also enough room for flags needed in the future to
    reuse the 'len' field if necessary. The jited flag also allows for having
    alternative interpreter functions running, as currently we can only
    detect JIT compiled filters by testing fp->bpf_func to not equal the
    address of sk_run_filter().
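
To illustrate why the flag costs no extra space, here is a sketch that carves one bit out of a 32-bit length word with C bitfields. The struct and helper names are hypothetical; the only facts taken as given are that the length is bounded by BPF_MAXINSNS (4096 for classic BPF) and therefore fits comfortably in 31 bits.

```c
#include <stdint.h>

#define SKETCH_BPF_MAXINSNS 4096u /* upper bound on filter length */

/* Hypothetical header: the flag and the length share one 32-bit word. */
struct sketch_filter_hdr {
	unsigned int jited : 1;  /* is the filter JIT compiled? */
	unsigned int len   : 31; /* number of instructions */
};

/* Adding the flag must not grow the structure. */
_Static_assert(sizeof(struct sketch_filter_hdr) == sizeof(uint32_t),
	       "flag and length must share one 32-bit word");

static int sketch_filter_init(struct sketch_filter_hdr *fp, unsigned int len)
{
	if (len == 0 || len > SKETCH_BPF_MAXINSNS)
		return -1;
	fp->len = len;
	fp->jited = 0;
	return 0;
}

static void sketch_mark_jited(struct sketch_filter_hdr *fp)
{
	fp->jited = 1;
}

/* Callers test the flag instead of comparing function pointers. */
static int sketch_is_jited(const struct sketch_filter_hdr *fp)
{
	return fp->jited;
}
```

Testing a flag also keeps working when more than one interpreter function exists, which a comparison against a single function address cannot do.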

    Joint work with Alexei Starovoitov.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Cc: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Mar, 2014

9 commits

  • Conflicts:
    drivers/net/ethernet/marvell/mvneta.c

    The mvneta.c conflict is a case of overlapping changes,
    a conversion to devm_ioremap_resource() vs. a conversion
    to netdev_alloc_pcpu_stats.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • ERROR: "(foo*)" should be "(foo *)"
    ERROR: "foo * bar" should be "foo *bar"

    Suggested-by: Sergei Shtylyov
    Signed-off-by: Wang Yufen
    Acked-by: Sergei Shtylyov
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • ERROR: open brace '{' following enum go on the same line
    ERROR: open brace '{' following struct go on the same line
    ERROR: trailing statements should be on next line

    Signed-off-by: Wang Yufen
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • WARNING: please, no space before tabs
    WARNING: please, no spaces at the start of a line
    ERROR: spaces required around that ':' (ctx:VxW)
    ERROR: spaces required around that '>' (ctx:VxV)
    ERROR: spaces required around that '>=' (ctx:VxV)

    Signed-off-by: Wang Yufen
    Signed-off-by: David S. Miller

    Wang Yufen
     
  • Stop taking the transmit lock when a network device has specified
    NETIF_F_LLTX.

    If no locks are needed to transmit a packet, this is the ideal scenario
    for netpoll, as all packets can be transmitted immediately.

    Even if some locks are needed in ndo_start_xmit, skipping any unnecessary
    serialization is desirable for netpoll, as it makes it more likely a
    debugging packet may be transmitted immediately instead of being
    deferred until later.
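
The locking decision can be modelled in a few lines. This is a toy model, not netpoll code: the flag value, the structures, and the trylock helpers are all hypothetical, and the "lock" is a boolean so the behaviour can be observed in a single thread.

```c
#include <stdbool.h>

#define SKETCH_F_LLTX 0x1u /* hypothetical bit standing in for NETIF_F_LLTX */

struct sketch_txq {
	bool locked;               /* toy lock state */
	unsigned int acquisitions; /* how often the lock was taken */
};

struct sketch_dev {
	unsigned int features;
	struct sketch_txq txq;
	unsigned int sent;
};

static bool sketch_txq_trylock(struct sketch_txq *q)
{
	if (q->locked)
		return false;
	q->locked = true;
	q->acquisitions++;
	return true;
}

static void sketch_txq_unlock(struct sketch_txq *q)
{
	q->locked = false;
}

/*
 * Returns true if the packet went out immediately, false if the caller
 * has to defer it. Devices advertising the lockless flag never touch the
 * queue lock, so a contended lock cannot delay them.
 */
static bool sketch_netpoll_xmit(struct sketch_dev *dev)
{
	bool need_lock = !(dev->features & SKETCH_F_LLTX);

	if (need_lock && !sketch_txq_trylock(&dev->txq))
		return false;

	dev->sent++; /* stands in for the driver's transmit routine */

	if (need_lock)
		sketch_txq_unlock(&dev->txq);
	return true;
}
```

In this model a lockless device transmits even while the queue lock is held elsewhere, whereas a regular device has to defer.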

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Remove the assumption that the skbs that make it to
    netpoll_send_skb_on_dev are allocated with find_skb, such that
    skb->users == 1 and nothing is attached that would prevent the skbs from
    being freed from hard irq context.

    Remove this assumption by replacing __kfree_skb on error paths with
    dev_kfree_skb_irq (in hard irq context) and kfree_skb (in process
    context).

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The netpoll_rx_enable and netpoll_rx_disable functions have always
    controlled polling the network driver's transmit and receive queues.

    Rename them to netpoll_poll_enable and netpoll_poll_disable to make
    their functionality clear.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Today netpoll_rx_enable and netpoll_rx_disable are called from
    dev_close and __dev_close, and not from dev_close_many.

    Move the calls into __dev_close_many so that we have a single call
    site to maintain, and so that dev_close_many gains this protection as
    well, which importantly makes batched network device deletes safe.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Factor out the code that needs to surround ndo_start_xmit
    from netpoll_send_skb_on_dev into netpoll_start_xmit.

    It is an unfortunate fact that as the netpoll code has been maintained,
    the primary call site of ndo_start_xmit learned how to handle vlans
    and timestamps but the second call of ndo_start_xmit in queue_process
    did not.

    With the introduction of netpoll_start_xmit this associated logic now
    happens at both call sites of ndo_start_xmit and should make it easy
    for that to continue into the future.
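
The refactoring pattern can be sketched in miniature. Everything here is hypothetical and heavily simplified: the packet is three flags, the "driver" rejects packets whose VLAN tag has not been written into the payload, and the two callers stand in for the immediate send path and the deferred queue path.

```c
/* Toy packet: only the state the wrapper cares about. */
struct sketch_skb {
	int vlan_tag_present; /* packet carries an out-of-band VLAN tag */
	int vlan_inlined;     /* tag has been written into the payload */
	int timestamped;      /* transmit timestamp hook has run */
};

/* Stand-in for a driver transmit routine that cannot handle out-of-band tags. */
static int sketch_driver_xmit(struct sketch_skb *skb)
{
	if (skb->vlan_tag_present && !skb->vlan_inlined)
		return -1;
	return 0;
}

/* The single wrapper: all per-packet preparation lives here. */
static int sketch_start_xmit(struct sketch_skb *skb)
{
	if (skb->vlan_tag_present && !skb->vlan_inlined)
		skb->vlan_inlined = 1;
	skb->timestamped = 1;
	return sketch_driver_xmit(skb);
}

/* Immediate send path. */
static int sketch_send_now(struct sketch_skb *skb)
{
	return sketch_start_xmit(skb);
}

/* Deferred path, draining packets that could not be sent right away. */
static int sketch_queue_process(struct sketch_skb *skb)
{
	return sketch_start_xmit(skb);
}
```

Since both paths go through the same wrapper, a fix or feature added there cannot reach one call site and miss the other.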

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman