15 Jun, 2014

1 commit

  • Geert reported issues regarding checksum complete and UDP.
    The logic introduced in commit 7e3cead5172927732f51fde
    ("net: Save software checksum complete") is not correct.

    This patch:
    1) Restores code in __skb_checksum_complete_header except for setting
    CHECKSUM_UNNECESSARY. This function may be calculating checksum on
    something less than skb->len.
    2) Adds saving checksum to __skb_checksum_complete. The full packet
    checksum 0..skb->len is calculated without adding in pseudo header.
    This value is saved in skb->csum and then the pseudo header is added
    to that to derive the checksum for validation.
    3) In both __skb_checksum_complete_header and __skb_checksum_complete,
    set skb->csum_valid to whether checksum of zero was computed. This
    allows skb_csum_unnecessary to return true without changing to
    CHECKSUM_UNNECESSARY which was done previously.
    4) Copy new csum related bits in __copy_skb_header.
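
    The flow described in points 2 and 3 can be sketched in plain C as
    follows. This is a userspace illustration only, not the kernel code:
    the struct and the *_sketch helper names are hypothetical, the
    pseudo-header sum is passed in by the caller, and packets are assumed
    to be smaller than 64 KiB.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint32_t csum_fold_sketch(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}

/* One's-complement sum of big-endian 16-bit words (len < 64 KiB assumed). */
static uint32_t csum_partial_sketch(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;   /* pad the odd byte with zero */
    return csum_fold_sketch(sum);
}

/* Hypothetical stand-in for the few sk_buff fields the text mentions. */
struct pkt_sketch {
    const uint8_t *data;
    size_t len;
    uint32_t csum;   /* saved sum over 0..len, pseudo header excluded */
    int csum_valid;  /* set when the validated checksum came out zero */
};

/* Returns 0 when the checksum validates, nonzero otherwise. */
static uint16_t checksum_complete_sketch(struct pkt_sketch *p,
                                         uint32_t pseudo_sum)
{
    uint16_t result;

    /* Save the full-packet sum first, without the pseudo header. */
    p->csum = csum_partial_sketch(p->data, p->len);

    /* Add the pseudo header only to derive the value used for validation. */
    result = (uint16_t)~csum_fold_sketch(p->csum +
                                         csum_fold_sketch(pseudo_sum));
    p->csum_valid = (result == 0);
    return result;
}
```

    Keeping the saved sum free of the pseudo header means a later
    verification against a different pseudo header could reuse it without
    walking the packet again.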

    Reported-by: Geert Uytterhoeven
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

13 Jun, 2014

2 commits

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     
  • When running RHEL6 userspace on a current upstream kernel, "ip link"
    fails to show VF information.

    The reason is a kernel<->userspace API change introduced by commit
    88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header length"),
    after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
    in the netlink request.

    iproute2 adjusted for the API change in its commit 63338dca4513
    ("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").

    The problem has been noticed before:
    http://marc.info/?l=linux-netdev&m=136692296022182&w=2
    (Subject: Re: getting VF link info seems to be broken in 3.9-rc8)

    We can do better than tell those with old userspace to upgrade. We can
    recognize the old iproute2 in the kernel by checking the netlink message
    length. Even when including the IFLA_EXT_MASK attribute, its netlink
    message is shorter than struct ifinfomsg.

    With this patch "ip link" shows VF information in both old and new
    iproute2 versions.
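
    A minimal sketch of the length check described above, assuming
    simplified stand-ins for struct rtgenmsg and struct ifinfomsg (the
    *_sketch names are hypothetical, and the actual patch may differ in
    detail): a request whose payload is shorter than the newer header is
    parsed with the older, smaller header length.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the header old iproute2 sends. */
struct rtgenmsg_sketch {
    uint8_t rtgen_family;
};

/* Simplified stand-in for the header newer iproute2 sends. */
struct ifinfomsg_sketch {
    uint8_t  ifi_family;
    uint8_t  ifi_pad;
    uint16_t ifi_type;
    int32_t  ifi_index;
    uint32_t ifi_flags;
    uint32_t ifi_change;
};

/*
 * Pick the header length to skip before parsing attributes.
 * payload_len is the netlink message length without the nlmsghdr.
 */
static size_t ext_mask_hdrlen_sketch(size_t payload_len)
{
    if (payload_len < sizeof(struct ifinfomsg_sketch))
        return sizeof(struct rtgenmsg_sketch);   /* old userspace */
    return sizeof(struct ifinfomsg_sketch);      /* current userspace */
}
```

    With these layouts an old-style request carrying only a padded
    one-byte header plus a single 32-bit attribute stays below the size
    of the larger header, which is what makes the check possible.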

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller

    Michal Schmidt
     

12 Jun, 2014

5 commits

  • Conflicts:
    net/core/rtnetlink.c
    net/core/skbuff.c

    Both conflicts were very simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state
    control to the netlink VF nested structure, but failed to add a proper entry
    for the new structure into the VF policy table. Add the missing entry so
    the table and the actual data copied into the netlink nested struct are in
    sync.

    Signed-off-by: Doug Ledford
    Signed-off-by: David S. Miller

    Doug Ledford
     
  • In skb_checksum complete, if we need to compute the checksum for the
    packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
    Subsequent checksum verification can use this.

    Also add a csum_complete_sw flag to distinguish between software- and
    hardware-generated checksum complete; we should always be able to trust
    the software computation.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • There are several instances where a pskb_copy or __pskb_copy is
    immediately followed by an skb_clone.

    Add a couple of new functions to allow the copy skb to be allocated
    from the fclone cache and thus speed up subsequent skb_clone calls.

    Cc: Alexander Smirnov
    Cc: Dmitry Eremin-Solenikov
    Cc: Marek Lindner
    Cc: Simon Wunderlich
    Cc: Antonio Quartulli
    Cc: Marcel Holtmann
    Cc: Gustavo Padovan
    Cc: Johan Hedberg
    Cc: Arvid Brodin
    Cc: Patrick McHardy
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Lauro Ramos Venancio
    Cc: Aloisio Almeida Jr
    Cc: Samuel Ortiz
    Cc: Jon Maloy
    Cc: Allan Stephens
    Cc: Andrew Hendry
    Cc: Eric Dumazet
    Reviewed-by: Christoph Paasch
    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     
  • fix compiler warning on 32-bit architectures:

    net/core/filter.c: In function '__sk_run_filter':
    net/core/filter.c:540:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    net/core/filter.c:550:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    net/core/filter.c:560:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

    Reported-by: Fengguang Wu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Jun, 2014

2 commits

  • This patch fixes a kernel BUG_ON in skb_segment. It is hit when
    testing two VMs on openvswitch with one VM acting as VXLAN gateway.

    During VXLAN packet GSO, skb_segment is called with skb->data
    pointing to inner TCP payload. skb_segment calls skb_network_protocol
    to retrieve the inner protocol. skb_network_protocol actually expects
    skb->data to point to MAC and it calls pskb_may_pull with ETH_HLEN.
    This ends up pulling in ETH_HLEN data from header tail. As a result,
    pskb_trim logic is skipped and BUG_ON is hit later.

    Move skb_push in front of skb_network_protocol so that skb->data
    lines up properly.

    kernel BUG at net/core/skbuff.c:2999!
    Call Trace:
    [] tcp_gso_segment+0x122/0x410
    [] inet_gso_segment+0x13c/0x390
    [] skb_mac_gso_segment+0x9b/0x170
    [] skb_udp_tunnel_segment+0xd8/0x390
    [] udp4_ufo_fragment+0x120/0x140
    [] inet_gso_segment+0x13c/0x390
    [] ? default_wake_function+0x12/0x20
    [] skb_mac_gso_segment+0x9b/0x170
    [] __skb_gso_segment+0x60/0xc0
    [] dev_hard_start_xmit+0x183/0x550
    [] sch_direct_xmit+0xfe/0x1d0
    [] __dev_queue_xmit+0x214/0x4f0
    [] dev_queue_xmit+0x10/0x20
    [] ip_finish_output+0x66b/0x890
    [] ip_output+0x58/0x90
    [] ? fib_table_lookup+0x29f/0x350
    [] ip_local_out_sk+0x39/0x50
    [] iptunnel_xmit+0x10d/0x130
    [] vxlan_xmit_skb+0x1d0/0x330 [vxlan]
    [] vxlan_tnl_send+0x129/0x1a0 [openvswitch]
    [] ovs_vport_send+0x26/0xa0 [openvswitch]
    [] do_output+0x2e/0x50 [openvswitch]

    Signed-off-by: Wei-Chun Chao
    Signed-off-by: David S. Miller

    Wei-Chun Chao
     
  • The macro 'A' used in internal BPF interpreter:
    #define A regs[insn->a_reg]
    was easily confused with the name of classic BPF register 'A', since
    'A' would mean two different things depending on context.

    This patch is trying to clean up the naming and clarify its usage in the
    following way:

    - A and X are names of two classic BPF registers

    - BPF_REG_A denotes internal BPF register R0 used to map classic register A
    in internal BPF programs generated from classic

    - BPF_REG_X denotes internal BPF register R7 used to map classic register X
    in internal BPF programs generated from classic

    - internal BPF instruction format:
    struct sock_filter_int {
    __u8 code; /* opcode */
    __u8 dst_reg:4; /* dest register */
    __u8 src_reg:4; /* source register */
    __s16 off; /* signed offset */
    __s32 imm; /* signed immediate constant */
    };

    - BPF_X/BPF_K is 1 bit used to encode source operand of instruction
    In classic:
    BPF_X - means use register X as source operand
    BPF_K - means use 32-bit immediate as source operand
    In internal:
    BPF_X - means use 'src_reg' register as source operand
    BPF_K - means use 32-bit immediate as source operand
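
    As an illustration of the BPF_X/BPF_K bit in the internal format, the
    sketch below picks the source operand of one instruction. The struct
    mirrors the layout quoted above using fixed-width types; the function
    name, the register array and the sign extension of the immediate are
    assumptions of this example, not taken from the patch.

```c
#include <stdint.h>

#define BPF_K 0x00                      /* source is the immediate */
#define BPF_X 0x08                      /* source is a register */
#define BPF_SRC(code) ((code) & 0x08)   /* extract the 1-bit source selector */

/* Userspace mirror of the instruction layout quoted above. */
struct insn_sketch {
    uint8_t code;          /* opcode */
    uint8_t dst_reg:4;     /* dest register */
    uint8_t src_reg:4;     /* source register */
    int16_t off;           /* signed offset */
    int32_t imm;           /* signed immediate constant */
};

/* Return the value the instruction would use as its source operand. */
static uint64_t src_operand_sketch(const struct insn_sketch *insn,
                                   const uint64_t *regs)
{
    if (BPF_SRC(insn->code) == BPF_X)
        return regs[insn->src_reg];

    /* BPF_K: use the immediate (sign-extended here as an assumption). */
    return (uint64_t)(int64_t)insn->imm;
}
```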

    Suggested-by: Chema Gonzalez
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Chema Gonzalez
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

09 Jun, 2014

2 commits

  • unregister_netdevice_many() API is error prone and we had too
    many bugs because of dangling LIST_HEAD on stacks.

    See commit f87e6f47933e3e ("net: dont leave active on stack LIST_HEAD")

    In fact, instead of making sure no caller leaves an active list_head,
    just force a list_del() in the callee. No one seems to need to access
    the list after unregister_netdevice_many()
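
    The idea can be shown with a tiny doubly linked list in userspace C.
    This is only a sketch: the *_sketch names are hypothetical and the
    per-device work is reduced to counting entries. The point is that the
    callee unlinks the caller's list head, so no entry keeps pointing at
    a head that lives on a stack frame about to disappear.

```c
#include <stddef.h>

struct list_node_sketch {
    struct list_node_sketch *next, *prev;
};

static void list_init_sketch(struct list_node_sketch *head)
{
    head->next = head->prev = head;
}

static void list_add_tail_sketch(struct list_node_sketch *node,
                                 struct list_node_sketch *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Unlink one node; its neighbours are joined to each other. */
static void list_del_sketch(struct list_node_sketch *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
}

/* Process every entry, then drop the (possibly on-stack) head itself. */
static int unregister_many_sketch(struct list_node_sketch *head)
{
    struct list_node_sketch *pos;
    int count = 0;

    for (pos = head->next; pos != head; pos = pos->next)
        count++;                 /* stand-in for per-device teardown */

    list_del_sketch(head);       /* callee removes the head, not the caller */
    return count;
}
```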

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

06 Jun, 2014

3 commits

  • Conflicts:
    drivers/net/xen-netback/netback.c
    net/core/filter.c

    A filter bug fix overlapped some cleanups and a conversion
    over to some new insn generation macros.

    A xen-netback bug fix overlapped the addition of multi-queue
    support.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since
    pkt_type_offset() was failing to find skb->pkt_type field which is defined as:
    __u8 pkt_type:3,
    fclone:2,
    ipvs_property:1,
    peeked:1,
    nf_trace:1;

    Fix it by searching for the 3 most significant bits and shifting them by 5 at run time.
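
    A small sketch of why the shift is needed, assuming the byte that
    holds the bitfield has already been located. Depending on how the
    compiler lays out bitfields, the 3-bit pkt_type sits either in the
    low bits or in the 3 most significant bits of that byte. The function
    name and the msb_first flag are hypothetical, used only for
    illustration.

```c
#include <stdint.h>

/*
 * Extract a 3-bit field that is the first member of a one-byte bitfield.
 * msb_first != 0: the field occupies the 3 most significant bits.
 * msb_first == 0: the field occupies the 3 least significant bits.
 */
static unsigned int pkt_type_from_byte_sketch(uint8_t byte, int msb_first)
{
    if (msb_first)
        return (byte & (7u << 5)) >> 5;   /* mask the top bits, shift by 5 */
    return byte & 7u;                     /* low bits need no shift */
}
```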

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • If an MPLS packet requires segmentation then use mpls_features
    to determine if the software implementation should be used.

    As no driver advertises MPLS GSO segmentation this will always be
    the case.

    I had not noticed that this was necessary before as software MPLS GSO
    segmentation was already being used in my test environment. I believe that
    the reason for that is the skbs in question always had fragments and the
    driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
    case for most drivers). Thus software segmentation was activated by
    skb_gso_ok().

    This introduces the overhead of an extra call to skb_network_protocol()
    in the case where CONFIG_NET_MPLS_GSO is set and
    skb->ip_summed == CHECKSUM_NONE.

    Thanks to Jesse Gross for prompting me to investigate this.

    Signed-off-by: Simon Horman
    Acked-by: YAMAMOTO Takashi
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Simon Horman
     

05 Jun, 2014

2 commits

  • It is available since v3.15-rc5.

    Cc: Pablo Neira Ayuso
    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • When creating a GSO packet segment we may need to set more than
    one checksum in the packet (for instance a TCP checksum and
    UDP checksum for VXLAN encapsulation). To be efficient, we want
    to do checksum calculation for any part of the packet at most once.

    This patch adds csum_start offset to skb_gso_cb. This tracks the
    starting offset for skb->csum which is initially set in skb_segment.
    When a protocol needs to compute a transport checksum it calls
    gso_make_checksum which computes the checksum value from the start
    of transport header to csum_start and then adds in skb->csum to get
    the full checksum. skb->csum and csum_start are then updated to reflect
    the checksum of the resultant packet starting from the transport header.

    This patch also adds a flag to skbuff, encap_hdr_csum, which is set
    in *gso_segment functions to indicate that a tunnel protocol needs
    checksum calculation.
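
    The "at most once" property relies on one's-complement sums being
    additive: a sum over a header can be combined with an already known
    sum over everything after it. The sketch below shows only that
    bookkeeping. It is a simplification under stated assumptions: the
    struct and function names are hypothetical, regions have even length,
    and pseudo headers as well as the effect of writing the checksum
    field back into the packet are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint32_t fold16_sketch(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}

/* One's-complement sum of big-endian 16-bit words; len must be even. */
static uint32_t sum16_sketch(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    return fold16_sketch(sum);
}

/* Hypothetical per-segment state: csum covers data[csum_start .. end). */
struct gso_seg_sketch {
    const uint8_t *data;
    size_t csum_start;
    uint32_t csum;
};

/*
 * Extend the known sum backwards to the transport header at thoff and
 * return the complemented 16-bit checksum for that header.
 */
static uint16_t gso_make_checksum_sketch(struct gso_seg_sketch *seg,
                                         size_t thoff)
{
    /* Only the bytes not yet covered are summed. */
    uint32_t hdr = sum16_sketch(seg->data + thoff, seg->csum_start - thoff);

    seg->csum = fold16_sketch(hdr + seg->csum);
    seg->csum_start = thoff;
    return (uint16_t)~seg->csum;
}
```

    Calling the helper first for the inner transport header and then for
    the outer one touches each byte of the segment once in total.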

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

04 Jun, 2014

4 commits

  • Conflicts:
    include/net/inetpeer.h
    net/ipv6/output_core.c

    Changes in net were fixing bugs in code removed in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When we jump to free_pcpu on failure in alloc_netdev_mqs()
    rx and tx queues are not yet allocated, so no need to free them.

    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • It is possible that ->newlink() fails before registering
    the device, in this case we should just free it, it's
    safe to call free_netdev().

    Fixes: commit 0e0eee2465df77bcec2 (net: correct error path in rtnl_newlink())
    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     

03 Jun, 2014

6 commits

  • Ben Hutchings says:

    ====================
    Pull request: Fixes for new ethtool RSS commands

    This addresses several problems I previously identified with the new
    ETHTOOL_{G,S}RSSH commands:

    1. Missing validation of reserved parameters
    2. Vague documentation
    3. Use of unnamed magic number
    4. No consolidation with existing driver operations

    I don't currently have access to suitable network hardware, but have
    tested these changes with a dummy driver that can support various
    combinations of operations and sizes, together with (a) Debian's ethtool
    3.13 (b) ethtool 3.14 with the submitted patch to use ETHTOOL_{G,S}RSSH
    and minor adjustment for fixes 1 and 3.

    v2: Update RSS operations in vmxnet3 too
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We should fail rather than silently ignoring use of these extensions.

    Signed-off-by: Ben Hutchings

    Ben Hutchings
     
  • ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
    regardless of whether they expose the hash key, unless you try to
    set a hash key for a driver that doesn't expose it.

    Signed-off-by: Ben Hutchings
    Acked-by: Jeff Kirsher

    Ben Hutchings
     
  • __sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
    rework/optimize internal BPF interpreter's instruction set) so that it should
    have uncharged memory once things went wrong. However that work isn't complete.
    Error is handled only in __sk_migrate_filter() while memory can still leak in
    the error path right after sk_chk_filter().

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Leon Yu
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Leon Yu
     
  • Ideally, we would need to generate IP ID using a per destination IP
    generator.

    Linux kernels used the inet_peer cache for this purpose, but this had a huge
    cost on servers disabling MTU discovery.

    1) each inet_peer struct consumes 192 bytes

    2) inetpeer cache uses a binary tree of inet_peer structs,
    with a nominal size of ~66000 elements under load.

    3) lookups in this tree are hitting a lot of cache lines, as tree depth
    is about 20.

    4) If server deals with many tcp flows, we have a high probability of
    not finding the inet_peer, allocating a fresh one, inserting it in
    the tree with same initial ip_id_count, (cf secure_ip_id())

    5) We garbage collect inet_peer aggressively.

    IP ID generation does not have to be 'perfect'.

    Goal is trying to avoid duplicates in a short period of time,
    so that reassembly units have a chance to complete reassembly of
    fragments belonging to one message before receiving other fragments
    with a recycled ID.

    We simply use an array of generators, and a Jenkins hash using the dst IP
    as a key.
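
    A rough userspace sketch of that scheme follows. The table size, the
    names, and the choice of Jenkins' one-at-a-time hash are assumptions
    made for illustration; the sketch is also not thread safe, whereas a
    real implementation would need atomic updates of the slot.

```c
#include <stdint.h>

#define IP_IDENTS_SZ_SKETCH 2048u   /* hypothetical size, power of two */

static uint32_t ip_idents_sketch[IP_IDENTS_SZ_SKETCH];

/* Jenkins one-at-a-time hash over the 4 bytes of an IPv4 address. */
static uint32_t jenkins_oaat_sketch(uint32_t key)
{
    uint32_t h = 0;
    int i;

    for (i = 0; i < 4; i++) {
        h += (key >> (8 * i)) & 0xff;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/*
 * Reserve 'segs' consecutive IDs for packets to daddr and return the first.
 * Destinations that hash to the same slot share one generator.
 */
static uint16_t select_ident_sketch(uint32_t daddr, uint32_t segs)
{
    uint32_t idx = jenkins_oaat_sketch(daddr) & (IP_IDENTS_SZ_SKETCH - 1);
    uint32_t id = ip_idents_sketch[idx];

    ip_idents_sketch[idx] = id + segs;
    return (uint16_t)id;
}
```

    Sharing a generator between colliding destinations is acceptable
    here because the goal is only to avoid reusing an ID towards the same
    destination within a short time.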

    ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
    belongs (it is only used from this file)

    secure_ip_id() and secure_ipv6_id() no longer are needed.

    Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
    unnecessary decrement/increment of the number of segments.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This change provides a function to be used in order to break the
    ndo_set_rx_mode call into a set of address add and remove calls. The code
    is based on the implementation of dev_uc_sync/dev_mc_sync. Since they
    essentially do the same thing but with only one dev I simply named my
    functions __dev_uc_sync/__dev_mc_sync.

    I also implemented an unsync version of the functions as well to allow for
    cleanup on close.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

02 Jun, 2014

3 commits

  • Commit 9739eef13c92 ("net: filter: make BPF conversion more readable")
    started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
    macros from classic BPF.

    However, quite a few statements in the filter conversion functions
    remained in the old style, which gives a mixture of block macros and
    non-block macros in the code. This patch makes the block macros themselves
    more readable by using explicit member initialization, and converts
    the remaining ones where possible to leave the code in a more consistent state.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch finally allows us to get rid of the BPF_S_* enum.
    Currently, the code performs unnecessary encode and decode
    workarounds in seccomp and filter migration itself when a filter
    is being attached in order to overcome BPF_S_* encoding which
    is not used anymore by the new interpreter resp. JIT compilers.

    Keeping it around would mean that also in future we would need
    to extend and maintain this enum and related encoders/decoders.
    We can get rid of all that and save us these operations during
    filter attaching. Naturally, also JIT compilers need to be updated
    by this.

    Before JIT conversion is being done, each compiler checks if A
    is being loaded at startup to obtain information if it needs to
    emit instructions to clear A first. Since BPF extensions are a
    subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
    for extensions can be removed at that point. To ease and minimize
    code changes in the classic JITs, we have introduced bpf_anc_helper().

    Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
    arm (JIT, int), i386 (int), ppc64 (JIT, int); for sparc we
    unfortunately didn't have access, but changes are analogous to
    the rest.

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Mircea Gherzan
    Cc: Kees Cook
    Acked-by: Chema Gonzalez
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • After 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol") skb->mac_len is used as a start of the
    calculation in skb_network_protocol() but that is not always correct. If
    skb->protocol == 8021Q/AD, usually the vlan header is already inserted
    in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
    dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
    destination mac address (skb->data + VLAN_HLEN) as a type in
    skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
    off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
    skb_network_protocol() returns a type from inside the packet and
    offset == 22. Also make vlan_depth unsigned as suggested before.
    As suggested by Eric Dumazet, move the while() loop in the if() so we
    can avoid additional testing in fast path.
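
    The walk over stacked VLAN headers can be sketched on a raw frame
    buffer as below. This is a userspace illustration, not the kernel
    function: it always starts at the Ethernet header instead of using
    skb->protocol and skb->mac_len, so an untagged frame yields a depth
    of 14 here, and the *_SKETCH constants and function names are
    hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_SKETCH      14       /* dst MAC + src MAC + ethertype */
#define VLAN_HLEN_SKETCH     4        /* TCI + encapsulated ethertype */
#define ETH_P_8021Q_SKETCH   0x8100
#define ETH_P_8021AD_SKETCH  0x88A8

static int is_vlan_type_sketch(uint16_t type)
{
    return type == ETH_P_8021Q_SKETCH || type == ETH_P_8021AD_SKETCH;
}

/*
 * Return the ethertype of the network-layer payload (host byte order)
 * and store the offset of that payload in *depth.
 * Returns 0 if the frame is too short to parse.
 */
static uint16_t network_protocol_sketch(const uint8_t *frame, size_t len,
                                        unsigned int *depth)
{
    unsigned int off = ETH_HLEN_SKETCH;
    uint16_t type;

    if (len < ETH_HLEN_SKETCH)
        return 0;
    type = (uint16_t)((frame[12] << 8) | frame[13]);

    /* Each VLAN header carries the next ethertype in its last 2 bytes. */
    while (is_vlan_type_sketch(type)) {
        if (len < off + VLAN_HLEN_SKETCH)
            return 0;
        type = (uint16_t)((frame[off + 2] << 8) | frame[off + 3]);
        off += VLAN_HLEN_SKETCH;
    }

    *depth = off;
    return type;
}
```

    For a frame with a single 802.1Q tag carrying IPv4 this gives a depth
    of 18 (ETH_HLEN + VLAN_HLEN) and type 0x800, the same pair reported
    in the "After the fix" log in this message.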

    Here are a few netperf tests + debug printk's to illustrate:
    cat netperf.tso-on.reorder-on.bugged
    - Vlan -> device (reorder on, default, this case is okay)
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.00 7111.54
    [ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
    skb->mac_len 0 vlan_depth 0 type 0x800

    - Vlan -> device (reorder off, bad)
    cat netperf.tso-on.reorder-off.bugged
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.00 241.35
    [ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 4 type 0x5301
    0x5301 are the last two bytes of the destination mac.

    And if we stop TSO, we may get even the following:
    [ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 18 vlan_depth 22 type 0xb84
    Because mac_len already accounts for VLAN_HLEN.

    After the fix:
    cat netperf.tso-on.reorder-off.fixed
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.01 5001.46
    [ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 18 type 0x800

    CC: Vlad Yasevich
    CC: Eric Dumazet
    CC: Daniel Borkmann
    CC: David S. Miller

    Fixes: 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

31 May, 2014

1 commit

  • Export the symbols to fix the below errors when built as modules:
    ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
    ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
    ERROR: "tso_start" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
    ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
    ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
    ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
    ERROR: "tso_start" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
    ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!

    Signed-off-by: Sachin Kamat
    Acked-by: Ezequiel Garcia
    Signed-off-by: David S. Miller

    Sachin Kamat
     

24 May, 2014

5 commits

  • Conflicts:
    drivers/net/bonding/bond_alb.c
    drivers/net/ethernet/altera/altera_msgdma.c
    drivers/net/ethernet/altera/altera_sgdma.c
    net/ipv6/xfrm6_output.c

    Several cases of overlapping changes.

    The xfrm6_output.c has a bug fix which overlaps the renaming
    of skb->local_df to skb->ignore_df.

    In the Altera TSE driver cases, the register access cleanups
    in net-next overlapped with bug fixes done in net.

    Similarly a bug fix to send ALB packets in the bonding driver using
    the right source address overlaps with cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The sk_unattached_filter_create() API is used by BPF filters that
    are not directly attached or related to sockets, and are used in
    team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
    internal management of obtaining filter blocks and thus already
    have them in kernel memory and set up before calling into
    sk_unattached_filter_create(). As a result, due to __user annotation
    in sock_fprog, sparse triggers false positives (incorrect type in
    assignment [different address space]) when filters are set up before
    passing them to sk_unattached_filter_create(). Therefore, let
    sk_unattached_filter_create() API use sock_fprog_kern to overcome
    this issue.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Let's get rid of this macro. After commit 5bcfedf06f7f ("net: filter:
    simplify label names from jump-table"), labels have become more
    readable due to omission of BPF_ prefix but at the same time more
    generic, so that things like `git grep -n` would not find them. As
    a middle path, let's get rid of the DL macro as it's not strictly
    needed and would otherwise just hide the full name.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Define separate fields in the sock structure for disabling checksums
    on TX and on RX: sk_no_check_tx and sk_no_check_rx.
    The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
    remove the UDP_CSUM_* defines since they are no longer necessary.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • o min_tx_rate puts a lower limit on the VF bandwidth. The VF is
    guaranteed to have a bandwidth of at least this value.
    max_tx_rate puts a cap on the VF bandwidth. The VF can have a
    bandwidth of up to this value.

    o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced
    which takes 4 arguments:
    netdev, VF number, min_tx_rate, max_tx_rate

    o ndo_set_vf_rate replaces the ndo_set_vf_tx_rate handler.

    o Drivers that currently implement ndo_set_vf_tx_rate should now
    implement ndo_set_vf_rate instead and reject attempts to set a
    minimum bandwidth greater than 0 for IFLA_VF_TX_RATE when
    IFLA_VF_RATE is not yet implemented by the driver.

    o If the user enters only one of min_tx_rate or max_tx_rate, then
    userland should read back the other value from the driver and set
    both for IFLA_VF_RATE.
    Drivers that have not yet implemented IFLA_VF_RATE should always
    return min_tx_rate as 0 when read from the ip tool.

    o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then
    IFLA_VF_RATE should override.

    o The idea is to have a consistent display of rate values to the user.

    o Usage example:

    ./ip link set p4p1 vf 0 rate 900

    ./ip link show p4p1
    32: p4p1: mtu 1500 qdisc noop state DOWN mode
    DEFAULT qlen 1000
    link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps
    vf 1 MAC f6:c6:7c:3f:3d:6c
    vf 2 MAC 56:32:43:98:d7:71
    vf 3 MAC d6:be:c3:b5:85:ff
    vf 4 MAC ee:a9:9a:1e:19:14
    vf 5 MAC 4a:d0:4c:07:52:18
    vf 6 MAC 3a:76:44:93:62:f9
    vf 7 MAC 82:e9:e7:e3:15:1a

    ./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200

    ./ip link show p4p1
    32: p4p1: mtu 1500 qdisc noop state DOWN mode
    DEFAULT qlen 1000
    link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps,
    min_tx_rate 200Mbps
    vf 1 MAC f6:c6:7c:3f:3d:6c
    vf 2 MAC 56:32:43:98:d7:71
    vf 3 MAC d6:be:c3:b5:85:ff
    vf 4 MAC ee:a9:9a:1e:19:14
    vf 5 MAC 4a:d0:4c:07:52:18
    vf 6 MAC 3a:76:44:93:62:f9
    vf 7 MAC 82:e9:e7:e3:15:1a

    ./ip link set p4p1 vf 0 max_tx_rate 600 rate 300

    ./ip link show p4p1
    32: p4p1: mtu 1500 qdisc noop state DOWN mode
    DEFAULT qlen 1000
    link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 600 (Mbps), max_tx_rate 600Mbps,
    min_tx_rate 200Mbps
    vf 1 MAC f6:c6:7c:3f:3d:6c
    vf 2 MAC 56:32:43:98:d7:71
    vf 3 MAC d6:be:c3:b5:85:ff
    vf 4 MAC ee:a9:9a:1e:19:14
    vf 5 MAC 4a:d0:4c:07:52:18
    vf 6 MAC 3a:76:44:93:62:f9
    vf 7 MAC 82:e9:e7:e3:15:1a
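
The rules listed above can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the rtnetlink or driver code: the types vf_rate_sketch and vf_rate_req_sketch, both function names, and the choice of -EINVAL as the rejection code are hypothetical. The three ip commands from the usage example map onto three requests.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-VF state as a driver might keep it, in Mbps. */
struct vf_rate_sketch {
    int min_tx_rate;
    int max_tx_rate;
};

/* One request: legacy "rate" (IFLA_VF_TX_RATE) and/or the new
 * min_tx_rate / max_tx_rate values (IFLA_VF_RATE). */
struct vf_rate_req_sketch {
    bool has_legacy_rate;
    int  legacy_rate;
    bool has_min;
    int  min_tx_rate;
    bool has_max;
    int  max_tx_rate;
};

/* Driver side: a driver without minimum-rate support rejects min > 0. */
static int driver_set_vf_rate_sketch(struct vf_rate_sketch *vf,
                                     bool min_supported, int min, int max)
{
    if (min > 0 && !min_supported)
        return -EINVAL;  /* assumed error code */
    vf->min_tx_rate = min;
    vf->max_tx_rate = max;
    return 0;
}

/* Caller side: start from the values read back from the driver, apply
 * the legacy rate, then let the new-style values override it. */
static int apply_vf_rate_sketch(struct vf_rate_sketch *vf,
                                bool min_supported,
                                const struct vf_rate_req_sketch *req)
{
    int min, max;

    assert(vf != NULL && req != NULL);
    min = vf->min_tx_rate;
    max = vf->max_tx_rate;

    if (req->has_legacy_rate)
        max = req->legacy_rate;
    if (req->has_min)
        min = req->min_tx_rate;
    if (req->has_max)
        max = req->max_tx_rate;

    return driver_set_vf_rate_sketch(vf, min_supported, min, max);
}
```

Applying "rate 900", then "max_tx_rate 300 min_tx_rate 200", then "max_tx_rate 600 rate 300" to this model ends with a maximum of 600 and a minimum of 200, matching the output shown in the usage example.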

    Signed-off-by: Sucheta Chakraborty
    Signed-off-by: David S. Miller

    Sucheta Chakraborty
     

23 May, 2014

1 commit


22 May, 2014

1 commit

  • Kernel API for classic BPF socket filters is:

    sk_unattached_filter_create() - validate classic BPF, convert, JIT
    SK_RUN_FILTER() - run it
    sk_unattached_filter_destroy() - destroy socket filter

    Clean up the internal BPF kernel API as follows:

    sk_filter_select_runtime() - final step of internal BPF creation.
    Try to JIT the internal BPF program; if JIT is not available, select
    the interpreter
    SK_RUN_FILTER() - run it
    sk_filter_free() - free internal BPF program

    Disallow direct calls to the BPF interpreter. Execution of a BPF program
    should be done with the SK_RUN_FILTER() macro.

    Example of internal BPF create, run, destroy:

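    struct sk_filter *fp;

    /* room for the sk_filter header plus prog_len instructions */
    fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
    if (!fp)
            return -ENOMEM;
    memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
    fp->len = prog_len;

    /* pick the JIT if available, else the interpreter */
    sk_filter_select_runtime(fp);

    SK_RUN_FILTER(fp, ctx);

    sk_filter_free(fp);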

    Sockets, seccomp, the testsuite and tracing populate sk_filter in
    different ways, so the first steps of program creation are not shared.
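
The create, select, run, free sequence can also be modeled outside the kernel to show why callers only ever go through the run macro. Everything below is a hypothetical user-space sketch: the *_sketch names stand in for sk_filter, sk_filter_select_runtime() and SK_RUN_FILTER(), the "JIT" never succeeds, and defaulting to the interpreter before trying the JIT is an assumption of this model.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-field instruction; real internal BPF is richer. */
struct insn_sketch {
    unsigned int k;
};

/* Stand-in for struct sk_filter: entry point, length, instructions. */
struct filter_sketch {
    unsigned int (*run)(const void *ctx, const struct insn_sketch *insns);
    unsigned int len;
    struct insn_sketch insns[];
};

/* Interpreter stand-in: "executes" by returning the first insn's k. */
static unsigned int interp_sketch(const void *ctx,
                                  const struct insn_sketch *insns)
{
    (void)ctx;
    return insns[0].k;
}

/* JIT stand-in: this sketch has no JIT, so run is left untouched. */
static void try_jit_sketch(struct filter_sketch *fp)
{
    (void)fp;
}

/* Final creation step: default to the interpreter, then let the JIT
 * replace the entry point if it can. */
static void select_runtime_sketch(struct filter_sketch *fp)
{
    fp->run = interp_sketch;
    try_jit_sketch(fp);
}

/* Callers never invoke the interpreter directly, only this macro. */
#define RUN_FILTER_SKETCH(fp, ctx) ((fp)->run((ctx), (fp)->insns))

/* Create, run and destroy; returns the filter result, or 0 if the
 * allocation fails. */
static unsigned int create_run_destroy_sketch(const struct insn_sketch *prog,
                                              unsigned int prog_len,
                                              const void *ctx)
{
    struct filter_sketch *fp;
    unsigned int ret;

    assert(prog != NULL && prog_len > 0);
    fp = calloc(1, sizeof(*fp) + prog_len * sizeof(fp->insns[0]));
    if (!fp)
        return 0;
    memcpy(fp->insns, prog, prog_len * sizeof(fp->insns[0]));
    fp->len = prog_len;

    select_runtime_sketch(fp);
    ret = RUN_FILTER_SKETCH(fp, ctx);
    free(fp);
    return ret;
}
```

Because the entry point is chosen once and stored in the filter, the call site stays the same whether the program ends up JITed or interpreted.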

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

19 May, 2014

1 commit