02 Mar, 2015

1 commit

  • Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
    remove the test stubs which were added to get the verifier suite up.

    We can just let the test cases probe under socket filter type instead.
    In the fill/spill test case, we cannot (yet) access fields from the
    context (skb), but we may adapt that test case in future.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Mar, 2015

11 commits

  • Ursula Braun says:

    ====================
    s390: network patches for net-next

    here are some s390 related patches for net-next
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • remove Frank Blaschka as S390 NETWORK DRIVERS maintainer

    Acked-by: Frank Blaschka
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Ursula Braun
     
  • This patch adjusts two instances where we were using the (too big)
    struct qeth_ipacmd_setadpparms size instead of the commands' actual
    size. This didn't do any harm, but wasted a few bytes.

    Signed-off-by: Stefan Raspl
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Stefan Raspl
     
  • claw devices are outdated and no longer supported.
    This patch removes the claw driver.

    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Ursula Braun
     
  • tcp_fastopen_create_child() is static and should not be exported.

    tcp4_gso_segment() and tcp6_gso_segment() should be static.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This adds support for reporting the actual and maximum combined channels
    count of the hv_netvsc driver via 'ethtool --show-channels'.

    This required adding 'max_chn' to 'struct netvsc_device', and assigning
    it 'rsscap.num_recv_que' in 'rndis_filter_device_add'. Now we can access
    the combined maximum channel count via 'struct netvsc_device' in the
    ethtool callback.

    Signed-off-by: Andrew Schwartzmeyer
    Signed-off-by: Haiyang Zhang
    Signed-off-by: David S. Miller

    Andrew Schwartzmeyer
     
  • Eric Dumazet says:

    ====================
    tcp: tso improvements

    This patch series reworks tcp_tso_should_defer() a bit
    to get less bursts, and better ECN behavior.

    We also removed tso_deferred field in tcp socket.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Another TCP issue is triggered by ECN.

    Under pressure, the receiver gets ECN marks and sends back ACK packets
    with the ECE TCP flag set. Senders enter CA_CWR state.

    In this state, tcp_tso_should_defer() is short cut :

    if (icsk->icsk_ca_state != TCP_CA_Open)
            goto send_now;

    This means that nearly all ACK packets we receive trigger
    a partial send, and because cwnd is kept small, we can only send
    a small amount of data for each incoming ACK,
    which in turn generates more ACK packets.

    Allowing CA_Open and CA_CWR states to enable TSO defer in
    tcp_tso_should_defer() brings performance back :
    TSO autodefer has more chance to defer under pressure.

    This patch increases TSO and LRO/GRO efficiency back to normal levels,
    and does not impact overall ECN behavior.
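
    The state check described above can be sketched as follows. This is only
    an illustration: the enum and both function names are hypothetical
    stand-ins, not the kernel's definitions, and the real condition in
    tcp_tso_should_defer() may be written differently.

```c
#include <stdbool.h>

/* Illustrative congestion-avoidance states. The names mirror the commit
 * text; the enum itself is a stand-in, not the kernel's definition. */
enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Old gate: only CA_Open may defer, so CA_CWR always sends immediately. */
bool may_defer_old(enum ca_state s)
{
	return s == CA_OPEN;
}

/* New gate as described in the commit text: CA_Open and CA_CWR may defer. */
bool may_defer_new(enum ca_state s)
{
	return s == CA_OPEN || s == CA_CWR;
}
```

    With the old gate a sender in CA_CWR never defers; with the new one it
    can, which is what gives TSO autodefer a chance under ECN pressure.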

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    With sysctl_tcp_min_tso_segs being 4, it is very possible
    that tcp_tso_should_defer() decides not to send the last 2 MSS
    of an initial window of 10 packets. This also applies if
    autosizing decides to send X MSS per GSO packet, and cwnd
    is not a multiple of X.

    This patch implements a heuristic based on the age of the first
    skb in write queue : If it was sent very recently (less than half srtt),
    we can predict that no ACK packet will come in less than half rtt,
    so deferring might cause an under utilization of our window.

    This is visible on initial send (IW10) on web servers,
    but more generally on some RPC, as the last part of the message
    might need an extra RTT to get delivered.
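
    A minimal sketch of the heuristic, assuming times are kept in
    microseconds. The function and parameter names are hypothetical and the
    kernel's actual bookkeeping differs; the second helper only shows where
    the "last 2 MSS" come from (10 segments cut into bursts of 4).

```c
#include <stdbool.h>
#include <stdint.h>

/* True when deferring is likely to waste window: the head of the write
 * queue was sent less than half a smoothed RTT ago, so no ACK is expected
 * for at least another half RTT. Both arguments are in microseconds. */
bool head_sent_recently(uint32_t head_age_us, uint32_t srtt_us)
{
	return head_age_us < srtt_us / 2;
}

/* Segments left over when cwnd is cut into bursts of segs_per_burst.
 * segs_per_burst must be non-zero. */
uint32_t leftover_segs(uint32_t cwnd, uint32_t segs_per_burst)
{
	return cwnd % segs_per_burst;
}
```

    For example, leftover_segs(10, 4) is 2, and with a 100 ms srtt a head
    skb sent 10 us ago counts as recent, so the tail would be sent right away.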

    Tested:

    Ran following packetdrill test
    // A simple server-side test that sends exactly an initial window (IW10)
    // worth of packets.

    `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +.1 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +.1 < . 1:1(0) ack 1 win 257
    +0 accept(3, ..., ...) = 4

    +0 write(4, ..., 14600) = 14600
    +0 > . 1:5841(5840) ack 1 win 457
    +0 > . 5841:11681(5840) ack 1 win 457
    // Following packet should be sent right now.
    +0 > P. 11681:14601(2920) ack 1 win 457

    +.1 < . 1:1(0) ack 14601 win 257

    +0 close(4) = 0
    +0 > F. 14601:14601(0) ack 1
    +.1 < F. 1:1(0) ack 14602 win 257
    +0 > . 14602:14602(0) ack 2

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    TSO relies on the ability to defer sending a small number of packets.
    Heuristic is to wait for future ACKS in hope to send more packets at once.
    Current algorithm uses a per socket tso_deferred field as a pseudo timer.

    This pseudo timer relies on future ACK, but there is no guarantee
    we receive them in time.

    A fix would be to use a real timer, but such a timer is probably too
    expensive for typical cases.

    This patch changes the logic to test the time of last transmit,
    because we should not add bursts of more than 1ms for any given flow.

    We've used this patch for about two years at Google, before FQ/pacing
    as it would reduce a fair amount of bursts.
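
    The last-transmit test can be sketched like this, assuming a microsecond
    clock. The names and the constant are hypothetical; the kernel tracks
    time in its own units, so treat this as an illustration of the idea only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound on how long a flow may hold back data, taken from the
 * "no bursts of more than 1ms" statement in the commit text. */
#define MAX_DEFER_US 1000u

/* True when more than MAX_DEFER_US has passed since the last transmit,
 * meaning the sender should stop deferring and send now.
 * Assumes now_us >= last_xmit_us. */
bool burst_window_elapsed(uint64_t now_us, uint64_t last_xmit_us)
{
	return now_us - last_xmit_us > MAX_DEFER_US;
}
```

    Unlike the tso_deferred pseudo timer, this check needs no future ACK to
    arrive: it only compares the current time against a stored timestamp.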

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Currently the usbnet core does not update the tx_packets statistic for
    drivers with FLAG_MULTI_PACKET and there is no hook in the TX
    completion path where they could do this.

    cdc_ncm and dependent drivers are bumping tx_packets stat on the
    transmit path while asix and sr9800 aren't updating it at all.

    Add a packet count in struct skb_data so these drivers can fill it
    in, initialise it to 1 for other drivers, and add the packet count
    to the tx_packets statistic on completion.
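
    A rough sketch of the accounting described above. The struct and
    function names are hypothetical simplifications, not the usbnet
    definitions.

```c
#include <stdint.h>

/* Per-skb bookkeeping; stand-in for the driver-private skb data. */
struct skb_data_sketch {
	uint32_t packets;	/* network packets carried by this URB */
	uint32_t length;	/* bytes carried by this URB */
};

/* Device statistics; stand-in for the netdev counters. */
struct stats_sketch {
	uint64_t tx_packets;
	uint64_t tx_bytes;
};

/* Core default: one packet per URB unless the driver says otherwise. */
void tx_prepare(struct skb_data_sketch *e, uint32_t length)
{
	e->packets = 1;
	e->length = length;
}

/* Multi-packet drivers overwrite the count for aggregated frames. */
void set_packets(struct skb_data_sketch *e, uint32_t n)
{
	e->packets = n;
}

/* On successful TX completion, add the per-skb count to the statistics. */
void tx_complete_ok(struct stats_sketch *s, const struct skb_data_sketch *e)
{
	s->tx_packets += e->packets;
	s->tx_bytes += e->length;
}
```

    Counting on completion rather than on the transmit path keeps the
    statistic consistent across drivers, whether or not they aggregate.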

    Signed-off-by: Ben Hutchings
    Tested-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Ben Hutchings
     

28 Feb, 2015

16 commits

  • Erik Hugne says:

    ====================
    tipc: bug fix and some improvements

    Most important is a fix for a nullptr exception that would occur when
    name table subscriptions fail. The remaining patches are performance
    improvements and cosmetic changes.

    v2: remove unnecessary whitespace in patch #2
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • With the exception of infiniband media which does not use media
    offsets, the media address is always located at offset 4 in the
    media info field as defined by the protocol, so we move the
    definition to the generic bearer.h

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • The TIPC_MEDIA_ADDR_SIZE and TIPC_MEDIA_ADDR_OFFSET names
    are misleading, as they actually define the size and offset of
    the whole media info field and not the address part. This patch
    does not have any functional changes.

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • If a bearer is disabled by manual intervention, all links over that
    bearer should be purged, indicated with the 'shutting_down' flag.
    Otherwise tipc will get confused if a new bearer is enabled using
    a different media type.

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • If a subscription request is sent to a topology server
    connection, and any error occurs (malformed request, oom
    or limit reached) while processing this request, TIPC should
    terminate the subscriber connection. While doing so, it tries
    to access fields in an already freed (or never allocated)
    subscription element, leading to a null pointer exception.
    We fix this by removing the subscr_terminate function and
    terminate the connection immediately upon any subscription
    failure.

    Signed-off-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • The TIPC name distributor pushes topology updates to the cluster
    neighbors. Currently this is done in a unicast manner, and the
    skb holding the update is cloned for each cluster member. This
    is unnecessary, as we only modify the destnode field in the header
    so we change it to do pskb_copy instead.

    Signed-off-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • This patch allows TSO being set/unset on the master, so that GSO
    segmentation is done after team layer.

    Similar patch is present for bonding:
    b0ce3508b25e ("bonding: allow TSO being set on bonding master")
    and bridge:
    f902e8812ef6 ("bridge: Add ability to enable TSO")

    Suggested-by: Jiri Prochazka
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Alexander Duyck says:

    ====================
    fib_trie: Remove leaf_info structure

    This patch set removes the leaf_info structure from the IPv4 fib_trie. The
    general idea is that the leaf_info structure itself only held about 6
    actual bits of data, beyond that it was mostly just waste. As such we can
    drop the structure, move the 1 byte representing the prefix/suffix length
    into the fib_alias and just link it all into one list.

    My testing shows that this saves somewhere between 4 to 10ns depending on
    the type of test performed. I'm suspecting that this represents 1 to 2 L1
    cache misses saved per look-up.

    One side effect of this change is that semantic_match_miss will now only
    increment once per leaf instead of once per leaf_info miss. However the
    stat is already skewed now that we perform a preliminary check on the leaf
    as a part of the look-up.

    I also have gone through and addressed a number of ordering issues in the
    first patch since I had misread the behavior of list_add_tail.

    I have since run some additional testing and verified the resulting lists
    are in the same order when combining multiple prefix length and tos values
    in a single leaf.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • At this point the leaf_info hash is redundant. By adding the suffix length
    to the fib_alias hash list we no longer have need of leaf_info as we can
    determine the prefix length from fa_slen. So we can compress things by
    dropping the leaf_info structure from fib_trie and instead directly connect
    the leaves to the fib_alias hash list.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Make use of an empty spot in the alias to store the suffix length so that
    we don't need to pull that information from the leaf_info structure.

    This patch also makes a slight change to the user statistics. Instead of
    incrementing semantic_match_miss once per leaf_info miss we now just
    increment it once per leaf if a match was not found.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This replaces the prefix length variable in the leaf_info structure with a
    suffix length value, or host identifier length in bits. By doing this it
    makes it easier to sort out since the tnodes and leaf are carrying this
    value as well since it is compatible with the ->pos field in tnodes.
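
    For IPv4 the relation between the two values is simple arithmetic. The
    sketch below assumes a 32-bit key; the helper names are hypothetical and
    are not taken from fib_trie.

```c
#include <stdint.h>

/* Key width in bits for IPv4 addresses (assumed here to be 32). */
#define KEYLENGTH 32

/* Suffix length: number of host bits left after the prefix.
 * plen must be in the range 0..KEYLENGTH. */
uint8_t plen_to_slen(uint8_t plen)
{
	return (uint8_t)(KEYLENGTH - plen);
}

/* Inverse mapping, for code that still needs the prefix length.
 * slen must be in the range 0..KEYLENGTH. */
uint8_t slen_to_plen(uint8_t slen)
{
	return (uint8_t)(KEYLENGTH - slen);
}
```

    A /24 route therefore has a suffix length of 8 and a host route (/32)
    has a suffix length of 0.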

    I also cleaned up one spot that had some list manipulation that could be
    simplified. I basically updated it so that we just use hlist_add_head_rcu
    instead of calling hlist_add_before_rcu on the first node in the list.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • There isn't any advantage to having it as a list and by making it an hlist
    we make the fib_alias more compatible with the leaf_info in terms of the
    type of list used.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Madhu Challa says:

    ====================
    Multicast group join/leave at ip level

    This series enables configuring multicast group join/leave at ip level
    by extending the "ip address" command.

    It adds a new control socket mc_autojoin_sock and ifa_flag IFA_F_MCAUTOJOIN
    to invoke the corresponding igmp group join/leave api.

    Since the igmp group join/leave api takes the rtnl_lock the code had to
    be refactored by adding a shim layer prefixed by __ that can be invoked
    by code that already has the rtnl_lock. This way we avoid proliferation of
    work queues.

    The first patch in this series does the refactoring for igmp v6.
    It's based on igmp v4 changes that were added by Eric Dumazet.

    The second patch in this series does the group join/leave based on the
    setting of the IFA_F_MCAUTOJOIN flag.

    v5:
    - addressed comments from Daniel Borkmann.
    - removed blank line in patch 1/2
    - removed unused variable, const arg in patch 2/2
    v4:
    - addressed comments from Yoshifuji Hideaki.
    - Remove WARN_ON not needed because we return a value from v2.
    - addressed comments from Daniel Borkmann.
    - rename sock to mc_autojoin_sk
    - ip_mc_config() pass ifa so it needs one less argument.
    - igmp_net_{init|destroy}() use inet_ctl_sock_{create|destroy}
    - inet_rtm_newaddr() change scope of ret.
    - igmp_net_init() no need to initialize sock to NULL.
    v3:
    - addressed comments from David Miller.
    - fixed indentation and local variable order.
    v2:
    - addressed comments from Eric Dumazet.
    - removed workqueue and call __ip_mc_{join|leave}_group or
    __ipv6_sock_mc_{join|drop}
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Joining multicast group on ethernet level via "ip maddr" command would
    not work if we have an Ethernet switch that does igmp snooping since
    the switch would not replicate multicast packets on ports that did not
    have IGMP reports for the multicast addresses.

    Linux vxlan interfaces created via "ip link add vxlan" have the group option
    that enables them to do the required join.

    By extending ip address command with option "autojoin" we can get similar
    functionality for openvswitch vxlan interfaces as well as other tunneling
    mechanisms that need to receive multicast traffic. The kernel code is
    structured similar to how the vxlan driver does a group join / leave.

    example:
    ip address add 224.1.1.10/24 dev eth5 autojoin
    ip address del 224.1.1.10/24 dev eth5

    Signed-off-by: Madhu Challa
    Signed-off-by: David S. Miller

    Madhu Challa
     
  • Based on the igmp v4 changes from Eric Dumazet.
    959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")

    These changes are needed to perform igmp v6 join/leave while
    RTNL is held.

    Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
    __ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
    proliferation of work queues.

    Signed-off-by: Madhu Challa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Madhu Challa
     
  • In the unlikely event that skb_get_hash is unable to deduce a hash
    in udp_flow_src_port we use a consistent random value instead.
    This is specified in GRE/UDP draft section 3.2.1:
    https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04
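
    The fallback can be sketched as below. This is a simplified illustration:
    the function name and parameters are hypothetical, the fallback value is
    passed in rather than generated, and any extra hash mixing done by the
    kernel is left out.

```c
#include <stdint.h>

/* Map a flow hash onto the port range [min, max). When no hash could be
 * computed (hash == 0), use the caller-supplied fallback instead, which is
 * meant to be a random value chosen once and then kept constant.
 * Requires min < max. */
uint16_t flow_src_port(uint32_t hash, uint32_t fallback,
		       uint16_t min, uint16_t max)
{
	if (hash == 0)
		hash = fallback;

	/* Scale the 32-bit hash into the width of the port range. */
	return (uint16_t)((((uint64_t)hash * (uint32_t)(max - min)) >> 32) + min);
}
```

    Because the fallback is constant, packets without a usable hash still
    map to one stable source port instead of being spread at random.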

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

27 Feb, 2015

5 commits

    My previous patch skipped vlan range optimizations during skb size
    calculations for simplicity.

    This incremental patch considers vlan ranges during
    skb size calculations. This leads to a bit of code duplication
    in the fill and size calculation functions, but I could not find a
    prettier way to do this. Will take any suggestions.

    Previously, I had reused the existing br_get_link_af_size size calculation
    function to calculate skb size for notifications. Reusing it this time
    around creates some change in behaviour issues for the usual
    .get_link_af_size callback.

    This patch adds a new br_get_link_af_size_filtered() function to
    base the size calculation on the incoming filter flag and include
    vlan ranges.

    Signed-off-by: Roopa Prabhu
    Reviewed-by: Scott Feldman
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Scott Feldman says:

    ====================
    rocker cleanups

    Pushing out some rocker cleanups I've had in my queue for a while. Nothing
    major, just some sync-up with changes that already went into device code
    (hard-coding desc err return values and lport renaming). Also fixup
    port forwarding transitions prompted by some DSA discussions about how to
    restore port state when port leaves bridge.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Cleanup the port forwarding state transitions for the cases when the port
    joins or leaves a bridge, or is brought admin UP or DOWN. When port is
    bridged, we can rely on bridge driver putting port in correct state using
    STP callback into port driver, regardless if bridge is enabled for STP or not.
    When port is not bridged, we can reuse some of the STP code to enable or
    disable forwarding depending on UP or DOWN.

    Tested by trying all the transitions from bridge/not bridge, and UP/DOWN, and
    verifying port is in the correct forwarding state after each transition.

    Signed-off-by: Scott Feldman
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Scott Feldman
     
  • This is just a rename of physical ports from "lport" to "pport". Not a
    functional change. OF-DPA uses logical ports (lport) for tunnels, but the
    driver (and device) were using "lport" for physical ports. Renaming physical
    ports references to "pport", freeing up "lport" for use later with tunnels.

    Signed-off-by: Scott Feldman
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Scott Feldman
     
  • The rocker device returns error codes if something goes wrong with descriptor
    processing. Originally the device used standard errno codes for different
    errors, but since those errno codes aren't portable across ARCHs, the device
    now returns hard-coded error codes that stay constant across diff ARCHs. Fix
    driver to use those same hard-coded values.
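
    One way to picture the translation is sketched below. The constant names
    and numeric values are illustrative assumptions only, not the values
    defined by the rocker device; the point is that the driver maps fixed
    device codes to the local errno values rather than using them directly.

```c
#include <errno.h>

/* Hypothetical fixed device error codes; the values are illustrative and
 * stay the same regardless of the host architecture. */
enum {
	DEV_OK = 0,
	DEV_ENOENT = 2,
	DEV_ENOMEM = 12,
	DEV_EINVAL = 22
};

/* Translate a device error code into a negative host errno value. */
int dev_err_to_errno(int dev_err)
{
	switch (dev_err) {
	case DEV_OK:
		return 0;
	case DEV_ENOENT:
		return -ENOENT;
	case DEV_ENOMEM:
		return -ENOMEM;
	case DEV_EINVAL:
		return -EINVAL;
	default:
		/* Unknown device codes are reported as invalid input. */
		return -EINVAL;
	}
}
```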

    Signed-off-by: Scott Feldman
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Scott Feldman
     

26 Feb, 2015

6 commits

  • Jeff Kirsher says:

    ====================
    Intel Wired LAN Driver Updates 2015-02-24

    This series contains updates to i40e and i40evf only, which bumps their
    versions to i40e 1.2.9 and i40evf 1.2.3.

    Paul fixes i40e_debug_aq() for big endian machines by adding the
    appropriate LExx_TO_CPU wrappers.

    Catherine adds a requested speed variable to the link_status to store the
    last speeds we requested from the firmware and use the advertised speed
    settings in get_settings in ethtool now that we have it. Due to the
    new code addition, she also refactors get_settings to improve readability
    and to accommodate some of the longer lines of code by adding two
    functions i40e_get_settings_link_up() and i40e_get_settings_link_down().

    Carolyn adds a struct to the VSI struct to keep track of RXNFC settings
    done via ethtool. Adds more information to the interrupt vector
    names, specifically to the VF misc vector name so that we can distinguish
    between all the interrupts.

    Ashish enables the i40evf driver to enable debug prints via ethtool.

    Mitch updates i40e to enable packet split only when IOMMU is in use,
    since it shows a distinct advantage over the single-buffer path
    because it minimizes DMA mapping and unmapping. Also adds the receive
    routine in use to the features log message to be able to print the
    receive packet split status.

    Greg adds the ability to get, set and commit permanently the NPAR
    partition BW configuration through configfs. Enables an application
    to query the i40e driver's private flags to get the status of NPAR
    enablement via ethtool.

    Neerav adds support for bridge offload ndo_ops getlink and setlink
    to enable bridge hardware mode as per the mode set via IFLA_BRIDGE_MODE.
    The support is only enabled in the case of a PF VSI and not available for
    any other VSI type.

    Kevin fixes i40e by ensuring the BUF and FLAG_RD flags are set for
    indirect admin queue command.

    Vasu updates the driver to setup FCoE netdev device type as "fcoe", so that
    it shows up in sysfs as FCoE device.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • To avoid race conditions when using the ds->ports[] array,
    we need to check if the accessed port has been initialized.
    Introduce and use helper function dsa_is_port_initialized
    for that purpose and use it where needed.
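
    A minimal sketch of such a helper. The struct is a hypothetical
    simplification, and checking both the port mask and the array entry is
    an assumption about what "initialized" means here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 12

/* Stand-in for the switch structure: a mask of physical ports plus one
 * pointer per port that stays NULL until the slave device is set up. */
struct switch_sketch {
	uint32_t phys_port_mask;
	void *ports[SKETCH_MAX_PORTS];
};

/* A port counts as initialized only if it is a valid index, is present in
 * the mask, and its ports[] entry has been filled in. */
bool port_is_initialized(const struct switch_sketch *ds, int p)
{
	if (p < 0 || p >= SKETCH_MAX_PORTS)
		return false;
	return (ds->phys_port_mask & (1u << p)) && ds->ports[p] != NULL;
}
```

    Callers that walk ds->ports[] can then skip entries that a notifier
    reached before device creation had finished.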

    Signed-off-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Guenter Roeck
     
  • Florian Fainelli says:

    ====================
    net: dsa: integration with SWITCHDEV for HW bridging

    This patch set provides the DSA and SWITCHDEV integration bits together and
    modifies the bcm_sf2 driver accordingly such that it works properly with HW
    bridging.

    Changes in v3:

    - add back the null pointer check in dsa_slave_br_port_mask from Guenter
    - slightly rework patch 1 commit message not to mention the function name
    we add in patch 2

    Changes in v2:

    - avoid a race condition in how DSA network devices are created, patch from
    Guenter Roeck
    - provide a consistent and working STP state once a port leaves the bridge
    - retain a bridge device pointer to properly flag port/bridge membership
    - properly flush the ARL (Address Resolution Logic) in bcm_sf2.c
    - properly retain port membership when individually bringing devices up/down
    while they are members of a bridge

    We discussed on the mailing-list the possibility of standardizing a "fdb_flush"
    operation for DSA switch drivers, looking at the Marvell and Broadcom switches,
    I am not convinced this is practical or desirable as the terminologies vary
    here, but there is nothing preventing us from doing it later.

    Many thanks to Guenter and Andrew for both testing and providing feedback.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
    Implement the bridge join, leave and set_stp callbacks by making sure
    that we do the following:

    - when a port joins the bridge, all existing ports in the bridge get
    their VLAN control register updated with that joining port
    - the joining port includes all existing bridge ports in its own
    VLAN control register

    The leave operation is fairly similar; special care must be taken to
    make sure that the port leaving the bridge is not removing itself from
    its own VLAN control register.

    Since the various BR_* states apply directly to our HW semantics, we
    just need to translate these constants into their corresponding HW
    settings, and voila!

    We make sure to trigger a fast-ageing process for ports that are
    joining/leaving the bridge and transition from incompatible states, this
    is equivalent to triggering an ARL flush for that port.
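
    The join and leave bookkeeping can be modelled with plain bitmasks, as
    in the sketch below. All names are hypothetical and the per-port "VLAN
    control register" is just an array entry here, not real hardware access.

```c
#include <stdint.h>

#define SKETCH_NPORTS 8

/* Software model: one forwarding mask per port plus the set of ports that
 * are currently members of the bridge. */
struct sw_sketch {
	uint16_t vlan_ctl[SKETCH_NPORTS];
	uint16_t bridge_members;
};

/* Initially every port only has itself in its own mask. */
void sw_init(struct sw_sketch *s)
{
	for (int p = 0; p < SKETCH_NPORTS; p++)
		s->vlan_ctl[p] = (uint16_t)(1u << p);
	s->bridge_members = 0;
}

/* Join: existing members learn about the new port and vice versa. */
void port_join(struct sw_sketch *s, int port)
{
	for (int p = 0; p < SKETCH_NPORTS; p++) {
		if (!(s->bridge_members & (1u << p)))
			continue;
		s->vlan_ctl[p] |= (uint16_t)(1u << port);
		s->vlan_ctl[port] |= (uint16_t)(1u << p);
	}
	s->bridge_members |= (uint16_t)(1u << port);
}

/* Leave: drop the port from the remaining members and drop them from the
 * port, but never clear the leaving port's own bit. */
void port_leave(struct sw_sketch *s, int port)
{
	s->bridge_members &= (uint16_t)~(1u << port);
	for (int p = 0; p < SKETCH_NPORTS; p++) {
		if (!(s->bridge_members & (1u << p)))
			continue;
		s->vlan_ctl[p] &= (uint16_t)~(1u << port);
		s->vlan_ctl[port] &= (uint16_t)~(1u << p);
	}
}
```

    Removing the port from the member set before the loop is what keeps the
    leaving port from clearing its own bit.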

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • In order to support bridging offloads in DSA switch drivers, select
    NET_SWITCHDEV to get access to the port_stp_update and parent_get_id
    NDOs that we are required to implement.

    To facilitate the integration at the DSA driver level, we implement 3
    types of operations:

    - port_join_bridge
    - port_leave_bridge
    - port_stp_update

    DSA will resolve which switch ports are currently bridge port
    members, as some switch hardware/drivers need to know about that to limit
    the register programming to just the relevant registers (especially for
    slow MDIO buses).

    We also take care of setting the correct STP state when slave network
    devices are brought up/down while being bridge members.

    Finally, when a port is leaving the bridge, we make sure we set it in
    BR_STATE_FORWARDING state, otherwise the bridge layer would leave it
    disabled as a result of having left the bridge.

    Signed-off-by: Florian Fainelli
    Reviewed-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • A network device notifier can be called for one or more of the created
    slave devices before all slave devices have been registered. This can
    result in a mismatch between ds->phys_port_mask and the registered devices
    by the time the call is made, and it can result in a slave device being
    added to a bridge before its entry in ds->ports[] has been initialized.

    Rework the initialization code to initialize entries in ds->ports[] in
    dsa_slave_create. With this change, dsa_slave_create no longer needs
    to return slave_dev but can return an error code instead.

    Signed-off-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Guenter Roeck
     

25 Feb, 2015

1 commit