09 Oct, 2014

2 commits

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the device to start processing the new TX queue entries.

    skb->xmit_more means that the generic networking layer is guaranteed
    to call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling of mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.
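    The doorbell deferral described above can be sketched as a toy model
    (an illustrative Python simulation, not kernel code; the class and
    function names are hypothetical stand-ins for the driver-side logic):

```python
class ToyDriver:
    """Simulates a NIC driver that batches doorbell writes using xmit_more."""
    def __init__(self):
        self.tx_queue = []
        self.doorbell_rings = 0

    def ndo_start_xmit(self, skb, xmit_more):
        self.tx_queue.append(skb)
        # Ring the doorbell only when the stack has nothing more to hand us.
        if not xmit_more:
            self.doorbell_rings += 1

def qdisc_flush(driver, skbs):
    """Dequeue a burst; xmit_more is True for all but the last packet."""
    for i, skb in enumerate(skbs):
        driver.ndo_start_xmit(skb, xmit_more=(i < len(skbs) - 1))

drv = ToyDriver()
qdisc_flush(drv, ["pkt%d" % i for i in range(8)])
print(drv.doorbell_rings)  # 1: a single doorbell for the whole burst
```

    A burst of eight packets triggers one doorbell write instead of eight,
    which is the whole point of the optimization.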

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so expect more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.
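    The copybreak decision itself is simple; a minimal sketch, assuming a
    fixed threshold (illustrative Python, not the enic driver; the names
    are invented):

```python
RX_COPYBREAK = 256  # illustrative threshold; drivers expose it via ethtool

def receive(dma_buf, pkt_len):
    """Mimic the copybreak decision; returns (data, buffer_recycled)."""
    if pkt_len <= RX_COPYBREAK:
        # Small packet: copy into a right-sized allocation so the large
        # DMA buffer can go straight back onto the RX ring.
        return bytes(dma_buf[:pkt_len]), True
    # Large packet: hand the DMA buffer up; the ring gets a fresh one.
    return bytes(dma_buf[:pkt_len]), False

small, recycled_small = receive(bytearray(2048), 60)
large, recycled_large = receive(bytearray(2048), 1500)
print(recycled_small, recycled_large)  # True False
```

    Raising the threshold trades copy cost for better buffer reuse, which
    is why making it tunable per driver is useful.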

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.
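    The "linear pull" computation amounts to walking the headers until the
    payload starts. A rough sketch of the idea, assuming plain
    Ethernet/IPv4/TCP framing (toy_get_headlen is a made-up name; the real
    eth_get_headlen handles many more protocols):

```python
import struct

def toy_get_headlen(frame):
    """Flow-dissect a frame to find how many header bytes to pull
    (Ethernet + IPv4 + TCP only; a simplification of eth_get_headlen)."""
    ETH_HLEN = 14
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:           # not IPv4: pull just the link header
        return ETH_HLEN
    ihl = (frame[ETH_HLEN] & 0x0F) * 4
    proto = frame[ETH_HLEN + 9]
    if proto != 6:                    # not TCP: pull up to the IP header
        return ETH_HLEN + ihl
    doff = (frame[ETH_HLEN + ihl + 12] >> 4) * 4
    return ETH_HLEN + ihl + doff      # exactly the headers, no payload

# Minimal Ethernet/IPv4/TCP frame: 20-byte IP header, 20-byte TCP header.
ip = bytes([0x45]) + bytes(8) + bytes([6]) + bytes(10)
tcp = bytes(12) + bytes([5 << 4]) + bytes(7)
frame = bytes(12) + b"\x08\x00" + ip + tcp + b"payload"
print(toy_get_headlen(frame))  # 54 = 14 + 20 + 20
```

    The payload bytes stay in the page pool; only the 54 header bytes end
    up in the linear SKB area.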

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF can now load programs via a system call and has an extensive
    testsuite. From Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, from
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
     
  • David S. Miller
     

08 Oct, 2014

3 commits

  • Pull dmaengine updates from Dan Williams:
    "Even though this has fixes marked for -stable, given the size and the
    needed conflict resolutions this is 3.18-rc1/merge-window material.

    These patches have been languishing in my tree for a long while. The
    fact that I do not have the time to do proper/prompt maintenance of
    this tree is a primary factor in the decision to step down as
    dmaengine maintainer. That and the fact that the bulk of drivers/dma/
    activity is going through Vinod these days.

    The net_dma removal has not been in -next. It has developed simple
    conflicts against mainline and net-next (for-3.18).

    Continuing thanks to Vinod for staying on top of drivers/dma/.

    Summary:

    1/ Step down as dmaengine maintainer see commit 08223d80df38
    "dmaengine maintainer update"

    2/ Removal of net_dma, as it has been marked 'broken' since 3.13
    (commit 77873803363c "net_dma: mark broken"), without reports of
    performance regression.

    3/ Miscellaneous fixes"

    * tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
    net: make tcp_cleanup_rbuf private
    net_dma: revert 'copied_early'
    net_dma: simple removal
    dmaengine maintainer update
    dmatest: prevent memory leakage on error path in thread
    ioat: Use time_before_jiffies()
    dmaengine: fix xor sources continuation
    dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
    dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
    dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
    ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
    drivers: dma: Include appropriate header file in dca.c
    drivers: dma: Mark functions as static in dma_v3.c
    dma: mv_xor: Add DMA API error checks
    ioat/dca: Use dev_is_pci() to check whether it is pci device

    Linus Torvalds
     
  • There is no reason to emit a log message for these.

    Based upon a suggestion from Hannes Frederic Sowa.

    Signed-off-by: David S. Miller
    Acked-by: Hannes Frederic Sowa

    David S. Miller
     
  • Testing xmit_more support with netperf and connected UDP sockets,
    I found strange dst refcount false sharing.

    Current handling of IFF_XMIT_DST_RELEASE is not optimal.

    Dropping the dst in validate_xmit_skb() is certainly too late in case
    the packet was queued by cpu X but dequeued by cpu Y.

    The logical point to take care of drop/force is in __dev_queue_xmit()
    before even taking qdisc lock.

    As Julian Anastasov pointed out, need for skb_dst() might come from some
    packet schedulers or classifiers.

    This patch adds a new helper to cleanly express the needs of various
    drivers or qdiscs/classifiers.

    Drivers that need skb_dst() in their ndo_start_xmit() should call the
    following helper in their setup code instead of the prior:

    dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
    ->
    netif_keep_dst(dev);

    Instead of using a single bit, we use two bits: one that may be
    rebuilt by the bonding/team drivers, and one that is permanent and
    blocks IFF_XMIT_DST_RELEASE from being rebuilt in bonding/team.
    Eventually, we could add something smarter later.
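    The two-bit scheme might look roughly like this (Python sketch; the
    flag values and the bonding helper are invented for illustration, and
    only netif_keep_dst and IFF_XMIT_DST_RELEASE are real kernel names):

```python
# Hypothetical flag values; the real definitions live in netdevice.h.
IFF_XMIT_DST_RELEASE      = 1 << 0  # may be rebuilt by bonding/team
IFF_XMIT_DST_RELEASE_PERM = 1 << 1  # permanent: never rebuilt

def netif_keep_dst(dev):
    """Driver needs skb_dst() in ndo_start_xmit(): clear both release bits."""
    dev["priv_flags"] &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM)

def bonding_recompute(master, slaves):
    """The master may regain the release bit only if the permanent bit
    was never cleared and every slave still allows releasing the dst."""
    if master["priv_flags"] & IFF_XMIT_DST_RELEASE_PERM:
        if all(s["priv_flags"] & IFF_XMIT_DST_RELEASE for s in slaves):
            master["priv_flags"] |= IFF_XMIT_DST_RELEASE
        else:
            master["priv_flags"] &= ~IFF_XMIT_DST_RELEASE

dev = {"priv_flags": IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM}
netif_keep_dst(dev)
bonding_recompute(dev, [{"priv_flags": IFF_XMIT_DST_RELEASE}])
print(dev["priv_flags"])  # 0: the permanent bit keeps the dst attached
```

    Once netif_keep_dst() has run, no later recomputation can re-enable
    releasing the dst, which is the property the permanent bit provides.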

    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2014

5 commits


06 Oct, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains another batch with Netfilter/IPVS updates
    for net-next, they are:

    1) Add abstracted ICMP codes to the nf_tables reject expression. We
    introduce four reasons to reject using ICMP that overlap in IPv4
    and IPv6 from the semantic point of view. This should simplify the
    maintenance of dual stack rule-sets through the inet table.

    2) Move nf_send_reset() functions from header files to per-family
    nf_reject modules, suggested by Patrick McHardy.

    3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
    code now that br_netfilter can be modularized. Convert remaining spots
    in the network stack code.

    4) Use rcu_barrier() in the nf_tables module removal path to ensure that
    we don't leave objects that are still pending release via
    call_rcu (which would likely result in a crash).

    5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
    idea was to probe the word size based on the xtables match/target info
    size, but this assumption is wrong when you have to dump the information
    back to userspace.

    6) Allow filtering from prerouting and postrouting in the nf_tables bridge.
    In order to emulate the ebtables NAT chains (which are actually simple
    filter chains with no special semantics), we now support filtering from
    these hooks too.

    7) Add explicit module dependency between xt_physdev and br_netfilter.
    This provides a way to detect if the user needs br_netfilter from
    the configuration path. This should reduce the breakage of the
    br_netfilter modularization.

    8) Cleanup coding style in ip_vs.h, from Simon Horman.

    9) Fix crash in the recently added nf_tables masq expression. We have
    to register/unregister the notifiers to clean up the conntrack table
    entries from the module init/exit path, not from the rule addition /
    deletion path. From Arturo Borrero.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Oct, 2014

1 commit

    In the xmit path, we build a flowi6 which will be used for the output
    route lookup. We are sending a GRE packet, not an IPv4- or
    IPv6-encapsulated packet, thus the protocol should be IPPROTO_GRE.

    Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
    Reported-by: Matthieu Ternisien d'Ouville
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

04 Oct, 2014

1 commit

    This patch removes the fou[46]_gro_receive and fou[46]_gro_complete
    functions. The v4 or v6 variants were chosen for the UDP offloads
    based on the address family of the socket; this is not necessary
    or correct. Instead, this patch adds is_ipv6 to napi_gro_cb.
    This is set in udp6_gro_receive and unset in udp4_gro_receive. In
    fou_gro_receive the value is used to select the correct inet_offloads
    for the protocol of the outer IP header.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

03 Oct, 2014

4 commits


02 Oct, 2014

3 commits

    Call skb_set_inner_protocol to set the inner Ethernet protocol to the
    protocol being encapsulated by GRE before tunnel_xmit. This is
    needed for GSO if UDP encapsulation (fou) is being done.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV6
    before tunnel_xmit. This is needed if UDP encapsulation (fou) is
    being done.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
    skb_udp_tunnel_segment is the function called from udp4_ufo_fragment
    to segment a UDP tunnel packet. This function currently assumes the
    encapsulated payload is transparent Ethernet bridging (i.e. VXLAN
    encapsulation). This patch generalizes the function to
    operate on either an Ethertype or an IP protocol.

    The inner_protocol field must be set to the protocol of the inner
    header. This can now be either an Ethertype or an IP protocol
    (in a union). A new flag in the skbuff indicates which type is
    effective. skb_set_inner_protocol and skb_set_inner_ipproto
    helper functions were added to set the inner_protocol. These
    functions are called from the point where the tunnel encapsulation
    is occurring.

    When skb_udp_tunnel_segment is called, the function to segment the
    inner packet is selected based on the inner IP or Ethertype. In the
    case of an IP protocol encapsulation, the function is derived from
    inet[6]_offloads. In the case of Ethertype, skb->protocol is
    set to the inner_protocol and skb_mac_gso_segment is called. (GRE
    currently does this, but it might be possible to look up the protocol
    in offload_base and call the appropriate segmentation function
    directly.)
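    The dispatch described above can be modeled in miniature (illustrative
    Python; the lookup tables, constants, and function names are made up,
    only the selection logic mirrors the commit's description):

```python
# Toy model of the inner_protocol union: either an Ethertype or an IP
# protocol number, discriminated by a flag.
ENCAP_TYPE_ETHER, ENCAP_TYPE_IPPROTO = 0, 1

ipproto_gso = {4: "ipv4_gso_segment", 41: "ipv6_gso_segment"}
ethertype_gso = {0x6558: "teb_gso_segment"}  # transparent Ethernet bridging

def select_inner_segmenter(encap_type, inner_protocol):
    """Pick the segmentation routine for the inner packet, the way
    skb_udp_tunnel_segment chooses based on the inner_protocol union."""
    if encap_type == ENCAP_TYPE_IPPROTO:
        return ipproto_gso.get(inner_protocol)   # from inet[6]_offloads
    return ethertype_gso.get(inner_protocol)     # via skb_mac_gso_segment

print(select_inner_segmenter(ENCAP_TYPE_IPPROTO, 41))    # ipv6_gso_segment
print(select_inner_segmenter(ENCAP_TYPE_ETHER, 0x6558))  # teb_gso_segment
```

    The single discriminator flag is what lets one union field serve both
    VXLAN-style (Ethertype) and fou-style (IP protocol) tunnels.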

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Oct, 2014

1 commit

    Eric Dumazet noticed that all no-nexthop or no-gateway routes which
    are already marked DST_HOST (e.g. input routes) will always be
    invalidated during sk_dst_check. Thus per-socket dst caching had
    absolutely no effect and early demuxing had no effect.

    Thus this patch removes rt6i_genid: fn_sernum already gets modified during
    add operations, so we only must ensure we mutate fn_sernum during ipv6
    address remove operations. This is a fairly costly operation,
    but address removal should not happen that often. Also our mtu update
    functions do the same and we have heard no complaints so far. xfrm policy
    changes also cause a call into fib6_flush_trees. Also plug a hole in
    rt6_info (no cacheline changes).

    I verified via tracing that this change has effect.

    Cc: Eric Dumazet
    Cc: YOSHIFUJI Hideaki
    Cc: Vlad Yasevich
    Cc: Nicolas Dichtel
    Cc: Martin Lau
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

30 Sep, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    pull request: netfilter/ipvs updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next,
    most relevantly they are:

    1) Four patches to make the new nf_tables masquerading support
    independent of the x_tables infrastructure. This also resolves a
    compilation breakage if the masquerade target is disabled but the
    nf_tables masq expression is enabled.

    2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
    skbinfo extension that allows you to store packet metainformation in the
    elements. This can be used to fetch and restore this to the packets through
    the iptables SET target, patches from Anton Danilov.

    3) Add the hash:mac set type to ipset, from Jozsef Kadlecsik.

    4) Add simple weighted fail-over scheduler via Simon Horman. This provides
    a fail-over IPVS scheduler (unlike existing load balancing schedulers).
    Connections are directed to the appropriate server based solely on
    highest weight value and server availability, patch from Kenny Mathis.

    5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
    Simon Horman informs that the motivation for this is to allow more
    flexibility in the choice of IP version offered by both virtual-servers
    and real-servers as they no longer need to match: An IPv4 connection
    from an end-user may be forwarded to a real-server using IPv6 and
    vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
    and Julian Anastasov.

    6) Add a global generation ID to the nf_tables ruleset. When dumping from
    several different object lists, we need a way to identify that an update
    has occurred so userspace knows that it needs to refresh its lists. This
    also includes a new command to obtain the 32-bit generation ID. The
    lower 16 bits of this ID are also exposed through the res_id field
    in the nfnetlink header to quickly detect interference and retry when
    there is no risk of ID wraparound.

    7) Move br_netfilter out of the bridge core. The br_netfilter code is
    built into the bridge core by default. This causes problems of different
    kinds for people that don't want it: Jesper reported a performance drop
    due to the unconditional hook registration, and I remember reading
    complaints on netdev from people regarding the unexpected behaviour of
    our bridging stack when br_netfilter is enabled (fragmentation handling,
    layer 3 and upper inspection). People that still need this can easily
    undo the damage by modprobing the new br_netfilter module.

    8) Dump the set policy in nf_tables, which allows set parameterization,
    so userspace can keep user-defined preferences when saving the ruleset.
    From Arturo Borrero.

    9) Use the __seq_open_private() helper function to reduce boilerplate
    code in x_tables, from Rob Jones.

    10) Safer default behaviour in case you forget to load the protocol
    tracker. Daniel Borkmann and Florian Westphal detected that if your
    ruleset is stateful, you allow traffic to at least one single SCTP port
    and the SCTP protocol tracker is not loaded, then any SCTP traffic may
    pass through unfiltered. After this patch, connection tracking
    classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
    been compiled with support for these modules.
    ====================

    Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
    netfilter skbuff members around, and the netfilter tree adjusted the
    ifdef guards for the bridging info pointer.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Sep, 2014

6 commits

  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2014-09-25

    1) Remove useless hash_resize_mutex in xfrm_hash_resize().
    This mutex is used only there, but xfrm_hash_resize()
    can't be called concurrently at all. From Ying Xue.

    2) Extend policy hashing to prefixed policies based on
    prefix length thresholds. From Christophe Gouault.

    3) Make the policy hash table thresholds configurable
    via netlink. From Christophe Gouault.

    4) Remove the maximum authentication length for AH.
    This was needed to limit stack usage. We switched
    already to allocate space, so no need to keep the
    limit. From Herbert Xu.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • TCP maintains lists of skb in write queue, and in receive queues
    (in order and out of order queues)

    Scanning these lists both in input and output path usually requires
    access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq

    These fields are currently in two different cache lines, meaning we
    waste a lot of memory bandwidth when these queues are big and flows
    have either packet drops or packet reorders.

    We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
    this header is not used in the fast path. This allows TCP to search
    much faster in the skb lists.

    Even with regular flows, we save one cache line miss in fast path.
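    The layout idea can be checked with a toy control block (a
    Python/ctypes sketch; the struct below is a simplified stand-in, not
    the real TCP_SKB_CB, and the field set is illustrative):

```python
import ctypes

# Toy layout mirroring the idea: keep the hot fields (seq, end_seq, ...)
# together at the front and push the rarely-used header blob to the end.
class ToySkbCb(ctypes.Structure):
    _fields_ = [
        ("seq",       ctypes.c_uint32),    # hot: read on every queue walk
        ("end_seq",   ctypes.c_uint32),    # hot
        ("tcp_flags", ctypes.c_uint32),
        ("sacked",    ctypes.c_uint8),
        ("header",    ctypes.c_uint8 * 40),  # cold: IPCB/IP6CB storage
    ]

CACHE_LINE = 64
hot_end = ToySkbCb.end_seq.offset + ctypes.sizeof(ctypes.c_uint32)
print(hot_end <= CACHE_LINE)  # True: both hot fields share one cache line
```

    With the cold header blob last, a queue scan touching seq/end_seq on
    every skb hits a single cache line per control block.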

    Thanks to Christoph Paasch for noticing we need to cleanup
    skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
    and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
    that it needs. Let's not assume this, as the TCP stack might use a
    different place.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip6gre_tunnel_locate() should not return an existing tunnel if
    create is true. Otherwise it is possible to add the same
    tunnel multiple times without getting an error.

    So return NULL if the tunnel that should be created already
    exists.
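    The fixed contract can be sketched as (illustrative Python;
    toy_tunnel_locate is a stand-in for ip6gre_tunnel_locate and friends):

```python
def toy_tunnel_locate(tunnels, name, create):
    """Mimic the fixed locate() contract: when create is True, an
    already-existing tunnel is an error, signalled by returning None."""
    existing = tunnels.get(name)
    if existing is not None:
        return None if create else existing   # duplicate create fails
    if not create:
        return None                           # lookup miss
    tunnels[name] = {"name": name}
    return tunnels[name]

tunnels = {}
first = toy_tunnel_locate(tunnels, "gre1", create=True)
dup = toy_tunnel_locate(tunnels, "gre1", create=True)
print(first is not None, dup is None)  # True True
```

    Before the fix, the second create would have quietly returned the
    existing tunnel, so duplicate adds never produced an error.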

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • vti6_locate() should not return an existing tunnel if
    create is true. Otherwise it is possible to add the same
    tunnel multiple times without getting an error.

    So return NULL if the tunnel that should be created already
    exists.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • ip6_tnl_locate() should not return an existing tunnel if
    create is true. Otherwise it is possible to add the same
    tunnel multiple times without getting an error.

    So return NULL if the tunnel that should be created already
    exists.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     

28 Sep, 2014

1 commit

    Per commit 77873803363c ("net_dma: mark broken"), net_dma is no longer
    used and there is no plan to fix it.

    This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
    Reverting the remainder of the net_dma induced changes is deferred to
    subsequent patches.

    Marked for stable due to Roman's report of a memory leak in
    dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

    Cc: Dave Jiang
    Cc: Vinod Koul
    Cc: David Whipple
    Cc: Alexander Duyck
    Cc:
    Reported-by: Roman Gushchin
    Acked-by: David S. Miller
    Signed-off-by: Dan Williams

    Dan Williams
     

27 Sep, 2014

1 commit


26 Sep, 2014

3 commits

  • The send_check logic was only interesting in cases of TCP offload and
    UDP UFO where the checksum needed to be initialized to the pseudo
    header checksum. Now we've moved that logic into the related
    gso_segment functions so gso_send_check is no longer needed.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
    In udp[46]_ufo_send_check the UDP checksum is initialized to the pseudo
    header checksum. We can move this logic into udp[46]_ufo_fragment.
    After this change udp[46]_ufo_send_check is a no-op.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
    pseudo header checksum using __tcp_v[46]_send_check. We can move this
    logic into new tcp[46]_gso_segment functions to be done when
    ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
    the common case, possibly always true when taking GSO path). After this
    change tcp_v[46]_gso_send_check is a no-op.
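    The pseudo-header seeding these patches relocate can be sketched in
    miniature (Python; a simplified IPv4-only model, the function name is
    invented and the fold-to-16-bits step is included for clarity):

```python
import socket
import struct

def ipv4_pseudo_header_sum(saddr, daddr, length, proto):
    """Ones'-complement sum over the IPv4 pseudo header (source address,
    destination address, protocol, length), folded to 16 bits. This is
    the value the checksum field is seeded with before segmentation."""
    words = struct.unpack("!2H", socket.inet_aton(saddr))
    words += struct.unpack("!2H", socket.inet_aton(daddr))
    words += (proto, length)
    total = sum(words)
    while total >> 16:                 # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

# Seed the checksum field so hardware or the GSO path only has to add
# the transport header and payload bytes afterwards.
print(hex(ipv4_pseudo_header_sum("192.0.2.1", "192.0.2.2", 1500, 6)))
```

    Doing this once per segment inside the gso_segment function is what
    makes the separate send_check pass redundant.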

    Signed-off-by: Tom Herbert
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert
     

24 Sep, 2014

2 commits

    Current ICMP rate limiting uses the inetpeer cache, which is an RB tree
    protected by a lock, meaning that hosts can be stuck hard if all cpus
    want to check ICMP limits.

    When say a DNS or NTP server process is restarted, the inetpeer tree
    grows quickly and the machine comes to its knees.

    iptables can not help because the bottleneck happens before ICMP
    messages are even cooked and sent.

    This patch adds a new global limitation, using a token bucket filter,
    controlled by two new sysctls:

    icmp_msgs_per_sec - INTEGER
    Limit maximal number of ICMP packets sent per second from this host.
    Only messages whose type matches icmp_ratemask are
    controlled by this limit.
    Default: 1000

    icmp_msgs_burst - INTEGER
    icmp_msgs_per_sec controls number of ICMP packets sent per second,
    while icmp_msgs_burst controls the burst size of these packets.
    Default: 50

    Note that if we really want to send millions of ICMP messages per
    second, we might extend the idea and infrastructure added in commit
    04ca6973f7c1a ("ip: make IP identifiers less predictable"):
    add a token bucket in the ip_idents hash and no longer rely on inetpeer.
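    A minimal token bucket with the documented defaults behaves like this
    (Python sketch of the mechanism, not the kernel implementation):

```python
def make_bucket(rate, burst):
    """Bucket state: refill rate (tokens/sec), cap, tokens, last time."""
    return {"rate": rate, "burst": burst, "tokens": float(burst), "t": 0.0}

def allow(bucket, now):
    """Token bucket matching the icmp_msgs_per_sec / icmp_msgs_burst idea:
    refill at `rate` tokens per second, cap at `burst`, spend one token
    per ICMP message."""
    bucket["tokens"] = min(bucket["burst"],
                           bucket["tokens"] + (now - bucket["t"]) * bucket["rate"])
    bucket["t"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False

b = make_bucket(rate=1000, burst=50)   # the documented defaults
sent = sum(allow(b, now=0.0) for _ in range(100))
print(sent)  # 50: only the burst goes out instantly, the rest is dropped
```

    Because the bucket is global and lock-light, a flood of ICMP triggers
    no longer funnels every cpu through the inetpeer tree lock.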

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Conflicts:
    arch/mips/net/bpf_jit.c
    drivers/net/can/flexcan.c

    Both the flexcan and MIPS bpf_jit conflicts were cases of simple
    overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Sep, 2014

2 commits

  • RFC2710 (MLDv1), section 3.7. says:

    The length of a received MLD message is computed by taking the
    IPv6 Payload Length value and subtracting the length of any IPv6
    extension headers present between the IPv6 header and the MLD
    message. If that length is greater than 24 octets, that indicates
    that there are other fields present *beyond* the fields described
    above, perhaps belonging to a *future backwards-compatible* version
    of MLD. An implementation of the version of MLD specified in this
    document *MUST NOT* send an MLD message longer than 24 octets and
    MUST ignore anything past the first 24 octets of a received MLD
    message.

    RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
    presence of MLDv1 routers:

    In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
    operate in version 1 compatibility mode. [...] When Host
    Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
    on that interface. When Host Compatibility Mode is MLDv1, a host
    acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
    on that interface. [...]

    While section 8.3.1. specifies *router* behaviour regarding presence
    of MLDv1 routers:

    MLDv2 routers may be placed on a network where there is at least
    one MLDv1 router. The following requirements apply:

    If an MLDv1 router is present on the link, the Querier MUST use
    the *lowest* version of MLD present on the network. This must be
    administratively assured. Routers that desire to be compatible
    with MLDv1 MUST have a configuration option to act in MLDv1 mode;
    if an MLDv1 router is present on the link, the system administrator
    must explicitly configure all MLDv2 routers to act in MLDv1 mode.
    When in MLDv1 mode, the Querier MUST send periodic General Queries
    truncated at the Multicast Address field (i.e., 24 bytes long),
    and SHOULD also warn about receiving an MLDv2 Query (such warnings
    must be rate-limited). The Querier MUST also fill in the Maximum
    Response Delay in the Maximum Response Code field, i.e., the
    exponential algorithm described in section 5.1.3. is not used. [...]

    That means that we should not get queries from different versions of
    MLD. When there's an MLDv1 router present, MLDv2 enforces truncation
    and MRC == MRD (both fields overlap within the 24 octet range).

    Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
    address *listeners*:

    MLDv2 routers may be placed on a network where there are hosts
    that have not yet been upgraded to MLDv2. In order to be compatible
    with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
    mode. MLDv2 routers keep a compatibility mode per multicast address
    record. The compatibility mode of a multicast address is determined
    from the Multicast Address Compatibility Mode variable, which can be
    in one of the two following states: MLDv1 or MLDv2.

    The Multicast Address Compatibility Mode of a multicast address
    record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
    *received* for that multicast address. At the same time, the Older
    Version Host Present timer for the multicast address is set to Older
    Version Host Present Timeout seconds. The timer is re-set whenever a
    new MLDv1 Report is received for that multicast address. If the Older
    Version Host Present timer expires, the router switches back to
    Multicast Address Compatibility Mode of MLDv2 for that multicast
    address. [...]

    That means the following scenario can happen: hosts can act in MLDv1
    compatibility mode when they have previously received an MLDv1 query
    (or simply operate in MLDv1-only mode); and at the same time, an MLDv2
    router could start up and transmit MLDv2 startup query messages while
    being unaware of the current operational mode.

    Given RFC2710, section 3.7 we would need to answer to that with an MLDv1
    listener report, so that the router according to RFC3810, section 8.3.2.
    would receive that and internally switch to MLDv1 compatibility as well.

    Right now, and I believe since the initial implementation of MLDv2,
    Linux hosts just silently drop such MLDv2 queries instead of replying
    with an MLDv1 listener report, which prevents an MLDv2 router from
    going into fallback mode (until it receives other MLDv1 queries).

    Since the mapping of MRC to MRD in exactly such cases can make use of
    the exponential algorithm from 5.1.3, we cannot, strictly speaking, be
    aware in MLDv1 of the encoding in MRC; it also does not seem to be
    mentioned by the RFC. Since the encodings are the same up to 32767,
    assume that value as a hard upper limit to which we clamp in such a
    situation. We have asked one of the RFC authors about this, and he
    mentioned that there seem not to be any implementations that make use
    of the exponential algorithm in startup messages. In any case, this
    patch fixes this MLD interoperability issue.
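    The MRC decoding and the clamp discussed above can be sketched as
    follows (Python; the decode follows RFC 3810 section 5.1.3, while the
    clamp_v1 parameter is an illustrative stand-in for the fix):

```python
def mldv2_mrc_to_mrd(mrc, clamp_v1=False):
    """Decode an MLDv2 Maximum Response Code into a Maximum Response
    Delay (milliseconds) per RFC 3810 section 5.1.3. With clamp_v1,
    apply the hard upper limit discussed above for MLDv1 interop."""
    if mrc < 32768:
        return mrc                  # MLDv1 and MLDv2 encodings agree here
    if clamp_v1:
        return 32767                # MLDv1 cannot express larger values
    exp = (mrc >> 12) & 0x7
    mant = mrc & 0x0FFF
    return (mant | 0x1000) << (exp + 3)

print(mldv2_mrc_to_mrd(10000))                  # linear range: unchanged
print(mldv2_mrc_to_mrd(0x8000))                 # exponential range
print(mldv2_mrc_to_mrd(0x8000, clamp_v1=True))  # clamped for MLDv1
```

    Values below 32768 decode identically in both modes, which is exactly
    why 32767 is the natural clamp point.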

    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Unable to load various tunneling modules without this:

    [ 80.679049] fou: Unknown symbol udp_sock_create6 (err 0)
    [ 91.439939] ip6_udp_tunnel: Unknown symbol ip6_local_out (err 0)
    [ 91.439954] ip6_udp_tunnel: Unknown symbol __put_net (err 0)
    [ 91.457792] vxlan: Unknown symbol udp_sock_create6 (err 0)
    [ 91.457831] vxlan: Unknown symbol udp_tunnel6_xmit_skb (err 0)

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

20 Sep, 2014

3 commits

  • Functions supplied in ip6_udp_tunnel.c are only needed when IPV6 is
    selected. When IPV6 is not selected, those functions are stubbed out
    in udp_tunnel.h.

    ==================================================================
    net/ipv6/ip6_udp_tunnel.c:15:5: error: redefinition of 'udp_sock_create6'
    int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
    In file included from net/ipv6/ip6_udp_tunnel.c:9:0:
    include/net/udp_tunnel.h:36:19: note: previous definition of 'udp_sock_create6' was here
    static inline int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
    ==================================================================

    Fixes: fd384412e ("udp_tunnel: Seperate ipv6 functions into its own file")
    Reported-by: kbuild test robot
    Signed-off-by: Andy Zhou
    Signed-off-by: David S. Miller

    Andy Zhou
     
    Added netlink handling of IP tunnel encapsulation parameters and
    properly adjust the MTU for encapsulation. Added an ip_tunnel_encap
    call to ipip6_tunnel_xmit to actually perform FOU encapsulation.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Want to be able to use these in foo-over-udp offloads, etc.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert