08 Oct, 2016

1 commit

  • Pull VFS splice updates from Al Viro:
    "There's a bunch of branches this cycle, both mine and from other folks
    and I'd rather send pull requests separately.

    This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
    (and introduction of such). Gets rid of a lot of code in fs/splice.c
    and elsewhere; there will be followups, but these are for the next
    cycle... Some pipe/splice-related cleanups from Miklos in the same
    branch as well"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    pipe: fix comment in pipe_buf_operations
    pipe: add pipe_buf_steal() helper
    pipe: add pipe_buf_confirm() helper
    pipe: add pipe_buf_release() helper
    pipe: add pipe_buf_get() helper
    relay: simplify relay_file_read()
    switch default_file_splice_read() to use of pipe-backed iov_iter
    switch generic_file_splice_read() to use of ->read_iter()
    new iov_iter flavour: pipe-backed
    fuse_dev_splice_read(): switch to add_to_pipe()
    skb_splice_bits(): get rid of callback
    new helper: add_to_pipe()
    splice: lift pipe_lock out of splice_to_pipe()
    splice: switch get_iovec_page_array() to iov_iter
    splice_to_pipe(): don't open-code wakeup_pipe_readers()
    consistent treatment of EFAULT on O_DIRECT read/write

    Linus Torvalds
     

04 Oct, 2016

2 commits

  • skb_vlan_pop/push were too generic, trying to support the cases where
    skb->data is at mac header, and cases where skb->data is arbitrarily
    elsewhere.

    Supporting an arbitrary skb->data was complex and bogus:
    - It failed to unwind skb->data to its original location after the
    actual pop/push.
    (Also, the unwind semantics are not well defined: if data was within
    the eth header, the same offset from the start should be used; but if
    data was at the network header or beyond, the original offset needs to
    be adjusted according to the push/pull.)
    - It mangled the rcsum after the actual push/pop, without taking into
    account that the eth bytes might already have been pulled out of the
    csum.

    Most callers (ovs, bpf) already had their skb->data at mac_header upon
    invoking skb_vlan_pop/push.
    Last caller that failed to do so (act_vlan) has been recently fixed.

    Therefore, to simplify things, no longer support arbitrary skb->data
    inputs for skb_vlan_pop/push().

    skb->data is expected to be exactly at mac_header; WARN otherwise.
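
    As an illustration of the simplified contract (not part of the patch;
    the example_ helper below is made up for this sketch), callers can now
    rely on skb->data sitting exactly at the mac header, with a WARN
    otherwise:

        #include <linux/skbuff.h>
        #include <linux/if_vlan.h>

        /* Illustrative only: mirrors the skb->data-at-mac-header contract
         * described above; example_vlan_pop() is not a kernel function.
         */
        static int example_vlan_pop(struct sk_buff *skb)
        {
                if (WARN_ONCE(skb->data != skb_mac_header(skb),
                              "skb_vlan_pop: skb->data not at mac header\n"))
                        return -EINVAL;

                return skb_vlan_pop(skb);
        }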

    Signed-off-by: Shmulik Ladkani
    Cc: Daniel Borkmann
    Cc: Pravin Shelar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     
  • since pipe_lock is the outermost now, we don't need to drop/regain
    socket locks around the call of splice_to_pipe() from skb_splice_bits(),
    which kills the need to have a socket-specific callback; we can just
    call splice_to_pipe() and be done with that.

    Signed-off-by: Al Viro

    Al Viro
     

22 Sep, 2016

3 commits

  • Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
    skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
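
    As an illustration, the open-coded comparison and its eth_type_vlan()
    replacement look roughly like this (the example_ names are made up):

        #include <linux/skbuff.h>
        #include <linux/if_vlan.h>

        /* Before: open-coded ethertype comparison */
        static bool example_is_vlan_open_coded(const struct sk_buff *skb)
        {
                return skb->protocol == htons(ETH_P_8021Q) ||
                       skb->protocol == htons(ETH_P_8021AD);
        }

        /* After: the eth_type_vlan() helper covers both vlan ethertypes */
        static bool example_is_vlan(const struct sk_buff *skb)
        {
                return eth_type_vlan(skb->protocol);
        }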

    Signed-off-by: Shmulik Ladkani
    Reviewed-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     
  • In 93515d53b1
    "net: move vlan pop/push functions into common code"
    skb_vlan_pop was moved from its private location in openvswitch to
    skbuff common code.

    In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
    that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
    considered a no-op).

    This validation was moved as is into the new common 'skb_vlan_pop'.

    Alas, in its original location (openvswitch), there was a guarantee that
    'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
    condition made sense.
    However there's no such guarantee in the generic 'skb_vlan_pop'.

    For short packets received in the rx path going through 'skb_vlan_pop',
    this causes 'skb_vlan_pop' to fail to pop a valid vlan hdr (in the non
    hw-accel case) or to fail to move the next tag into the hw-accel tag.

    Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
    It is superfluous since inner '__skb_vlan_pop' already verifies there
    are VLAN_ETH_HLEN writable bytes at the mac_header.

    Note this presents a slight change to skb_vlan_pop() users:
    In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
    returns an error, as opposed to previous "no-op" behavior.
    Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
    'skb_vlan_pop' fails.
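
    A hedged sketch of what that means for callers (the example_ function
    is made up; the drop-on-error behavior matches what tc act_vlan and ovs
    already do):

        #include <linux/skbuff.h>

        static int example_handle_vlan_pop(struct sk_buff *skb)
        {
                int err = skb_vlan_pop(skb);

                if (err) {
                        /* A frame shorter than VLAN_ETH_HLEN is now an
                         * error rather than a silent no-op; drop it.
                         */
                        kfree_skb(skb);
                        return err;
                }
                return 0;
        }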

    Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
    Signed-off-by: Shmulik Ladkani
    Cc: Pravin Shelar
    Reviewed-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     
  • This exports the functionality of extracting the tag from the payload,
    without moving next vlan tag into hw accel tag.

    Signed-off-by: Shmulik Ladkani
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     

20 Sep, 2016

1 commit

  • Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
    gro may build buffers with a frag_list. This can hurt forwarding
    because most NICs can't offload such packets, so they need to be
    segmented in software. This patch splits buffers with a frag_list
    at the frag_list pointer into buffers that can be TSO offloaded.

    Signed-off-by: Steffen Klassert
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Steffen Klassert
     

09 Sep, 2016

1 commit

  • Over the years, TCP BDP has increased by several orders of magnitude,
    and some people are considering reaching the 2 Gbytes limit.

    Even with the current window scale limit of 14, ~1 Gbyte maps to
    ~740,000 MSS.

    In presence of packet losses (or reorders), TCP stores incoming packets
    into an out of order queue, and number of skbs sitting there waiting for
    the missing packets to be received can be in the 10^5 range.

    Most packets are appended to the tail of this queue, and when
    packets can finally be transferred to receive queue, we scan the queue
    from its head.

    However, in presence of heavy losses, we might have to find an arbitrary
    point in this queue, involving a linear scan for every incoming packet,
    throwing away cpu caches.

    This patch converts it to a RB tree, to get bounded latencies.

    Yaogong wrote a preliminary patch about 2 years ago.
    Eric did the rebase, added ofo_last_skb cache, polishing and tests.

    Tested with network dropping between 1 and 10 % packets, with good
    success (about 30 % increase of throughput in stress tests)

    Next step would be to also use an RB tree for the write queue at sender
    side ;)
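
    A hedged sketch of the data structure change (example_ofo_insert() is
    illustrative, not the actual tcp_data_queue_ofo() code): out-of-order
    skbs are keyed by their starting sequence number in an rbtree.

        #include <linux/rbtree.h>
        #include <linux/skbuff.h>
        #include <net/tcp.h>

        /* Insert an skb into an ooo rbtree ordered by TCP_SKB_CB(skb)->seq,
         * giving O(log n) insertion instead of a linear list walk.
         */
        static void example_ofo_insert(struct rb_root *root,
                                       struct sk_buff *skb)
        {
                struct rb_node **p = &root->rb_node, *parent = NULL;
                u32 seq = TCP_SKB_CB(skb)->seq;

                while (*p) {
                        struct sk_buff *cur;

                        cur = rb_entry(*p, struct sk_buff, rbnode);
                        parent = *p;
                        if (before(seq, TCP_SKB_CB(cur)->seq))
                                p = &parent->rb_left;
                        else
                                p = &parent->rb_right;
                }
                rb_link_node(&skb->rbnode, parent, p);
                rb_insert_color(&skb->rbnode, root);
        }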

    Signed-off-by: Yaogong Wang
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Ilpo Järvinen
    Acked-By: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yaogong Wang
     

02 Jul, 2016

1 commit

  • Similar to commit 9b368814b336 ("net: fix bridge multicast packet checksum validation")
    we need to fixup the checksum for CHECKSUM_COMPLETE when
    pushing skb on RX path. Otherwise we get similar splats.
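
    A hedged sketch of the pattern the fix relies on (the example_ function
    is made up): bytes pushed back in front of skb->data on the RX path
    have to be folded into the complete checksum, e.g. via
    skb_postpush_rcsum().

        #include <linux/skbuff.h>

        static void example_push_and_fix_csum(struct sk_buff *skb,
                                              unsigned int len)
        {
                __skb_push(skb, len);
                /* extend CHECKSUM_COMPLETE to cover the re-exposed bytes */
                skb_postpush_rcsum(skb, skb->data, len);
        }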

    Cc: Jamal Hadi Salim
    Cc: Tom Herbert
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

04 Jun, 2016

5 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • SCTP has this peculiarity that its packets cannot be just segmented to
    (P)MTU. Its chunks must be contained in IP segments, padding respected.
    So we can't just generate a big skb, set gso_size to the fragmentation
    point and deliver it to IP layer.

    This patch takes a different approach. SCTP will now build a skb as it
    would be if it was received using GRO. That is, there will be a cover
    skb with protocol headers and children ones containing the actual
    segments, already segmented to a way that respects SCTP RFCs.

    With that, we can tell skb_segment() to just split based on frag_list,
    trusting its sizes are already in accordance.

    This way SCTP can benefit from GSO and instead of passing several
    packets through the stack, it can pass a single large packet.

    v2:
    - Added support for receiving GSO frames, as requested by Dave Miller.
    - Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
    - Added heuristics similar to what we have in TCP for not generating
    single GSO packets that fill cwnd.
    v3:
    - consider sctphdr size in skb_gso_transport_seglen()
    - rebased due to 5c7cdf339af5 ("gso: Remove arbitrary checks for
    unsupported GSO")

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • skb_gso_network_seglen is not enough for checking fragment sizes if
    the skb is using GSO_BY_FRAGS, as we have to check frag by frag.

    This patch introduces skb_gso_validate_mtu, based on the former, which
    will wrap the use case inside it, as all calls to skb_gso_network_seglen
    were to validate if it fits a given MTU, and improve the check.
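
    A hedged usage sketch (example_fits_mtu() is made up;
    skb_gso_validate_mtu() is the helper this patch introduces):

        #include <linux/skbuff.h>

        /* Can this skb be sent over a path with the given MTU without
         * fragmentation?
         */
        static bool example_fits_mtu(const struct sk_buff *skb,
                                     unsigned int mtu)
        {
                if (skb_is_gso(skb))
                        return skb_gso_validate_mtu(skb, mtu);

                return skb->len <= mtu;
        }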

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • This patch allows segmenting a skb based on its frags sizes instead of
    based on a fixed value.

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • sctp GSO requires it and sctp can be compiled as a module, so we need to
    export this function.

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

11 May, 2016

1 commit

  • There are two instances of an unused variable, `doff' added by
    commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")
    in pskb_carve_inside_header() and pskb_carve_inside_nonlinear().
    Remove these instances, they are not used.

    Reported-by: Daniel Borkmann
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

05 May, 2016

2 commits

  • This patch addresses a possible issue that can occur if we get into any odd
    corner cases where we support TSO for a given protocol but not the checksum
    or scatter-gather offload. There are a few drivers floating around that
    set up their tunnels this way, and by enforcing the checksum piece we can
    avoid mangling any frames.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • In the event that the number of partial segments is equal to 1 we don't
    really need to perform partial segmentation offload. As such we should
    skip multiplying the MSS and instead just clear the partial_segs value
    since it will not provide any gain to advertise the frame as being GSO when
    it is a single frame.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

26 Apr, 2016

1 commit

  • A pattern of skb usage seen in modules such as RDS-TCP is to
    extract `to_copy' bytes from the received TCP segment, starting
    at some offset `off' into a new skb `clone'. This is done in
    the ->data_ready callback, where the clone skb is queued up for rx on
    the PF_RDS socket, while the parent TCP segment is returned unchanged
    back to the TCP engine.

    The existing code uses the sequence
    clone = skb_clone(..);
    pskb_pull(clone, off, ..);
    pskb_trim(clone, to_copy, ..);
    with the intention of discarding the first `off' bytes. However,
    skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
    up doing a redundant memcpy of bytes that will then get discarded
    in __pskb_pull_tail().

    To avoid this inefficiency, this commit adds pskb_extract() that
    creates the clone, and memcpy's only the relevant header/frag/frag_list
    to the start of `clone'. pskb_trim() is then invoked to trim clone
    down to the requested to_copy bytes.
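
    A hedged sketch of the two patterns side by side (example_extract() is
    made up; pskb_extract() is the helper this commit adds):

        #include <linux/skbuff.h>

        static struct sk_buff *example_extract(struct sk_buff *skb, int off,
                                               int to_copy)
        {
                /* Old pattern, paying for pskb_expand_head():
                 *
                 *      clone = skb_clone(skb, GFP_ATOMIC);
                 *      pskb_pull(clone, off);
                 *      pskb_trim(clone, to_copy);
                 */

                /* New helper: copy only the relevant header/frags/frag_list
                 * into the clone, trimmed to to_copy bytes.
                 */
                return pskb_extract(skb, off, to_copy, GFP_ATOMIC);
        }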

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

16 Apr, 2016

1 commit

  • When __vlan_insert_tag() fails from skb_vlan_push() path due to the
    skb_cow_head(), we need to undo the __skb_push() in the error path
    as well that was done earlier to move skb->data pointer to mac header.

    Moreover, I noticed that when in the non-error path the __skb_pull()
    is done and the original offset to mac header was non-zero, we fixup
    from a wrong skb->data offset in the checksum complete processing.

    So the skb_postpush_rcsum() really needs to be done before __skb_pull()
    where skb->data still points to the mac header start and thus operates
    under the same conditions as in __vlan_insert_tag().
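
    A simplified sketch of the corrected ordering described above (the
    example_ function is made up and omits details of the real
    skb_vlan_push()):

        #include <linux/skbuff.h>
        #include <linux/if_vlan.h>

        static int example_insert_vlan(struct sk_buff *skb, __be16 proto,
                                       u16 tci)
        {
                unsigned int offset = skb->data - skb_mac_header(skb);
                int err;

                __skb_push(skb, offset);          /* data -> mac header */
                err = __vlan_insert_tag(skb, proto, tci);
                if (err) {
                        __skb_pull(skb, offset);  /* error path: undo push */
                        return err;
                }
                /* the 4 inserted bytes sit right after the MAC addresses;
                 * fix the csum before pulling skb->data away from the
                 * mac header
                 */
                skb_postpush_rcsum(skb, skb->data + 2 * ETH_ALEN, VLAN_HLEN);
                __skb_pull(skb, offset);          /* restore original offset */
                return 0;
        }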

    Fixes: 93515d53b133 ("net: move vlan pop/push functions into common code")
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

15 Apr, 2016

1 commit

  • This patch adds support for something I am referring to as GSO partial.
    The basic idea is that we can support a broader range of devices for
    segmentation if we use fixed outer headers and have the hardware only
    really deal with segmenting the inner header. The idea behind the naming
    is due to the fact that everything before csum_start will be fixed headers,
    and everything after will be the region that is handled by hardware.

    With the current implementation it allows us to add support for the
    following GSO types with an inner TSO_MANGLEID or TSO6 offload:
    NETIF_F_GSO_GRE
    NETIF_F_GSO_GRE_CSUM
    NETIF_F_GSO_IPIP
    NETIF_F_GSO_SIT
    NETIF_F_UDP_TUNNEL
    NETIF_F_UDP_TUNNEL_CSUM

    In the case of hardware that already supports tunneling we may be able to
    extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
    hardware can support updating inner IPv4 headers.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

21 Mar, 2016

1 commit

  • TCP protocol is still used these days, and TCP uses
    clones in its transmit path. We can not optimize linux
    stack assuming it is mostly used in routers, or that TCP
    is dead.

    Fixes: 795bb1c00d ("net: bulk free infrastructure for NAPI context, use napi_consume_skb")
    Signed-off-by: Eric Dumazet
    Cc: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Mar, 2016

1 commit

  • Some drivers reuse/share code paths that free SKBs between NAPI
    and non-NAPI calls. Adjust napi_consume_skb to handle this
    use-case.

    Before, calls from netpoll (w/ IRQs disabled) were handled and
    indicated with a budget-zero indication. Use the same zero
    indication to handle calls not originating from NAPI/softirq,
    simply by using dev_consume_skb_any().

    This adds an extra branch+call for the netpoll case (checking
    in_irq() + irqs_disabled()), but that is okay as this is a slowpath.
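
    A hedged usage sketch for such a shared free path (example_free_tx_skb()
    is made up):

        #include <linux/skbuff.h>

        /* Pass the NAPI budget when called from a poll routine and 0
         * otherwise; with budget == 0, napi_consume_skb() falls back to
         * dev_consume_skb_any().
         */
        static void example_free_tx_skb(struct sk_buff *skb, int napi_budget)
        {
                napi_consume_skb(skb, napi_budget);
        }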

    Suggested-by: Alexander Duyck
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Mar, 2016

1 commit

  • After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
    skb_sender_cpu_clear() becomes empty and can be removed.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

26 Feb, 2016

1 commit

  • We need to update the skb->csum after pulling the skb, otherwise
    an unnecessary checksum (re)computation can occur for IGMP/MLD packets
    in the bridge code. Additionally this fixes the following splats for
    network devices / bridge ports with RX checksum offloading supported
    and enabled:

    [...]
    [ 43.986968] eth0: hw csum failure
    [ 43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2
    [ 43.996193] Hardware name: BCM2709
    [ 43.999647] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [ 44.007432] [] (show_stack) from [] (dump_stack+0x80/0x90)
    [ 44.014695] [] (dump_stack) from [] (__skb_checksum_complete+0x6c/0xac)
    [ 44.023090] [] (__skb_checksum_complete) from [] (ipv6_mc_validate_checksum+0x104/0x178)
    [ 44.032959] [] (ipv6_mc_validate_checksum) from [] (skb_checksum_trimmed+0x130/0x188)
    [ 44.042565] [] (skb_checksum_trimmed) from [] (ipv6_mc_check_mld+0x118/0x338)
    [ 44.051501] [] (ipv6_mc_check_mld) from [] (br_multicast_rcv+0x5dc/0xd00)
    [ 44.060077] [] (br_multicast_rcv) from [] (br_handle_frame_finish+0xac/0x51c)
    [...]
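
    A hedged sketch of the general rule behind the fix (the example_
    function is made up): when bytes are pulled from the front of an skb
    carrying CHECKSUM_COMPLETE, they must be subtracted from skb->csum,
    e.g. via skb_postpull_rcsum(), so that a later
    __skb_checksum_complete() does not see a mismatch.

        #include <linux/skbuff.h>

        static void example_pull_and_fix_csum(struct sk_buff *skb,
                                              unsigned int len)
        {
                const void *pulled = skb->data;

                __skb_pull(skb, len);
                /* drop the pulled bytes from CHECKSUM_COMPLETE */
                skb_postpull_rcsum(skb, pulled, len);
        }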

    Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
    Reported-by: Álvaro Fernández Rojas
    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller

    Linus Lüssing
     

12 Feb, 2016

2 commits

  • The network stack defers freeing SKBs in case the free happens in IRQ
    context or when IRQs are disabled. This happens in __dev_kfree_skb_irq(),
    which writes SKBs that were freed during IRQ to the softirq completion
    queue (softnet_data.completion_queue).

    These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
    in function net_tx_action(). Take advantage of this and use the skb
    defer and flush API, as we are already in softirq context.

    For modern drivers this rarely happens, although most drivers do call
    dev_kfree_skb_any(), which detects the situation and calls
    __dev_kfree_skb_irq() when needed, because netpoll can call from
    IRQ context.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Discovered that the network stack was hitting the kmem_cache/SLUB
    slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk
    can speed up this slowpath.

    NAPI context is a bit special; let's take advantage of that for bulk
    freeing SKBs.

    In NAPI context we are running in softirq, which gives us certain
    protection. A softirq can run on several CPUs at once. BUT the
    important part is a softirq will never preempt another softirq running
    on the same CPU. This gives us the opportunity to access per-cpu
    variables in softirq context.

    Extend napi_alloc_cache (before only contained page_frag_cache) to be
    a struct with a small array based stack for holding SKBs. Introduce a
    SKB defer and flush API for accessing this.

    Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
    when running in NAPI context. A small trick to handle/detect if we
    are called from netpoll is to see if budget is 0. In that case, we
    need to invoke dev_consume_skb_irq().

    Joint work with Alexander Duyck.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

11 Feb, 2016

4 commits

  • This patch enables us to use inner checksum offloads if provided by
    hardware with outer checksums computed by software.

    It basically reduces encap_hdr_csum to an advisory flag for now, but given
    that SCTP may be getting segmentation support before long, I thought we
    may want to keep it, as it is possible we may need to support CRC32c and
    1's complement checksums in the same packet at some point in the future.

    Signed-off-by: Alexander Duyck
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • The call skb_has_shared_frag is used in the GRE path and skb_checksum_help
    to verify that no frags can be modified by an external entity. This check
    really doesn't belong in the GRE path but in the skb_segment function
    itself. This way any protocol that might be segmented will be performing
    this check before attempting to offload a checksum to software.
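
    A hedged sketch of the kind of check involved (the example_ function is
    made up and only illustrates the idea):

        #include <linux/netdevice.h>
        #include <linux/skbuff.h>

        static int example_csum_if_safe(struct sk_buff *skb)
        {
                /* Frag pages shared with an external entity could change
                 * under us while we checksum them; copy them into the
                 * linear area first.
                 */
                if (skb_has_shared_frag(skb)) {
                        int err = skb_linearize(skb);

                        if (err)
                                return err;
                }
                return skb_checksum_help(skb);
        }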

    Signed-off-by: Alexander Duyck
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch addresses two main issues.

    First in the case of remote checksum offload we were avoiding dealing with
    scatter-gather issues. As a result it would be possible to assemble a
    series of frames that used frags instead of being linearized as they should
    have if remote checksum offload was enabled.

    Second I have updated the code so that we now let GSO take care of doing
    the checksum on the data itself and drop the special case that was added
    for remote checksum offload.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch moves the checksum maintained by GSO out of skb->csum and into
    the GSO context block, in order to allow us to work on outer checksums
    while maintaining the inner checksum offsets in the case where the inner
    checksum is offloaded and the outer checksums will be computed.

    While updating the code I also did a minor clean-up of gso_make_checksum.
    The change is mostly to make it so that we store the values and compute the
    checksum, instead of computing the checksum and then storing the values we
    needed to update.

    Signed-off-by: Alexander Duyck
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Alexander Duyck
     

09 Feb, 2016

1 commit

  • Devices may have limits on the number of fragments in an skb they support.
    The current codebase uses a constant as the maximum number of fragments
    one skb can hold and use.
    When enabling scatter/gather and running traffic with many small messages,
    the codebase uses the maximum number of fragments and may thereby violate
    the max for certain devices.
    The patch introduces a global variable for the maximum number of fragments.
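
    A hedged sketch of a caller-side check (example_can_add_frag() is made
    up; sysctl_max_skb_frags is the tunable this patch introduces, exposed
    as net.core.max_skb_frags):

        #include <linux/skbuff.h>

        static bool example_can_add_frag(const struct sk_buff *skb)
        {
                /* respect the runtime limit rather than MAX_SKB_FRAGS */
                return skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags;
        }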

    Signed-off-by: Hans Westgaard Ry
    Reviewed-by: Håkon Bugge
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hans Westgaard Ry
     

18 Dec, 2015

1 commit

  • Dmitry reported the following out-of-bound access:

    Call Trace:
    [] __asan_report_load4_noabort+0x3e/0x40
    mm/kasan/report.c:294
    [] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
    [< inline >] SYSC_setsockopt net/socket.c:1746
    [] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
    [] entry_SYSCALL_64_fastpath+0x16/0x7a
    arch/x86/entry/entry_64.S:185

    This is because we mistake a raw socket for a tcp socket.
    We should check both sk->sk_type and sk->sk_protocol to ensure
    it is a tcp socket.

    Willem points out __skb_complete_tx_timestamp() needs to be fixed as well.
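
    A hedged sketch of the check described above (example_is_tcp_sock() is
    made up):

        #include <linux/in.h>
        #include <net/sock.h>

        static bool example_is_tcp_sock(const struct sock *sk)
        {
                /* a raw IPPROTO_TCP socket has sk_type == SOCK_RAW, so
                 * checking the protocol alone is not enough
                 */
                return sk->sk_type == SOCK_STREAM &&
                       sk->sk_protocol == IPPROTO_TCP;
        }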

    Reported-by: Dmitry Vyukov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    WANG Cong
     

15 Dec, 2015

1 commit

  • skb_reorder_vlan_header is called after the vlan header has
    been pulled. As a result the offset of the beginning of
    the mac header has been increased by 4 bytes (VLAN_HLEN).
    When moving the mac addresses, include this increase in
    the offset calculation so that the mac addresses are
    copied correctly.

    Fixes: a6e18ff1117 (vlan: Fix untag operations of stacked vlans with REORDER_HEADER off)
    CC: Nicolas Dichtel
    CC: Patrick McHardy
    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

18 Nov, 2015

1 commit

  • When we have multiple stacked vlan devices, all of which have
    turned off the REORDER_HEADER flag, the untag operation does not
    locate the ethernet addresses correctly for nested vlans.
    The reason is that in case of the REORDER_HEADER flag being off,
    the outer vlan headers are put back and the mac_len is adjusted
    to account for the presence of the header. Then, the subsequent
    untag operation, for the next level vlan, always uses VLAN_ETH_HLEN
    to locate the beginning of the ethernet header, and that ends up
    being a multiple of 4 bytes short of the actual beginning
    of the mac header (the multiple depending on how many vlan
    encapsulations there are).

    As a result, if there are multiple levels of vlan devices
    with REORDER_HEADER being off, the received packets end up
    being dropped.

    To solve this, we use skb->mac_len as the offset. The value
    is always set on the receive path and starts out as ETH_HLEN.
    The value is also updated when the vlan header manipulations occur,
    so we know it will be correct.

    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich