28 Jul, 2018

1 commit

  • [ Upstream commit 3dd1c9a1270736029ffca670e9bd0265f4120600 ]

    The skb hash for locally generated ip[v6] fragments belonging
    to the same datagram can vary in several circumstances:
    * for connected UDP[v6] sockets, the first fragment gets its hash
    via set_owner_w()/skb_set_hash_from_sk()
    * for unconnected IPv6 UDPv6 sockets, the first fragment can get
    its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
    auto_flowlabel is enabled

    For the following frags the hash is usually computed via
    skb_get_hash().
    The above can cause out-of-order (OoO) delivery for unconnected IPv6
    UDPv6 sockets: in that scenario the egress tx queue can be selected on
    a per-packet basis via the skb hash.
    It may also fool flow-oriented schedulers into placing fragments
    belonging to the same datagram in different flows.

    Fix the issue by copying the skb hash from the head frag into
    the others at fragmentation time.
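
The fix described above can be pictured with a small userspace model. This is only a sketch: `struct pkt`, `pkt_copy_hash()` and `fragments_inherit_hash()` are hypothetical names invented for illustration, and the struct carries just the three fields shown in the perf output (`hash`, `l4_hash`, `sw_hash`), not a real `struct sk_buff`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hash-related fields of an skb. */
struct pkt {
    uint32_t hash;
    unsigned l4_hash : 1;
    unsigned sw_hash : 1;
};

/* Copy the flow hash and both flag bits from one packet to another. */
void pkt_copy_hash(struct pkt *to, const struct pkt *from)
{
    assert(to != NULL && from != NULL);
    to->hash = from->hash;
    to->l4_hash = from->l4_hash;
    to->sw_hash = from->sw_hash;
}

/* Give every later fragment the hash of the head fragment (index 0). */
void fragments_inherit_hash(struct pkt *frags, size_t n)
{
    for (size_t i = 1; i < n; i++)
        pkt_copy_hash(&frags[i], &frags[0]);
}
```

With the hash copied at fragmentation time, anything that selects a queue or flow from the hash sees the same value for every fragment of the datagram.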

    Before this commit:
    perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
    netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
    perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
    perf script
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0

    After this commit:
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0

    Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
    Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

25 May, 2018

1 commit

  • [ Upstream commit 113f99c3358564a0647d444c2ae34e8b1abfd5b9 ]

    Device features may change during transmission. In particular with
    corking, a device may toggle scatter-gather in between allocating
    and writing to an skb.

    Do not unconditionally assume that !NETIF_F_SG at write time implies
    that the same held at alloc time and thus the skb has sufficient
    tailroom.
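
A minimal sketch of the difference, assuming a simplified model in which the device has a single scatter-gather bit and the buffer reports its free linear space. The names `dev_model`, `buf_model`, `linear_write_ok_old()` and `linear_write_ok_new()` are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: one feature bit, and free linear bytes in a buffer. */
struct dev_model { bool sg; };
struct buf_model { size_t tailroom; };

/* Pre-fix rule: "no scatter-gather now" is taken as proof of enough tailroom. */
bool linear_write_ok_old(const struct dev_model *dev,
                         const struct buf_model *buf, size_t copy)
{
    (void)buf;
    (void)copy;
    return !dev->sg;
}

/* Post-fix rule: the tailroom is checked at write time as well. */
bool linear_write_ok_new(const struct dev_model *dev,
                         const struct buf_model *buf, size_t copy)
{
    assert(dev != NULL && buf != NULL);
    return !dev->sg && buf->tailroom >= copy;
}
```

If the device dropped scatter-gather after the buffer was allocated with no spare tailroom, the old rule would still approve a linear write, while the new rule rejects it and leaves the caller to take its non-linear path.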

    This issue predates git history.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

23 Aug, 2017

1 commit

  • Remove two references to ufo in the udp send path that are no longer
    reachable now that ufo has been removed.

    Commit 85f1bd9a7b5a ("udp: consistently apply ufo or fragmentation")
    is a fix to ufo. It is safe to revert what remains of it.

    Also, no skb can enter ip_append_page with skb_is_gso true now that
    skb_shinfo(skb)->gso_type is no longer set in ip_append_page/_data.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

07 Aug, 2017

1 commit

  • __ip_options_echo() uses the current network namespace, and
    currently retrieves it via skb->dst->dev.

    This commit adds an explicit 'net' argument to __ip_options_echo()
    and updates all the call sites to provide it, usually via a simpler
    sock_net().

    After this change, __ip_options_echo() no longer needs to access
    skb->dst and we can drop a couple of hacks to preserve such
    info in the rx path.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

21 Jul, 2017

1 commit


18 Jul, 2017

1 commit


16 Jul, 2017

1 commit

  • Some time ago David Woodhouse reported skb_under_panic
    when we try to push an ethernet header to fragmented ipv6 skbs.
    It was fixed for ipv6 by Florian Westphal in
    commit 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak")

    However, a similar problem still exists in ipv4.

    It does not trigger skb_under_panic due to a paranoid check
    in ip_finish_output2; however, according to Alexey Kuznetsov
    the current state is abnormal and ip_fragment should be fixed too.

    Signed-off-by: Vasily Averin
    Signed-off-by: David S. Miller

    Vasily Averin
     

04 Jul, 2017

1 commit

  • SYN-ACK responses on a server in response to a SYN from a client
    did not get the injected skb mark that was tagged on the SYN packet.

    Fixes: 84f39b08d786 ("net: support marking accepting TCP sockets")
    Reviewed-by: Lorenzo Colitti
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

01 Jul, 2017

1 commit

  • The refcount_t type and its corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
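
To illustrate why a dedicated counter type helps, here is a userspace sketch of a saturating reference counter built on C11 atomics. It is not the kernel's refcount_t implementation; `ref_model`, `ref_init()`, `ref_inc()` and `ref_dec_and_test()` are hypothetical names, and the saturate-at-UINT_MAX policy is an assumption chosen for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating counter; not the kernel's refcount_t. */
struct ref_model { atomic_uint refs; };

void ref_init(struct ref_model *r, unsigned n)
{
    atomic_init(&r->refs, n);
}

/* Increment, but stick at UINT_MAX instead of wrapping around to 0. */
void ref_inc(struct ref_model *r)
{
    unsigned old = atomic_load(&r->refs);
    do {
        if (old == UINT_MAX)
            return; /* saturated: leak the object rather than wrap */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Decrement; return true only when the last reference was dropped. */
bool ref_dec_and_test(struct ref_model *r)
{
    unsigned old = atomic_load(&r->refs);
    do {
        if (old == UINT_MAX)
            return false; /* a saturated counter is never released */
        assert(old != 0); /* dropping a reference nobody holds */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
    return old == 1;
}
```

A plain atomic counter would wrap to zero on overflow and let a later decrement free an object that is still in use; the saturating variant trades that for a leak.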

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

24 Jun, 2017

1 commit

  • Our customer encountered stuck NFS writes for blocks starting at specific
    offsets w.r.t. page boundary caused by networking stack sending packets via
    UFO enabled device with wrong checksum. The problem can be reproduced by
    composing a long UDP datagram from multiple parts using MSG_MORE flag:

    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 3000, 0, ...);

    Assume this packet is to be routed via a device with MTU 1500 and
    NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
    this condition is tested (among others) to decide whether to call
    ip_ufo_append_data():

    ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

    At the moment, we already have skb with 1028 bytes of data which is not
    marked for GSO so that the test is false (fragheaderlen is usually 20).
    Thus we append second 1000 bytes to this skb without invoking UFO. Third
    sendto(), however, has sufficient length to trigger the UFO path so that we
    end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
    uses udp_csum() to calculate the checksum but that assumes all fragments
    have correct checksum in skb->csum which is not true for UFO fragments.

    When checking against MTU, we need to add skb->len to length of new segment
    if we already have a partially filled skb and fragheaderlen only if there
    isn't one.
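
The IPv4 rule above can be checked with a small model of the two conditions. This is a sketch under stated assumptions: `skb_model`, `ufo_check_old()` and `ufo_check_new()` are hypothetical names, and only the MTU part of the real condition is modelled (the "among others" tests are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pending skb: bytes queued and GSO marking. */
struct skb_model { size_t len; bool gso; };

/* Condition quoted in the text: ignores data already queued in skb. */
bool ufo_check_old(size_t length, size_t fragheaderlen, size_t mtu,
                   const struct skb_model *skb)
{
    assert(mtu > 0);
    return (length + fragheaderlen) > mtu || (skb != NULL && skb->gso);
}

/* Rule from the text: add skb->len if an skb exists, else fragheaderlen. */
bool ufo_check_new(size_t length, size_t fragheaderlen, size_t mtu,
                   const struct skb_model *skb)
{
    size_t base = skb != NULL ? skb->len : fragheaderlen;

    assert(mtu > 0);
    return (length + base) > mtu || (skb != NULL && skb->gso);
}
```

For the second sendto() in the example (1000 new bytes, 1028 bytes already queued, MTU 1500, fragheaderlen 20), the old test sees 1020 and stays on the non-UFO path, while the new test sees 2028 and selects UFO.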

    In the IPv6 case, skb can only be null if this is the first segment so that
    we have to use headersize (length of the first IPv6 header) rather than
    fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.

    Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
    Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
    ip6 fragment between __ip6_append_data and ip6_finish_output")
    Signed-off-by: Michal Kubecek
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Michal Kubeček
     

10 Mar, 2017

1 commit

  • commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed
    packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on
    later transformed packets") added a check that 'rt->dst.header_len' isn't
    zero in order to skip UFO, but it doesn't include IPcomp in transport mode
    where it equals zero.

    Packets, after payload compression, may not require further fragmentation,
    and if original length exceeds MTU, later compressed packets will be
    transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
    on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

    * IPv4 case, offset is wrong + unnecessary fragmentation
    udp_ipsec.sh -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
    proto Compressed IP (108), length 49)
    10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
    IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
    proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

    * IPv6 case, sending small fragments
    udp_ipsec.sh -6 -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
    payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
    payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

    Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
    if xfrm is set. So the new check will include both cases: IPcomp and IPsec.
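
The two checks can be compared on a simplified route entry. This is a sketch only: `dst_model`, `skip_ufo_old()` and `skip_ufo_new()` are hypothetical names, and the struct keeps just the two fields the text mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two route fields discussed in the text. */
struct dst_model {
    unsigned header_len; /* extra header bytes added by a transform */
    const void *xfrm;    /* non-NULL when a transform state is attached */
};

/* Earlier check: skip UFO only when the transform adds header bytes. */
bool skip_ufo_old(const struct dst_model *dst)
{
    assert(dst != NULL);
    return dst->header_len != 0;
}

/* New check: skip UFO whenever any transform is attached. */
bool skip_ufo_new(const struct dst_model *dst)
{
    assert(dst != NULL);
    return dst->xfrm != NULL;
}
```

For IPcomp in transport mode the model has a transform attached but a header_len of zero, so only the new check skips UFO.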

    Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets")
    Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed packets")
    Signed-off-by: Alexey Kodanev
    Signed-off-by: David S. Miller

    Alexey Kodanev
     

12 Feb, 2017

1 commit


08 Feb, 2017

2 commits

  • When the same struct dst_entry can be used for many different
    neighbours we cannot use it for pending confirmations.

    The datagram protocols can use MSG_CONFIRM to confirm the
    neighbour. When used with MSG_PROBE we do not reach the
    code where neighbour is confirmed, so we have to do the
    same slow lookup by using the dst_confirm_neigh() helper.
    When MSG_PROBE is not used, ip_append_data/ip6_append_data
    will set the skb flag dst_pending_confirm.

    Reported-by: YueHaibing
    Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
    Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Add a new skbuff flag to allow protocols to confirm the neighbour.
    When the same struct dst_entry can be used for many different
    neighbours we cannot use it for pending confirmations.

    Add sock_confirm_neigh() helper to confirm the neighbour and
    use it for IPv4, IPv6 and VRF before dst_neigh_output.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     

10 Jan, 2017

1 commit

  • Otherwise, RST packets generated by the TCP stack for non-existing
    sockets always have mark 0.
    The mark from the original packet is assigned to the netns_ipv4/6
    socket used to send the response so that it can get copied into the
    response skb when the socket sends it.

    Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
    Cc: Lorenzo Colitti
    Signed-off-by: Pau Espin Pedrol
    Signed-off-by: Pablo Neira Ayuso

    Pau Espin Pedrol
     

25 Dec, 2016

1 commit


20 Dec, 2016

1 commit

  • …_data and ip_finish_output

    The conditional checks in __ip_append_data and ip_finish_output are
    inconsistent: the variable length in __ip_append_data includes only the
    application's payload and the udp header, not the ip header, whereas
    ip_finish_output tests (skb->len > ip_skb_dst_mtu(skb)), and skb->len
    includes the ip header.

    As a result, udp payloads whose length is between (MTU - IP Header)
    and MTU are fragmented by ip_fragment even though rst->dev supports
    the UFO feature.

    Add the length of the ip header to length in __ip_append_data so that
    its check is consistent with the one ip_finish_output uses for ip
    fragmentation.
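
The mismatch can be reproduced with plain arithmetic. The sketch below assumes a 20-byte IP header and an MTU of 1500; the function names are hypothetical and model only the length comparison, not the rest of either kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Append-time test before the change: length excludes the IP header. */
bool append_wants_ufo_old(size_t length, size_t mtu)
{
    assert(mtu > 0);
    return length > mtu;
}

/* Append-time test after the change: the IP header is counted too. */
bool append_wants_ufo_new(size_t length, size_t iphdr_len, size_t mtu)
{
    assert(mtu > 0);
    return length + iphdr_len > mtu;
}

/* Output-time test: skb_len already includes the IP header. */
bool output_wants_fragment(size_t skb_len, size_t mtu)
{
    assert(mtu > 0);
    return skb_len > mtu;
}
```

With 1490 bytes of payload plus udp header, the old append-time test does not choose UFO, yet the finished packet is 1510 bytes and is fragmented at output time. The new test chooses UFO for the same input, so the two stages agree.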

    Signed-off-by: Zheng Li <james.z.li@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    zheng li
     

17 Dec, 2016

1 commit

  • Pull vfs updates from Al Viro:

    - more ->d_init() stuff (work.dcache)

    - pathname resolution cleanups (work.namei)

    - a few missing iov_iter primitives - copy_from_iter_full() and
    friends. Either copy the full requested amount, advance the iterator
    and return true, or fail, return false and do _not_ advance the
    iterator. Quite a few open-coded callers converted (and became more
    readable and harder to fuck up that way) (work.iov_iter)

    - several assorted patches, the big one being logfs removal
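
The all-or-nothing contract described for copy_from_iter_full() in the list above can be sketched in userspace. This is a model under assumptions, not the kernel primitive: `iter_model`, `iter_copy()` and `iter_copy_full()` are hypothetical names, the iterator is a flat buffer, and the only failure modelled is a source that is too short (the real primitives can also fail part-way on a fault).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat-buffer iterator. */
struct iter_model { const char *data; size_t remaining; };

/* Partial style: copy what is available, advance, return bytes copied. */
size_t iter_copy(void *dst, size_t n, struct iter_model *it)
{
    size_t k = n < it->remaining ? n : it->remaining;

    memcpy(dst, it->data, k);
    it->data += k;
    it->remaining -= k;
    return k;
}

/* All-or-nothing style: on failure the iterator is left untouched. */
bool iter_copy_full(void *dst, size_t n, struct iter_model *it)
{
    assert(dst != NULL && it != NULL);
    if (n > it->remaining)
        return false;
    memcpy(dst, it->data, n);
    it->data += n;
    it->remaining -= n;
    return true;
}
```

A caller of the partial variant has to compare the return value with the requested size and undo the advance itself on a short copy; the full variant reduces that to a single boolean test.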

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    logfs: remove from tree
    vfs: fix put_compat_statfs64() does not handle errors
    namei: fold should_follow_link() with the step into not-followed link
    namei: pass both WALK_GET and WALK_MORE to should_follow_link()
    namei: invert WALK_PUT logics
    namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
    namei: saner calling conventions for mountpoint_last()
    namei.c: get rid of user_path_parent()
    switch getfrag callbacks to ..._full() primitives
    make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
    [iov_iter] new primitives - copy_from_iter_full() and friends
    don't open-code file_inode()
    ceph: switch to use of ->d_init()
    ceph: unify dentry_operations instances
    lustre: switch to use of ->d_init()

    Linus Torvalds
     

06 Dec, 2016

1 commit


04 Dec, 2016

1 commit

  • Couple conflicts resolved here:

    1) In the MACB driver, a bug fix to properly initialize the
    RX tail pointer overlapped with some changes
    to support variable sized rings.

    2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
    overlapping with a reorganization of the driver to support
    ACPI, OF, as well as PCI variants of the chip.

    3) In 'net' we had several probe error path bug fixes to the
    stmmac driver, meanwhile a lot of this code was cleaned up
    and reorganized in 'net-next'.

    4) The cls_flower classifier obtained a helper function in
    'net-next' called __fl_delete() and this overlapped with
    Daniel Borkmann's bug fix to use RCU for object destruction
    in 'net'. It also overlapped with Jiri's change to guard
    the rhashtable_remove_fast() call with a check against
    tc_skip_sw().

    5) In mlx4, a revert bug fix in 'net' overlapped with some
    unrelated changes in 'net-next'.

    6) In geneve, a stale header pointer after pskb_expand_head()
    bug fix in 'net' overlapped with a large reorganization of
    the same code in 'net-next'. Since the 'net-next' code no
    longer had the bug in question, there was nothing to do
    other than to simply take the 'net-next' hunks.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Dec, 2016

1 commit

  • When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

    where skb_gso_segment() relies on skb->protocol to function properly.

    This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
    fixing a bug where GSO packets sent through a sit tunnel are dropped
    when xfrm is involved.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eli Cooper
    Signed-off-by: David S. Miller

    Eli Cooper
     

26 Nov, 2016

1 commit

  • If the cgroup associated with the receiving socket has eBPF
    programs installed, run them from ip_output(), ip6_output() and
    ip_mc_output(). From mentioned functions we have two socket contexts
    as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
    okfn()."). We explicitly need to use sk instead of skb->sk here,
    since otherwise the same program would run multiple times on egress
    when encap devices are involved, which is not desired in our case.

    eBPF programs used in this context are expected to either return 1 to
    let the packet pass, or != 1 to drop it. The programs have access to
    the skb through bpf_skb_load_bytes(), and the payload starts at the
    network headers (L3).

    Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
    for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
    the feature is unused.
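
The verdict convention described above (1 passes, anything else drops) can be modelled in userspace as follows. This is a sketch: `prog_fn`, `run_egress_filter()` and the two sample programs are hypothetical, the programs are ordinary C functions rather than eBPF, and returning -EPERM for a drop is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical program type: sees the packet starting at the L3 header. */
typedef int (*prog_fn)(const unsigned char *pkt, size_t len);

/* Return 0 to let the packet continue, -EPERM (assumed) to drop it. */
int run_egress_filter(prog_fn prog, const unsigned char *pkt, size_t len)
{
    assert(pkt != NULL);
    if (prog == NULL)
        return 0; /* nothing attached: behaves like the stubbed-out nop */
    return prog(pkt, len) == 1 ? 0 : -EPERM;
}

/* Sample program: pass everything. */
int prog_allow_all(const unsigned char *pkt, size_t len)
{
    (void)pkt;
    (void)len;
    return 1;
}

/* Sample program: drop IPv4 UDP (protocol byte 9 of the IPv4 header is 17). */
int prog_drop_ipv4_udp(const unsigned char *pkt, size_t len)
{
    if (len >= 20 && (pkt[0] >> 4) == 4 && pkt[9] == 17)
        return 0;
    return 1;
}
```

Because the payload handed to the program starts at the network header, the sample program can read the IPv4 version and protocol fields at fixed offsets without skipping a link-layer header.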

    Signed-off-by: Daniel Mack
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Mack
     

20 Nov, 2016

1 commit

  • 1) cast to "int" is unnecessary:
    u8 will be promoted to int before decrementing,
    small positive numbers fit into "int", so their values won't be changed
    during promotion.

    Once everything is int including loop counters, signedness doesn't
    matter: 32-bit operations will stay 32-bit operations.

    But! Someone tried to make this loop smart by making everything of
    the same type apparently in an attempt to optimise it.
    Do the optimization, just differently.
    Do the cast where it matters. :^)

    2) frag size is unsigned entity and sum of fragments sizes is also
    unsigned.

    Make everything unsigned, leave no MOVSX instruction behind.

    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
    function old new delta
    skb_cow_data 835 834 -1
    ip_do_fragment 2549 2548 -1
    ip6_fragment 3130 3128 -2
    Total: Before=154865032, After=154865028, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

15 Nov, 2016

1 commit


11 Nov, 2016

1 commit


10 Nov, 2016

1 commit

  • Lorenzo noted an Android unit test failed due to e0d56fdd7342:
    "The expectation in the test was that the RST replying to a SYN sent to a
    closed port should be generated with oif=0. In other words it should not
    prefer the interface where the SYN came in on, but instead should follow
    whatever the routing table says it should do."

    Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
    that the oif in the flow is set to the skb_iif only if skb_iif is an L3
    master.

    Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls")
    Reported-by: Lorenzo Colitti
    Signed-off-by: David Ahern
    Tested-by: Lorenzo Colitti
    Acked-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    David Ahern
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket, (e.g., ping
    replies) use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespace by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

04 Nov, 2016

1 commit

  • Some configurations (e.g. geneve interface with default
    MTU of 1500 over an ethernet interface with 1500 MTU) result
    in the transmission of packets that exceed the configured MTU.
    While this should be considered to be a "bad" configuration,
    it is still allowed and should not result in the sending
    of packets that exceed the configured MTU.

    Fix by dropping the assumption in ip_finish_output_gso() that
    locally originated gso packets will never need fragmentation.
    Basic testing using iperf (observing CPU usage and bandwidth)
    has shown no measurable performance impact for traffic not
    requiring fragmentation.
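
The dropped assumption can be shown as a decision function. This is a sketch of the behaviour the text describes, not the body of ip_finish_output_gso(): the enum, the function names and the 1550-byte segment length are hypothetical values picked for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum gso_action { GSO_PASS, GSO_SEGMENT_THEN_FRAGMENT };

/* Before: locally originated GSO packets are assumed to always fit. */
enum gso_action gso_action_old(bool forwarded, size_t seglen, size_t mtu)
{
    assert(mtu > 0);
    if (!forwarded)
        return GSO_PASS;
    return seglen <= mtu ? GSO_PASS : GSO_SEGMENT_THEN_FRAGMENT;
}

/* After: the segment length is checked no matter where the packet came from. */
enum gso_action gso_action_new(size_t seglen, size_t mtu)
{
    assert(mtu > 0);
    return seglen <= mtu ? GSO_PASS : GSO_SEGMENT_THEN_FRAGMENT;
}
```

For a locally generated packet whose segments come to 1550 bytes on a 1500-byte MTU, the old decision passes the packet through unchanged and the new one segments and fragments it; packets that already fit take the same path as before.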

    Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
    Reported-by: Jan Tluka
    Signed-off-by: Lance Richardson
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Lance Richardson
     

18 Oct, 2016

1 commit

  • Remove the unused but set variable dev in ip_do_fragment to fix the
    following GCC warning when building with 'W=1':

    net/ipv4/ip_output.c: In function ‘ip_do_fragment’:
    net/ipv4/ip_output.c:541:21: warning: variable ‘dev’ set but not used [-Wunused-but-set-variable]

    Signed-off-by: Tobias Klauser
    Signed-off-by: David S. Miller

    Tobias Klauser
     

11 Sep, 2016

3 commits


31 Aug, 2016

1 commit

  • Today mpls iptunnel lwtunnel_output redirect expects the tunnel
    output function to handle fragmentation. This is ok but can be
    avoided if we did not do the mpls output redirect too early.
    i.e., we could wait until ip fragmentation is done and then call
    mpls output for each ip fragment.

    To make this work we will need,
    1) the lwtunnel state to carry encap headroom
    2) and do the redirect to the encap output handler on the ip fragment
    (essentially do the output redirect after fragmentation)

    This patch adds tunnel headroom in lwtstate to make sure we
    account for tunnel data in mtu calculations during fragmentation
    and adds new xmit redirect handler to redirect to lwtunnel xmit func
    after ip fragmentation.

    This includes IPV6 and some mtu fixes and testing from David Ahern.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    Roopa Prabhu
     

20 Jul, 2016

1 commit


07 Jul, 2016

1 commit


30 Jun, 2016

1 commit

  • ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
    calls ip_sk_use_pmtu which casts sk as an inet_sk).

    However, in the case of UDP tunneling, the skb->sk is not necessarily an
    inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
    tun/tap).

    OTOH, the sk passed as an argument throughout IP stack's output path is
    the one which is of PMTU interest:
    - In case of local sockets, sk is same as skb->sk;
    - In case of a udp tunnel, sk is the tunneling socket.

    Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
    This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'

    Signed-off-by: Shmulik Ladkani
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     

04 Jun, 2016

1 commit

  • skb_gso_network_seglen is not enough for checking fragment sizes if
    skb is using GSO_BY_FRAGS as we have to check frag per frag.

    This patch introduces skb_gso_validate_mtu, based on the former, which
    wraps that use case, since every call to skb_gso_network_seglen was made
    to check whether the packet fits a given MTU, and improves the check.
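
A sketch of the per-fragment check, under simplifying assumptions: `gso_model`, `BY_FRAGS_MODEL` and `gso_fits_mtu()` are hypothetical, every segment is taken to carry the same `hdr_len` bytes of headers, and the fragment sizes live in a plain array instead of a fragment list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "segment sizes follow the fragments". */
#define BY_FRAGS_MODEL ((size_t)-1)

struct gso_model {
    size_t hdr_len;         /* header bytes carried by each segment */
    size_t gso_size;        /* uniform payload size, or BY_FRAGS_MODEL */
    const size_t *frag_len; /* per-fragment payload sizes */
    size_t nfrags;
};

/* True when every segment the packet will produce fits within mtu. */
bool gso_fits_mtu(const struct gso_model *g, size_t mtu)
{
    assert(g != NULL);
    if (g->gso_size != BY_FRAGS_MODEL)
        return g->hdr_len + g->gso_size <= mtu;

    /* One uniform size is not enough here: check frag per frag. */
    for (size_t i = 0; i < g->nfrags; i++)
        if (g->hdr_len + g->frag_len[i] > mtu)
            return false;
    return true;
}
```

Wrapping the comparison in one helper means a caller only asks "does this fit the MTU?" and the by-fragments case is handled in one place.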

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

09 Mar, 2016

1 commit


25 Feb, 2016

1 commit

  • Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
    skbs down. This can easily happen with UDP+IPv4 sockets with the first
    MSG_MORE write smaller than the MTU, second write is a sendfile.

    Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
    where we calculate the checksum manually during copying.
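
The fallback can be sketched as a caller that tries the page-append path first and switches to the copying path on -EOPNOTSUPP. Everything here is a hypothetical model: `cork_model`, `append_page_model()` and `send_model()` are invented names, and the condition for refusing the page append is reduced to "the queued data is not in partial-checksum mode".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum csum_model { CSUM_MODEL_NONE, CSUM_MODEL_PARTIAL };

/* Hypothetical corked-socket state: checksum mode of queued data, byte count. */
struct cork_model { enum csum_model head_csum; size_t queued; };

/* Page-append path: refuses unless the queued data is CSUM_MODEL_PARTIAL. */
int append_page_model(struct cork_model *c, size_t len)
{
    assert(c != NULL);
    if (c->head_csum != CSUM_MODEL_PARTIAL)
        return -EOPNOTSUPP;
    c->queued += len;
    return 0;
}

/* Caller: try the page path, fall back to the copying path if refused. */
int send_model(struct cork_model *c, size_t len, bool *used_copy_path)
{
    int err = append_page_model(c, len);

    *used_copy_path = false;
    if (err == -EOPNOTSUPP) {
        c->queued += len; /* copying path: checksum computed during the copy */
        *used_copy_path = true;
        err = 0;
    }
    return err;
}
```

The error code acts as a signal rather than a failure: the data is still sent, only through the path that computes the checksum while copying.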

    Commit d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
    sockets") started to expose this bug.

    Fixes: d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
    Reported-by: Jiri Benc
    Cc: Jiri Benc
    Reported-by: Wakko Warner
    Cc: Wakko Warner
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

17 Feb, 2016

1 commit