31 Jan, 2019

1 commit

  • [ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]

    In certain cases, pskb_trim_rcsum() may change skb pointers.
    Reinitialize header pointers afterwards to avoid potential
    use-after-frees. Add a note in the documentation of
    pskb_trim_rcsum(). Found by KASAN.

    Signed-off-by: Ross Lagerwall
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ross Lagerwall
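
The pointer-reload rule described in this commit can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code: `struct buf`, `buf_hdr()`, `buf_trim()` and `trim_and_read_version()` are hypothetical stand-ins for an sk_buff, `ip_hdr()` and `pskb_trim_rcsum()`. The trim always moves the data to a new allocation to model the worst case, so the caller has to derive the header pointer again afterwards.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an sk_buff: one heap buffer plus a header offset. */
struct buf {
    unsigned char *head;   /* start of the allocation; may move on trim */
    size_t len;            /* bytes currently in use */
    size_t hdr_off;        /* offset of the header inside head */
};

/* Derive the header pointer from the current head, in the spirit of ip_hdr(). */
static unsigned char *buf_hdr(const struct buf *b)
{
    return b->head + b->hdr_off;
}

/*
 * Trim to new_len. To model the worst case, the data always moves into a
 * fresh allocation, so any pointer computed before the call is stale.
 * Returns 0 on success, -1 on failure (the buffer is left untouched).
 */
static int buf_trim(struct buf *b, size_t new_len)
{
    unsigned char *fresh;

    if (new_len == 0 || new_len > b->len)
        return -1;
    fresh = malloc(new_len);
    if (!fresh)
        return -1;
    memcpy(fresh, b->head, new_len);
    free(b->head);
    b->head = fresh;
    b->len = new_len;
    return 0;
}

/* Trim first, then reload the header pointer instead of reusing an old one. */
static int trim_and_read_version(struct buf *b, size_t new_len)
{
    const unsigned char *hdr;

    if (buf_trim(b, new_len) != 0)
        return -1;
    hdr = buf_hdr(b);      /* reinitialize after the trim */
    return hdr[0] >> 4;    /* IPv4 version nibble */
}
```

A pointer obtained from `buf_hdr()` before `buf_trim()` would point into freed memory here, which is the kind of access KASAN reports.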
     

01 Oct, 2017

1 commit

  • Currently no error is emitted, but this infrastructure will be
    used by the next patch to allow source address validation
    for mcast sockets.
    Since early demux can do a route lookup and an ipv4 route
    lookup can return an error code this is consistent with the
    current ipv4 route infrastructure.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

25 Mar, 2017

1 commit

  • Certain systems process a significant unconnected UDP workload.
    It would be preferable to disable UDP early demux for those systems
    and enable it for TCP only.

    By disabling UDP demux, we see these slight gains on an ARM64 system:
    782 -> 788Mbps unconnected single stream UDPv4
    633 -> 654Mbps unconnected UDPv4 different sources

    The performance impact can change based on CPU architecture and cache
    sizes. There will not be much difference if the entire UDP hash table
    is in cache.

    Both sysctls are enabled by default to preserve existing behavior.

    v1->v2: Change function pointer instead of adding conditional as
    suggested by Stephen.

    v2->v3: Read once in callers to avoid issues due to compiler
    optimizations. Also update commit message with the tests.

    v3->v4: Store and use read once result instead of querying pointer
    again incorrectly.

    v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

    Signed-off-by: Subash Abhinov Kasiviswanathan
    Suggested-by: Eric Dumazet
    Cc: Stephen Hemminger
    Cc: Tom Herbert
    Cc: David Miller
    Signed-off-by: David S. Miller

    subashab@codeaurora.org
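
The v1 to v4 notes above (swap a function pointer instead of adding a conditional, read it once, then use the stored result) can be sketched in plain C. This is an illustration under assumptions, not the kernel implementation: `early_demux_handler`, `set_early_demux()`, `sample_demux()` and `udp_rcv_finish()` are hypothetical names, and the volatile access is only a rough userspace analogue of the kernel's READ_ONCE().

```c
#include <stddef.h>

typedef int (*demux_fn)(int pkt);

/* Hypothetical per-protocol handler slot; NULL means early demux is off. */
static demux_fn early_demux_handler;

/* Counts how often the handler actually ran. */
static int demux_calls;

static int sample_demux(int pkt)
{
    demux_calls++;
    return pkt;
}

/* Analogue of toggling the sysctl: the pointer is swapped, no flag is tested later. */
static void set_early_demux(int enabled)
{
    early_demux_handler = enabled ? sample_demux : NULL;
}

/*
 * Load the pointer exactly once through a volatile access, then test and
 * call the local copy so that the check and the call see the same value.
 * Returns the handler's result, or -1 when demux is skipped.
 */
static int udp_rcv_finish(int pkt)
{
    demux_fn fn = *(demux_fn volatile *)&early_demux_handler;

    if (fn)
        return fn(pkt);
    return -1;
}
```

Testing the global and then calling through the global again would be two separate loads, and the pointer could be cleared between them; keeping the local copy avoids that.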
     

16 Sep, 2016

1 commit

  • The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
    the global VRF, this replaces skb->dev with the VRF master interface.
    When calling ip_route_input_noref() from here, the checks for forwarding
    look at this master device instead of the initial ingress interface.
    This will allow packets to be routed which normally would be dropped.
    For example, an interface that is not assigned an IP address should
    drop packets, but because the checking is against the master device, the
    packet will be forwarded.

    The fix here is to still call l3mdev_ip_rcv(), but remember the initial
    net_device. This is passed to the other functions within ip_rcv_finish,
    so they still see the original interface.

    Signed-off-by: Mark Tomlinson
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Mark Tomlinson
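
The idea of remembering the initial net_device can be shown with a small model. Everything here is a simplified assumption: `struct net_dev`, `struct pkt`, `l3_rcv()`, `route_input_ok()` and `vrf_rcv_finish()` are hypothetical, and a single `has_ip_addr` flag stands in for the real forwarding checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified device: only what the check needs. */
struct net_dev {
    const char *name;
    bool has_ip_addr;          /* stand-in for the real forwarding checks */
    struct net_dev *master;    /* VRF master, or NULL in the global VRF */
};

struct pkt {
    struct net_dev *dev;
};

/* Models l3mdev_ip_rcv(): on an enslaved port, dev becomes the master. */
static void l3_rcv(struct pkt *p)
{
    if (p->dev->master)
        p->dev = p->dev->master;
}

/* Models the input route check: accept only if the given device qualifies. */
static bool route_input_ok(const struct net_dev *dev)
{
    return dev->has_ip_addr;
}

/* Save the ingress device before the switch, then check against it. */
static bool vrf_rcv_finish(struct pkt *p)
{
    struct net_dev *orig_dev = p->dev;

    l3_rcv(p);
    return route_input_ok(orig_dev);
}
```

With a port that has no address enslaved to a master that does, checking `p->dev` after the switch would accept the packet, while checking `orig_dev` rejects it.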
     

12 May, 2016

2 commits

  • Applications such as OSPF and BFD need the original ingress device not
    the VRF device; the latter can be derived from the former. To that end
    add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
    the skb control buffer similar to IPv6. From there the pktinfo can just
    pull it from cb with the PKTINFO_SKB_CB cast.

    The previous patch moving the skb->dev change to L3 means nothing else
    is needed for IPv6; it just works.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Currently the VRF driver uses the rx_handler to switch the skb device
    to the VRF device. Switching the dev prior to the ip / ipv6 layer
    means the VRF driver has to duplicate IP/IPv6 processing which adds
    overhead and makes features such as retaining the ingress device index
    more complicated than necessary.

    This patch moves the hook to the L3 layer just after the first NF_HOOK
    for PRE_ROUTING. This location makes exposing the original ingress device
    trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
    in the future.

    dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
    with the switched device through the packet taps to maintain current
    behavior (tcpdump can be used on either the vrf device or the enslaved
    devices).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

28 Apr, 2016

4 commits


17 Feb, 2016

1 commit


11 Feb, 2016

1 commit

  • In order to solve a problem with 802.11, the so-called hole-196 attack,
    add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
    enabled, causes the stack to drop IPv4 unicast packets encapsulated in
    link-layer multi- or broadcast frames. Such frames can (as an attack)
    be created by any member of the same wireless network and transmitted
    as valid encrypted frames since the symmetric key for broadcast frames
    is shared between all stations.

    Additionally, enabling this option provides compliance with a SHOULD
    clause of RFC 1122.

    Reviewed-by: Julian Anastasov
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
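
The check this option performs can be approximated as follows. This is a sketch under assumptions rather than the kernel's logic: `enum l2_type` and the function names are hypothetical, addresses are in host byte order, and only the limited broadcast address is treated as an IP broadcast (subnet broadcasts would need routing information that is left out here).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-layer classification of the received frame. */
enum l2_type { L2_HOST, L2_BROADCAST, L2_MULTICAST };

/* IPv4 multicast is 224.0.0.0/4; daddr is in host byte order. */
static bool ipv4_is_multicast(uint32_t daddr)
{
    return (daddr & 0xf0000000u) == 0xe0000000u;
}

/* Only 255.255.255.255 is recognized as broadcast in this sketch. */
static bool ipv4_is_limited_broadcast(uint32_t daddr)
{
    return daddr == 0xffffffffu;
}

/* Drop an IPv4 unicast packet that arrived in a link-layer multi- or broadcast frame. */
static bool drop_unicast_in_l2_multicast(bool sysctl_enabled,
                                         enum l2_type l2, uint32_t daddr)
{
    if (!sysctl_enabled)
        return false;
    if (l2 == L2_HOST)
        return false;
    return !ipv4_is_multicast(daddr) && !ipv4_is_limited_broadcast(daddr);
}
```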
     

30 Jan, 2016

1 commit

  • We should not assume a valid protocol header is present,
    as this is not the case for IPv4 fragments.

    Let's avoid extra cache line misses and potential bugs
    if we actually find a socket and incorrectly use its dst.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
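
One way to read this commit is that early demux should be skipped whenever the packet is a fragment, because only the first fragment carries a transport header. The sketch below assumes that reading. `FRAG_MF` and `FRAG_OFFSET_MASK` mirror the usual IPv4 flag and offset masks, the function names are hypothetical, and `frag_off` is taken in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_MF          0x2000u   /* "more fragments" flag */
#define FRAG_OFFSET_MASK 0x1fffu   /* fragment offset, in 8-byte units */

/* frag_off is the IPv4 flags/offset field, already in host byte order. */
static bool is_fragment(uint16_t frag_off)
{
    return (frag_off & (FRAG_MF | FRAG_OFFSET_MASK)) != 0;
}

/* Only unfragmented packets are candidates for early socket demux. */
static bool may_early_demux(uint16_t frag_off)
{
    return !is_fragment(frag_off);
}
```

A packet with only the "don't fragment" bit (0x4000) set is not a fragment and still qualifies.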
     

13 Oct, 2015

2 commits

  • The function ip_defrag is called on both the input and the output
    paths of the networking stack. In particular conntrack when it is
    tracking outbound packets from the local machine calls ip_defrag.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • ip_call_ra_chain is called early in the forwarding chain from
    ip_forward and ip_mr_input, which makes skb->dev the correct
    expression to get the input network device and dev_net(skb->dev) a
    correct expression for the network namespace the packet is being
    processed in.

    Compute the network namespace and store it in a variable to make the
    code clearer.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

18 Sep, 2015

4 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done its job, passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known, which allows the network namespace to
    be determined easily and reliably.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in than the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

22 Jul, 2015

1 commit

  • Introduces a new dst_metadata which makes it possible to carry per-packet
    metadata between forwarding and processing elements via the skb->dst pointer.

    The structure is set up to be a union. Thus, each separate type of
    metadata requires its own dst instance. If demand arises to carry
    multiple types of metadata concurrently, metadata dst entries can be
    made stackable.

    The metadata dst entry is refcnt'ed as expected for now but a non
    reference counted use is possible if the reference is forced before
    queueing the skb.

    In order to allow allocating dsts with variable length, the existing
    dst_alloc() is split into a dst_alloc() and dst_init() function. The
    existing dst_init() function to initialize the subsystem is being
    renamed to dst_subsys_init() to make it clear what is what.

    The check before ip_route_input() is changed to ignore metadata dsts
    and drop the dst inside the routing function, so that metadata can be
    interpreted in a later commit.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

08 Apr, 2015

1 commit

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, potentially, is the socket used to control a tunneling
    socket, such as one that encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     

04 Apr, 2015

2 commits

  • The ipv4 code uses a mixture of coding styles. In some instances check
    for non-NULL pointer is done as x != NULL and sometimes as x. x is
    preferred according to checkpatch and this patch makes the code
    consistent by adopting the latter form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     
  • The ipv4 code uses a mixture of coding styles. In some instances check
    for NULL pointer is done as x == NULL and sometimes as !x. !x is
    preferred according to checkpatch and this patch makes the code
    consistent by adopting the latter form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

28 Jan, 2014

1 commit

  • I see a memory leak when using a transparent HTTP proxy using TPROXY
    together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

    unreferenced object 0xffff88008cba4a40 (size 1696):
    comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
    hex dump (first 32 bytes):
    0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
    02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmem_cache_alloc+0xad/0xb9
    [] sk_prot_alloc+0x29/0xc5
    [] sk_clone_lock+0x14/0x283
    [] inet_csk_clone_lock+0xf/0x7b
    [] netlink_broadcast+0x14/0x16
    [] tcp_create_openreq_child+0x1b/0x4c3
    [] tcp_v4_syn_recv_sock+0x38/0x25d
    [] tcp_check_req+0x25c/0x3d0
    [] tcp_v4_do_rcv+0x287/0x40e
    [] ip_route_input_noref+0x843/0xa55
    [] tcp_v4_rcv+0x4c9/0x725
    [] ip_local_deliver_finish+0xe9/0x154
    [] __netif_receive_skb+0x4b2/0x514
    [] process_backlog+0xee/0x1c5
    [] net_rx_action+0xa7/0x200
    [] add_interrupt_randomness+0x39/0x157

    But there are many more, resulting in the machine going OOM after some
    days.

    From looking at the TPROXY code, and with help from Florian, I see
    that the memory leak is introduced in tcp_v4_early_demux():

    void tcp_v4_early_demux(struct sk_buff *skb)
    {
            /* ... */

            iph = ip_hdr(skb);
            th = tcp_hdr(skb);

            if (th->doff < sizeof(struct tcphdr) / 4)
                    return;

            sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                                           iph->saddr, th->source,
                                           iph->daddr, ntohs(th->dest),
                                           skb->skb_iif);
            if (sk) {
                    skb->sk = sk;

    where the socket is assigned unconditionally to skb->sk, also bumping
    the refcnt on it. This is problematic, because in our case the skb
    has already a socket assigned in the TPROXY target. This then results
    in the leak I see.

    The very same issue seems to exist with IPv6, but I haven't tested it.

    Reviewed-by: Florian Westphal
    Signed-off-by: Holger Eitzenberger
    Signed-off-by: David S. Miller

    Holger Eitzenberger
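
The leak pattern described above can be modelled with a plain reference count. This is a toy model under assumptions, not kernel code: `struct sock` and `struct pkt` are hypothetical, `sock_hold()` only increments a counter, and `demux_guarded()` shows one possible guard (skip early demux when a socket is already attached).

```c
#include <stddef.h>

struct sock { int refcnt; };
struct pkt { struct sock *sk; };

/* Take one reference on the socket. */
static void sock_hold(struct sock *sk)
{
    sk->refcnt++;
}

/* Unconditional version: overwrites p->sk, so the earlier reference is lost. */
static void demux_unconditional(struct pkt *p, struct sock *found)
{
    sock_hold(found);
    p->sk = found;
}

/* Guarded version: do nothing when a socket is already attached. */
static void demux_guarded(struct pkt *p, struct sock *found)
{
    if (p->sk)
        return;
    sock_hold(found);
    p->sk = found;
}

/* Freeing the packet drops only the reference it can still see. */
static void pkt_free(struct pkt *p)
{
    if (p->sk) {
        p->sk->refcnt--;
        p->sk = NULL;
    }
}
```

After `demux_unconditional()` on a packet that already carried a socket, `pkt_free()` releases the newly attached socket while the first one keeps a reference that nothing will drop.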
     

09 Aug, 2013

1 commit

  • With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
    counters do not count the number of frames, but number of aggregated
    segments.

    It's probably too late to change this now.

    This patch adds four new counters, tracking number of frames, regardless
    of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

    Ip[6]NoECTPkts : Number of packets received with NOECT
    Ip[6]ECT1Pkts : Number of packets received with ECT(1)
    Ip[6]ECT0Pkts : Number of packets received with ECT(0)
    Ip[6]CEPkts : Number of packets received with Congestion Experienced

    lph37:~# nstat | egrep "Pkts|InReceive"
    IpInReceives 1634137 0.0
    Ip6InReceives 3714107 0.0
    Ip6InNoECTPkts 19205 0.0
    Ip6InECT0Pkts 52651828 0.0
    IpExtInNoECTPkts 33630 0.0
    IpExtInECT0Pkts 15581379 0.0
    IpExtInCEPkts 6 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
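
The per-ECN accounting can be sketched as a four-slot counter indexed by the ECN codepoint, which occupies the low two bits of the IPv4 TOS byte (or IPv6 traffic class). The names `enum ecn`, `struct ecn_counters`, `ecn_of()` and `count_frame()` are hypothetical; only the codepoint values are standard.

```c
#include <stdint.h>

/* ECN codepoints: the low two bits of the TOS / traffic class byte. */
enum ecn {
    ECN_NOT_ECT = 0,   /* not ECN-capable */
    ECN_ECT1    = 1,   /* ECT(1) */
    ECN_ECT0    = 2,   /* ECT(0) */
    ECN_CE      = 3    /* Congestion Experienced */
};

/* One counter per codepoint, indexed by enum ecn. */
struct ecn_counters {
    unsigned long pkts[4];
};

/* Extract the ECN codepoint, ignoring the DSCP bits. */
static enum ecn ecn_of(uint8_t tos)
{
    return (enum ecn)(tos & 0x3);
}

/* Account one received frame, regardless of how many segments it aggregates. */
static void count_frame(struct ecn_counters *c, uint8_t tos)
{
    c->pkts[ecn_of(tos)]++;
}
```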
     

17 Jul, 2013

1 commit

  • commit 45f00f99d6e ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
    performance regression for non GRO traffic, basically disabling
    IP early demux.

    IPv6 stack resets transport header in ip6_rcv() before calling
    IP early demux in ip6_rcv_finish(), while IPv4 does this only in
    ip_local_deliver_finish(), _after_ IP early demux.

    GRO traffic happened to enable IP early demux because transport header
    is also set in inet_gro_receive()

    Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
    same : transport_header should be set in ip_rcv() instead of
    ip_local_deliver_finish()

    ip_local_deliver_finish() can also use skb_network_header_len() which is
    faster than ip_hdrlen()

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Apr, 2013

1 commit

  • Add MIB counters for checksum errors in IP layer,
    and TCP/UDP/ICMP layers, to help diagnose problems.

    $ nstat -a | grep Csum
    IcmpInCsumErrors 72 0.0
    TcpInCsumErrors 382 0.0
    UdpInCsumErrors 463221 0.0
    Icmp6InCsumErrors 75 0.0
    Udp6InCsumErrors 173442 0.0
    IpExtInCsumErrors 10884 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
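
As an illustration of what such a counter tracks at the IP layer, here is a sketch of an IPv4 header checksum check in the style of RFC 1071 with an error counter next to it. The names `in_csum_errors`, `ip_checksum()` and `ip_header_ok()` are hypothetical, and the kernel uses its own optimized checksum routines and MIB macros instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of an InCsumErrors MIB counter. */
static unsigned long in_csum_errors;

/* One's-complement sum of 16-bit big-endian words, folded and inverted. */
static uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;   /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);   /* fold the carries back in */
    return (uint16_t)~sum;
}

/* A header whose checksum field is correct yields 0; otherwise count an error. */
static int ip_header_ok(const uint8_t *hdr, size_t len)
{
    if (ip_checksum(hdr, len) != 0) {
        in_csum_errors++;
        return 0;
    }
    return 1;
}
```

Verifying is the same computation as generating: with the checksum field filled in, the folded sum is 0xffff and its complement is zero.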
     

02 Mar, 2013

1 commit

  • I had a report recently of a user trying to use dropwatch to localise some frame
    loss, and they were getting false positives. It turned out they were using a user
    space SCTP stack that used raw sockets to grab frames. When we don't have a
    registered protocol for a given packet, we record it as a drop, even if a raw
    socket receives the frame. We should only record the drop in the event a raw
    socket doesn't exist to receive the frames.

    Tested successfully by the reporter.

    Signed-off-by: Neil Horman
    Reported-by: William Reich
    Tested-by: William Reich
    CC: "David S. Miller"
    CC: William Reich
    CC: eric.dumazet@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
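
The decision described in this commit reduces to a small truth table. The sketch below is a model under assumptions: `enum verdict`, `drops_recorded` and `local_deliver()` are hypothetical names, and the two booleans stand in for "a protocol handler is registered" and "a raw socket received the frame".

```c
#include <stdbool.h>

enum verdict { DELIVERED, CONSUMED_BY_RAW, DROPPED };

/* Hypothetical analogue of what a drop monitor would observe. */
static unsigned long drops_recorded;

/* Record a drop only when neither a protocol handler nor a raw socket took the frame. */
static enum verdict local_deliver(bool have_proto_handler, bool raw_sock_took_it)
{
    if (have_proto_handler)
        return DELIVERED;
    if (raw_sock_took_it)
        return CONSUMED_BY_RAW;   /* freed quietly, not reported as a drop */
    drops_recorded++;
    return DROPPED;
}
```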
     

06 Feb, 2013

1 commit


31 Jul, 2012

1 commit

  • early_demux() handlers should be called in RCU context, and as we
    use skb_dst_set_noref(skb, dst), the caller must not exit from RCU context
    before dst use (skb_dst(skb)) or release (skb_dst_drop(skb)).

    Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
    ->early_demux() are confusing and not needed :

    Protocol handlers are already in an RCU read lock section.
    (__netif_receive_skb() does the rcu_read_lock() )

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2012

2 commits

  • This is the IPv6 missing bits for infrastructure added in commit
    41063e9dd1195 (ipv4: Early TCP socket demux.)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

25 Jul, 2012

1 commit

  • 1) Remove an unneeded pskb_may_pull() in tcp_v4_early_demux()
    and fix a potential bug if skb->head was reallocated
    (iph & th pointers were not reloaded)

    TCP stack will pull/check headers anyway.

    2) must reload iph in ip_rcv_finish() after early_demux()
    call since skb->head might have changed.

    3) skb->dev->ifindex can be now replaced by skb->skb_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2012

1 commit


28 Jun, 2012

3 commits

  • It's completely unnecessary.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

    This change has several unwanted side effects:

    1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
    thus never create a real cached route.

    2) All TCP traffic will use DST_NOCACHE and never use the routing
    cache at all.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • DDoS synflood attacks hit the IP route cache badly.

    On typical machines, this cache is allowed to hold up to 8 million dst
    entries, 256 bytes each, for a total of 2GB of memory.

    rt_garbage_collect() triggers and tries to cleanup things.

    Eventually route cache is disabled but machine is under fire and might
    OOM and crash.

    This patch exploits the new TCP early demux, to set a nocache
    boolean in case incoming TCP frame is for a not yet ESTABLISHED or
    TIMEWAIT socket.

    This 'nocache' boolean is then used in case dst entry is not found in
    route cache, to create an unhashed dst entry (DST_NOCACHE)

    SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
    output dst for syncookies), so after this patch, a machine is able to
    absorb a DDOS synflood attack without polluting its IP route cache.

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jun, 2012

1 commit


23 Jun, 2012

1 commit

  • This change is meant to add a control for disabling early socket demux.
    The main motivation behind this patch is to provide an option to disable
    the feature as it adds an additional cost to routing that reduces overall
    throughput by up to 5%. For example one of my systems went from 12.1Mpps
    to 11.6 after the early socket demux was added. It looks like the reason
    for the regression is that we are now having to perform two lookups, first
    the one for an established socket, and then the one for the routing table.

    By adding this patch and toggling the value for ip_early_demux to 0 I am
    able to get back to the 12.1Mpps I was previously seeing.

    [ Move local variables in ip_rcv_finish() down into the basic
    block in which they are actually used. -DaveM ]

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.

    Signed-off-by: David S. Miller

    David S. Miller
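
The caching scheme described above can be sketched as a socket that remembers its input route and revalidates it before reuse. This is a simplified model under assumptions: `struct dst`, `struct sock`, `route_lookup()` and `input_dst()` are hypothetical, a single `obsolete` flag stands in for the dst->obsolete and dst->ops->check() validation, and the refresh happens immediately, whereas the commit defers it until the socket is locked.

```c
#include <stdbool.h>
#include <stddef.h>

struct dst {
    bool obsolete;   /* set when the route has been invalidated */
    int id;
};

struct sock {
    struct dst *rx_dst;   /* cached input route, or NULL */
};

/* Counts full route lookups so the saving is observable. */
static int route_lookups;

/* Stand-in for the routing table demux. */
static struct dst *route_lookup(struct dst *table_entry)
{
    route_lookups++;
    return table_entry;
}

/* Return the input route for a packet that has been matched to sk. */
static struct dst *input_dst(struct sock *sk, struct dst *table_entry)
{
    struct dst *d = sk->rx_dst;

    if (d && !d->obsolete)
        return d;             /* one demux: socket found, route reused */
    d = route_lookup(table_entry);
    sk->rx_dst = d;           /* refresh the cache (deferred in the real design) */
    return d;
}
```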