04 Nov, 2016

1 commit

  • Some configurations (e.g. geneve interface with default
    MTU of 1500 over an ethernet interface with 1500 MTU) result
    in the transmission of packets that exceed the configured MTU.
    While this should be considered to be a "bad" configuration,
    it is still allowed and should not result in the sending
    of packets that exceed the configured MTU.

    Fix by dropping the assumption in ip_finish_output_gso() that
    locally originated gso packets will never need fragmentation.
    Basic testing using iperf (observing CPU usage and bandwidth)
    has shown no measurable performance impact for traffic not
    requiring fragmentation.

    Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
    Reported-by: Jan Tluka
    Signed-off-by: Lance Richardson
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Lance Richardson
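
As a rough user-space illustration of the change described above (not the kernel code itself), the decision can be modelled as a function of the per-segment length and the MTU. The names `gso_action`, `gso_output_action` and the `assume_local_fits` flag are hypothetical and exist only to contrast the old assumption with the new behaviour.

```c
#include <stdbool.h>

/* Hypothetical outcome names, for illustration only. */
enum gso_action {
	GSO_SEND_AS_IS,           /* every segment fits: pass the GSO packet down */
	GSO_SEGMENT_AND_FRAGMENT  /* segment in software, fragment oversized segments */
};

/*
 * seg_len:           network-layer length of one GSO segment.
 * forwarded:         true if the packet went through the forwarding path.
 * assume_local_fits: models the old assumption that locally originated
 *                    GSO packets never need fragmentation.
 */
enum gso_action gso_output_action(unsigned int seg_len, unsigned int mtu,
				  bool forwarded, bool assume_local_fits)
{
	/* Old behaviour: local packets skipped the MTU check entirely. */
	if (assume_local_fits && !forwarded)
		return GSO_SEND_AS_IS;

	/* New behaviour: the segment length is always compared to the MTU. */
	return seg_len <= mtu ? GSO_SEND_AS_IS : GSO_SEGMENT_AND_FRAGMENT;
}
```

With the assumption dropped, a locally generated packet whose segments are 1550 bytes over a 1500-byte MTU is segmented and fragmented instead of being sent oversized.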
     

20 Jul, 2016

1 commit


04 Jun, 2016

1 commit

  • skb_gso_network_seglen is not enough for checking fragment sizes if
    skb is using GSO_BY_FRAGS as we have to check frag per frag.

    This patch introduces skb_gso_validate_mtu, based on the former, which
    will wrap the use case inside it as all calls to skb_gso_network_seglen
    were to validate if it fits on a given MTU, and improve the check.

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
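
The per-fragment check can be sketched in user space as follows. This is a simplified model, not the kernel's skb_gso_validate_mtu: `struct gso_pkt`, its fields and `gso_validate_mtu` are assumptions made for the example, and `by_frags` stands in for the GSO_BY_FRAGS case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the GSO-related state of an skb. */
struct gso_pkt {
	unsigned int hdr_len;         /* header bytes carried by every segment */
	unsigned int gso_size;        /* payload per segment when by_frags is false */
	bool by_frags;                /* models GSO_BY_FRAGS */
	const unsigned int *frag_len; /* payload length of each fragment */
	size_t nr_frags;
};

/* Returns true if every resulting segment fits in mtu. */
bool gso_validate_mtu(const struct gso_pkt *p, unsigned int mtu)
{
	size_t i;

	/* Uniform segments: a single length check is enough. */
	if (!p->by_frags)
		return p->hdr_len + p->gso_size <= mtu;

	/* Segments follow fragment sizes: check frag per frag. */
	for (i = 0; i < p->nr_frags; i++)
		if (p->hdr_len + p->frag_len[i] > mtu)
			return false;
	return true;
}
```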
     

28 Apr, 2016

2 commits


02 Mar, 2016

1 commit

  • After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
    skb_sender_cpu_clear() becomes empty and can be removed.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

08 Oct, 2015

1 commit


18 Sep, 2015

5 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done its job, passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known, which allows the network namespace to
    be determined easily and reliably.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in than the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
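
To make the replacement concrete, here is a small user-space model of the two ways a hook could find its network namespace. The struct layouts are heavily simplified assumptions (the real `struct nf_hook_state`, `struct net_device` and `struct net` carry far more state), and `hook_net_old` / `hook_net_new` are hypothetical helper names.

```c
/* Minimal stand-ins for the kernel types, for illustration only. */
struct net { int id; };
struct net_device { struct net *nd_net; };

struct nf_hook_state {
	struct net_device *in;   /* input device, may be absent */
	struct net_device *out;  /* output device, may be absent */
	struct net *net;         /* namespace passed in by the caller */
};

/* Old pattern: guess the namespace from whichever device is present. */
struct net *hook_net_old(const struct nf_hook_state *state)
{
	const struct net_device *dev = state->in ? state->in : state->out;

	return dev->nd_net;
}

/* New pattern: the call site already knows the namespace and passes it. */
struct net *hook_net_new(const struct nf_hook_state *state)
{
	return state->net;
}
```

In the common case both helpers return the same namespace. They differ only when the caller deliberately passes a namespace that is not the one of the first device, as in the exceptions listed above.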
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Compute struct net from the input device in ip_forward before it is
    used.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Add a sock parameter to dst_output, making dst_output_sk superfluous.
    Add a skb->sk parameter to all of the callers of dst_output.
    Have the callers of dst_output_sk call dst_output.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

25 May, 2015

1 commit

  • Send icmp pmtu error if we find that the largest fragment of df-skb
    exceeded the output path mtu.

    The ip output path will still catch this later on but we can avoid the
    forward/postrouting hook traversal by rejecting right away.

    This is what ipv6 already does.

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
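
A hedged sketch of the resulting forward-path check, written as a user-space model rather than the kernel function: `struct fwd_pkt` and `exceeds_mtu` are hypothetical, and the GSO case is left out for brevity. `frag_max_size` models the size of the largest fragment recorded when the packet was reassembled.

```c
#include <stdbool.h>

struct fwd_pkt {
	unsigned int len;           /* length of the (possibly reassembled) packet */
	unsigned int frag_max_size; /* largest original fragment, 0 if none recorded */
	bool df;                    /* DF bit set in the IP header */
	bool ignore_df;             /* fragmentation allowed despite DF */
};

/* Returns true if the packet must be rejected with an ICMP pmtu error. */
bool exceeds_mtu(const struct fwd_pkt *p, unsigned int mtu)
{
	if (p->len <= mtu)
		return false;
	if (!p->df)
		return false;   /* may be fragmented on output */

	/* The original fragment was too big and DF was set: reject now. */
	if (p->frag_max_size > mtu)
		return true;

	if (p->ignore_df)
		return false;   /* refragmenting is permitted */
	return true;
}
```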
     

21 Apr, 2015

1 commit

  • Initial discussion was:
    [FYI] xfrm: Don't lookup sk_policy for timewait sockets

    Forwarded frames should not have a socket attached. Especially
    tw sockets will lead to panics later-on in the stack.

    This was observed with TPROXY assigning a tw socket and broken
    policy routing (misconfigured). As a result the frame enters the
    forwarding path instead of input. We cannot solve this in
    TPROXY as it cannot know that policy routing is broken.

    v2:
    Remove useless comment

    Signed-off-by: Sebastian Poehn
    Signed-off-by: David S. Miller

    Sebastian Pöhn
     

08 Apr, 2015

1 commit

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, is potentially the socket used to control a tunneling
    socket, such as one that encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     

12 Mar, 2015

1 commit

  • John reported that my previous commit added a regression
    on his router.

    This is because sender_cpu & napi_id share a common location,
    so get_xps_queue() can see garbage and perform an out of bound access.

    We need to make sure sender_cpu is cleared before doing the transmit,
    otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
    this bug.

    Signed-off-by: Eric Dumazet
    Reported-by: John
    Bisected-by: John
    Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
    Signed-off-by: David S. Miller

    Eric Dumazet
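
The aliasing problem can be reproduced in a small user-space model. Everything here is an assumption made for illustration: `struct pkt_meta`, `NR_CPUS_MODEL`, `pick_tx_queue` and `sender_cpu_clear` are hypothetical, and the sketch adds a bounds check (returning -1) precisely so the out-of-bounds case can be observed safely.

```c
#define NR_CPUS_MODEL 4

/* Two fields sharing one storage location, as described above. */
struct pkt_meta {
	union {
		unsigned int napi_id;    /* set on receive when busy poll is enabled */
		unsigned int sender_cpu; /* CPU number + 1, 0 meaning "not set" */
	};
};

/*
 * Returns the transmit queue for the packet, or -1 where an unchecked
 * lookup would have indexed past the end of the per-CPU map.
 */
int pick_tx_queue(const struct pkt_meta *m,
		  const int queue_of_cpu[NR_CPUS_MODEL])
{
	unsigned int cpu;

	if (m->sender_cpu == 0)
		return queue_of_cpu[0]; /* unset: this sketch falls back to CPU 0 */

	cpu = m->sender_cpu - 1;
	if (cpu >= NR_CPUS_MODEL)
		return -1;              /* garbage left over from napi_id */
	return queue_of_cpu[cpu];
}

/* Clearing the shared field before transmit removes the stale napi_id. */
void sender_cpu_clear(struct pkt_meta *m)
{
	m->sender_cpu = 0;
}
```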
     

27 Jan, 2015

1 commit

  • Not caching dst_entries which cause redirects could be exploited by hosts
    on the same subnet, causing a severe DoS attack. This effect has been
    aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()").

    Lookups causing redirects will be allocated with DST_NOCACHE set which
    will force dst_release to free them via RCU. Unfortunately waiting for
    RCU grace period just takes too long, we can end up with >1M dst_entries
    waiting to be released and the system will run OOM. rcuos threads cannot
    catch up under high softirq load.

    Attaching the flag to emit a redirect later on to the specific skb allows
    us to cache those dst_entries thus reducing the pressure on allocation
    and deallocation.

    This issue was discovered by Marcelo Leitner.

    Cc: Julian Anastasov
    Signed-off-by: Marcelo Leitner
    Signed-off-by: Florian Westphal
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

13 May, 2014

1 commit

  • As suggested by several people, rename local_df to ignore_df,
    since it means "ignore df bit if it is set".

    Cc: Maciej Żenczykowski
    Cc: Florian Westphal
    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    WANG Cong
     

08 May, 2014

2 commits

  • Doing the segmentation in the forward path has one major drawback:

    When using virtio, we may process gso udp packets coming
    from host network stack. In that case, netfilter POSTROUTING
    will see one packet with udp header followed by multiple ip
    fragments.

    Delay the segmentation and do it after POSTROUTING invocation
    to avoid this.

    Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • local_df means 'ignore DF bit if set', so if it is set we are
    allowed to perform ip fragmentation.

    This wasn't noticed earlier because the output path also drops such skbs
    (and emits the needed icmp error) and because netfilter ip defrag did not
    set local_df until a couple of days ago.

    The only difference is that DF packets larger than the MTU are now
    discarded earlier (e.g. we avoid a pointless netfilter postrouting trip).

    While at it, drop the repeated test ip_exceeds_mtu, checking it once
    is enough...

    Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

19 Feb, 2014

1 commit


14 Feb, 2014

2 commits

  • Packets which have L2 address different from ours should be
    already filtered before entering into ip_forward().

    Perform that check at the beginning to avoid processing such packets.

    Signed-off-by: Denis Kirjanov
    Signed-off-by: David S. Miller

    Denis Kirjanov
     
  • Marcelo Ricardo Leitner reported problems when the forwarding link path
    has a lower mtu than the incoming one if the inbound interface supports GRO.

    Given:
    Host R1 R2

    Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.

    In this case, the kernel will fail to send ICMP fragmentation needed
    messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
    checks in forward path. Instead, Linux tries to send out packets exceeding
    the mtu.

    When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
    not fragment the packets when forwarding, and again tries to send out
    packets exceeding R1-R2 link mtu.

    This alters the forwarding dstmtu checks to take the individual gso
    segment lengths into account.

    For ipv6, we send out pkt too big error for gso if the individual
    segments are too big.

    For ipv4, we either send icmp fragmentation needed, or, if the DF bit
    is not set, perform software segmentation and let the output path
    create fragments when the packet is leaving the machine.
    It is not 100% correct as the error message will contain the headers of
    the GRO skb instead of the original/segmented one, but it seems to
    work fine in my (limited) tests.

    Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
    software segmentation.

    However it turns out that skb_segment() assumes skb nr_frags is related
    to mss size so we would BUG there. I don't want to mess with it considering
    Herbert and Eric disagree on what the correct behavior should be.

    Hannes Frederic Sowa notes that when we would shrink gso_size
    skb_segment would then also need to deal with the case where
    SKB_MAX_FRAGS would be exceeded.

    This uses software segmentation in the forward path when we hit ipv4
    non-DF packets and the outgoing link mtu is too small. It's not perfect,
    but given the lack of bug reports wrt. GRO fwd being broken this is a
    rare case anyway. Also it's not like this could not be improved later
    once the dust settles.

    Acked-by: Herbert Xu
    Reported-by: Marcelo Ricardo Leitner
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

14 Jan, 2014

1 commit

  • While forwarding we should not use the protocol path mtu to calculate
    the mtu for a forwarded packet but instead use the interface mtu.

    We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
    introduced for multicast forwarding. But as it does not conflict with
    our usage in unicast code path it is perfect for reuse.

    I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
    along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
    dependencies because of IPSKB_FORWARDED.

    Because someone might have written software which probes
    destinations manually and expects the kernel to honour those path mtus,
    I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
    can disable this new behaviour. We also still use mtus which are locked on a
    route for forwarding.

    The reason for this change is that path mtu information can be injected
    into the kernel via e.g. icmp_err protocol handler without verification
    of local sockets. As such, this could cause the IPv4 forwarding path to
    wrongfully emit fragmentation needed notifications or start to fragment
    packets along a path.

    Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
    won't be set and further fragmentation logic will use the path mtu to
    determine the fragmentation size. They also recheck packet size with
    help of path mtu discovery and report appropriate errors.

    Cc: Eric Dumazet
    Cc: David Miller
    Cc: John Heffner
    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
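
The selection rule described in this commit can be sketched as a small user-space function. It is a model under stated assumptions, not the kernel helper: `struct route_model`, `mtu_maybe_forward` and `IP_MAX_MTU_MODEL` are hypothetical names, and the sysctl is passed in as a plain boolean.

```c
#include <stdbool.h>

#define IP_MAX_MTU_MODEL 0xFFFFU /* largest length an IPv4 header can express */

struct route_model {
	unsigned int path_mtu;   /* MTU learned for the path (e.g. via ICMP) */
	unsigned int dev_mtu;    /* MTU of the outgoing interface */
	bool mtu_locked;         /* MTU locked on the route by the administrator */
};

unsigned int mtu_maybe_forward(const struct route_model *rt, bool forwarding,
			       bool ip_forward_use_pmtu)
{
	/* Local traffic, locked routes and the opt-in knob keep the path MTU. */
	if (ip_forward_use_pmtu || rt->mtu_locked || !forwarding)
		return rt->path_mtu;

	/* Forwarded traffic ignores the possibly injected path MTU. */
	return rt->dev_mtu < IP_MAX_MTU_MODEL ? rt->dev_mtu : IP_MAX_MTU_MODEL;
}
```

With a path MTU of 576 and an interface MTU of 1500, forwarded packets are checked against 1500 by default, while locally generated packets still honour 576.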
     

09 Oct, 2012

2 commits

  • Add new flag to remember when route is via gateway.
    We will use it to allow rt_gateway to contain address of
    directly connected host for the cases when DST_NOCACHE is
    used or when the NH exception caches per-destination route
    without DST_NOCACHE flag, i.e. when routes are not used for
    other destinations. This way we force the neighbour
    resolving to work with the routed destination but we
    can use a different address in the packet, a feature needed
    for IPVS-DR where the original packet for the virtual IP is routed
    via a route to the real IP.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • After the change "Adjust semantics of rt->rt_gateway"
    (commit f8126f1d51) rt_gateway can be 0 but ip_forward() compares
    it directly with nexthop. What we want here is to check if traffic
    is to directly connected nexthop and to fail if using gateway.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

08 Jun, 2012

1 commit

  • RFC 4293 defines ipIfStatsOutOctets (similar definition for
    ipSystemStatsOutOctets):

    The total number of octets in IP datagrams delivered to the lower
    layers for transmission. Octets from datagrams counted in
    ipIfStatsOutTransmits MUST be counted here.

    And ipIfStatsOutTransmits:

    The total number of IP datagrams that this entity supplied to the
    lower layers for transmission. This includes datagrams generated
    locally and those forwarded by this entity.

    Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
    IPSTATS_MIB_OUTFORWDATAGRAMS.

    IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
    include forwarded datagrams:

    The total number of IP datagrams that local IP user-protocols
    (including ICMP) supplied to IP in requests for transmission. Note
    that this counter does not include any datagrams counted in
    ipIfStatsOutForwDatagrams.

    Signed-off-by: Vincent Bernat
    Signed-off-by: David S. Miller

    Vincent Bernat
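
The counting rules quoted from RFC 4293 can be illustrated with a toy set of counters. `struct ip_stats` and the two helper functions are hypothetical; they only mirror which counters move for locally generated versus forwarded datagrams.

```c
/* Toy counters named after the MIB objects discussed above. */
struct ip_stats {
	unsigned long out_requests;       /* ipIfStatsOutRequests */
	unsigned long out_forw_datagrams; /* ipIfStatsOutForwDatagrams */
	unsigned long out_octets;         /* ipIfStatsOutOctets */
};

/* Locally generated datagram: counted as a request, plus its octets. */
void count_local_output(struct ip_stats *s, unsigned int len)
{
	s->out_requests++;
	s->out_octets += len;
}

/* Forwarded datagram: octets are counted, but it is not a request. */
void count_forwarded(struct ip_stats *s, unsigned int len)
{
	s->out_forw_datagrams++;
	s->out_octets += len;
}
```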
     

16 Apr, 2012

1 commit


24 Nov, 2011

1 commit

  • We cannot update iph->daddr in ip_options_rcv_srr(); it is too early.
    When some exception occurs later (e.g. in ip_forward() when goto
    sr_failed) we need the ip header to be identical to the original one, as
    ICMP needs it.

    Add a field 'nexthop' in struct ip_options to save nexthop of LSRR
    or SSRR option.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
     

13 May, 2011

2 commits


11 Jun, 2010

1 commit


20 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit


03 Jun, 2009

2 commits

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)
    This one should replace occurrences of:
    dst_release(skb->dst)
    skb->dst = NULL;

    Delete skb->dst field

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
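
A minimal user-space model of the three accessors, to show how callers stop touching the field directly. The struct layouts, the `dst_private` field name and the simplistic reference counter are assumptions made for this sketch; the prototypes match the ones listed in the commit message.

```c
#include <stddef.h>

/* Toy versions of the kernel structures, for illustration only. */
struct dst_entry { int refcnt; };
struct sk_buff { struct dst_entry *dst_private; /* hypothetical field name */ };

/* Simplified release: just drops one reference if there is a dst. */
static void dst_release(struct dst_entry *dst)
{
	if (dst)
		dst->refcnt--;
}

struct dst_entry *skb_dst(const struct sk_buff *skb)
{
	return skb->dst_private;
}

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
	skb->dst_private = dst;
}

/* Replaces the open-coded "dst_release(skb->dst); skb->dst = NULL;" pair. */
void skb_dst_drop(struct sk_buff *skb)
{
	dst_release(skb->dst_private);
	skb->dst_private = NULL;
}
```

Hiding the field behind accessors lets the storage representation change later without touching every caller.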
     
  • Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

    Delete skb->rtable field

    Setting rtable is not allowed, just set dst instead as rtable is an alias.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Oct, 2008

1 commit


17 Jul, 2008

2 commits