10 Jan, 2019

1 commit

  • [ Upstream commit 8203e2d844d34af247a151d8ebd68553a6e91785 ]

    Sergey reported that forwarding was no longer working
    if fq packet scheduler was used.

    This is caused by the recent switch to EDT model, since incoming
    packets might have been timestamped by __net_timestamp()

    __net_timestamp() uses ktime_get_real(), while fq expects packets
    using CLOCK_MONOTONIC base.

    The fix is to clear skb->tstamp in forwarding paths.

    Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Reported-by: Sergey Matyukevich
    Tested-by: Sergey Matyukevich
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

02 Aug, 2018

1 commit

  • After IPv4 packets are forwarded, the priority of the corresponding SKB
    is updated according to the TOS field of IPv4 header. This overrides any
    prioritization done earlier by e.g. an skbedit action or ingress-qos-map
    defined at a vlan device.

    Such overriding may not always be desirable. Even if the packet ends up
    being routed, which implies this is an L3 network node, an administrator
    may wish to preserve whatever prioritization was done earlier on in the
    pipeline.

    Therefore introduce a sysctl that controls this behavior. Keep the
    default value at 1 to maintain backward-compatible behavior.

    Signed-off-by: Petr Machata
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     

05 Mar, 2018

1 commit

  • If you take a GSO skb, and split it into packets, will the network
    length (L3 headers + L4 headers + payload) of those packets be small
    enough to fit within a given MTU?

    skb_gso_validate_mtu gives you the answer to that question. However,
    we recently added to add a way to validate the MAC length of a split GSO
    skb (L2+L3+L4+payload), and the names get confusing, so rename
    skb_gso_validate_mtu to skb_gso_validate_network_len

    Signed-off-by: Daniel Axtens
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Daniel Axtens
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Nov, 2016

1 commit

  • Some configurations (e.g. geneve interface with default
    MTU of 1500 over an ethernet interface with 1500 MTU) result
    in the transmission of packets that exceed the configured MTU.
    While this should be considered to be a "bad" configuration,
    it is still allowed and should not result in the sending
    of packets that exceed the configured MTU.

    Fix by dropping the assumption in ip_finish_output_gso() that
    locally originated gso packets will never need fragmentation.
    Basic testing using iperf (observing CPU usage and bandwidth)
    have shown no measurable performance impact for traffic not
    requiring fragmentation.

    Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
    Reported-by: Jan Tluka
    Signed-off-by: Lance Richardson
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Lance Richardson
     

20 Jul, 2016

1 commit


04 Jun, 2016

1 commit

  • skb_gso_network_seglen is not enough for checking fragment sizes if
    skb is using GSO_BY_FRAGS as we have to check frag per frag.

    This patch introduces skb_gso_validate_mtu, based on the former, which
    will wrap the use case inside it as all calls to skb_gso_network_seglen
    were to validate if it fits on a given TMU, and improve the check.

    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

28 Apr, 2016

2 commits


02 Mar, 2016

1 commit

  • After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
    skb_sender_cpu_clear() becomes empty and can be removed.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

08 Oct, 2015

1 commit


18 Sep, 2015

5 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done it's job passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known which allows the network namespace to
    be easily and reliabily.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in then the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Compute struct net from the input device in ip_forward before it is
    used.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Add a sock paramter to dst_output making dst_output_sk superfluous.
    Add a skb->sk parameter to all of the callers of dst_output
    Have the callers of dst_output_sk call dst_output.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

25 May, 2015

1 commit

  • Send icmp pmtu error if we find that the largest fragment of df-skb
    exceeded the output path mtu.

    The ip output path will still catch this later on but we can avoid the
    forward/postrouting hook traversal by rejecting right away.

    This is what ipv6 already does.

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

21 Apr, 2015

1 commit

  • Initial discussion was:
    [FYI] xfrm: Don't lookup sk_policy for timewait sockets

    Forwarded frames should not have a socket attached. Especially
    tw sockets will lead to panics later-on in the stack.

    This was observed with TPROXY assigning a tw socket and broken
    policy routing (misconfigured). As a result frame enters
    forwarding path instead of input. We cannot solve this in
    TPROXY as it cannot know that policy routing is broken.

    v2:
    Remove useless comment

    Signed-off-by: Sebastian Poehn
    Signed-off-by: David S. Miller

    Sebastian Pöhn
     

08 Apr, 2015

1 commit

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, is potentially the socket used to control a tunneling
    socket, such as one the encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     

12 Mar, 2015

1 commit

  • John reported that my previous commit added a regression
    on his router.

    This is because sender_cpu & napi_id share a common location,
    so get_xps_queue() can see garbage and perform an out of bound access.

    We need to make sure sender_cpu is cleared before doing the transmit,
    otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
    this bug.

    Signed-off-by: Eric Dumazet
    Reported-by: John
    Bisected-by: John
    Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jan, 2015

1 commit

  • Not caching dst_entries which cause redirects could be exploited by hosts
    on the same subnet, causing a severe DoS attack. This effect aggravated
    since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()").

    Lookups causing redirects will be allocated with DST_NOCACHE set which
    will force dst_release to free them via RCU. Unfortunately waiting for
    RCU grace period just takes too long, we can end up with >1M dst_entries
    waiting to be released and the system will run OOM. rcuos threads cannot
    catch up under high softirq load.

    Attaching the flag to emit a redirect later on to the specific skb allows
    us to cache those dst_entries thus reducing the pressure on allocation
    and deallocation.

    This issue was discovered by Marcelo Leitner.

    Cc: Julian Anastasov
    Signed-off-by: Marcelo Leitner
    Signed-off-by: Florian Westphal
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

13 May, 2014

1 commit

  • As suggested by several people, rename local_df to ignore_df,
    since it means "ignore df bit if it is set".

    Cc: Maciej Żenczykowski
    Cc: Florian Westphal
    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    WANG Cong
     

08 May, 2014

2 commits

  • Doing the segmentation in the forward path has one major drawback:

    When using virtio, we may process gso udp packets coming
    from host network stack. In that case, netfilter POSTROUTING
    will see one packet with udp header followed by multiple ip
    fragments.

    Delay the segmentation and do it after POSTROUTING invocation
    to avoid this.

    Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • local_df means 'ignore DF bit if set', so if its set we're
    allowed to perform ip fragmentation.

    This wasn't noticed earlier because the output path also drops such skbs
    (and emits needed icmp error) and because netfilter ip defrag did not
    set local_df until couple of days ago.

    Only difference is that DF-packets-larger-than MTU now discarded
    earlier (f.e. we avoid pointless netfilter postrouting trip).

    While at it, drop the repeated test ip_exceeds_mtu, checking it once
    is enough...

    Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

19 Feb, 2014

1 commit


14 Feb, 2014

2 commits

  • Packets which have L2 address different from ours should be
    already filtered before entering into ip_forward().

    Perform that check at the beginning to avoid processing such packets.

    Signed-off-by: Denis Kirjanov
    Signed-off-by: David S. Miller

    Denis Kirjanov
     
  • Marcelo Ricardo Leitner reported problems when the forwarding link path
    has a lower mtu than the incoming one if the inbound interface supports GRO.

    Given:
    Host R1 R2

    Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.

    In this case, the kernel will fail to send ICMP fragmentation needed
    messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
    checks in forward path. Instead, Linux tries to send out packets exceeding
    the mtu.

    When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
    not fragment the packets when forwarding, and again tries to send out
    packets exceeding R1-R2 link mtu.

    This alters the forwarding dstmtu checks to take the individual gso
    segment lengths into account.

    For ipv6, we send out pkt too big error for gso if the individual
    segments are too big.

    For ipv4, we either send icmp fragmentation needed, or, if the DF bit
    is not set, perform software segmentation and let the output path
    create fragments when the packet is leaving the machine.
    It is not 100% correct as the error message will contain the headers of
    the GRO skb instead of the original/segmented one, but it seems to
    work fine in my (limited) tests.

    Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
    sofware segmentation.

    However it turns out that skb_segment() assumes skb nr_frags is related
    to mss size so we would BUG there. I don't want to mess with it considering
    Herbert and Eric disagree on what the correct behavior should be.

    Hannes Frederic Sowa notes that when we would shrink gso_size
    skb_segment would then also need to deal with the case where
    SKB_MAX_FRAGS would be exceeded.

    This uses sofware segmentation in the forward path when we hit ipv4
    non-DF packets and the outgoing link mtu is too small. Its not perfect,
    but given the lack of bug reports wrt. GRO fwd being broken this is a
    rare case anyway. Also its not like this could not be improved later
    once the dust settles.

    Acked-by: Herbert Xu
    Reported-by: Marcelo Ricardo Leitner
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

14 Jan, 2014

1 commit

  • While forwarding we should not use the protocol path mtu to calculate
    the mtu for a forwarded packet but instead use the interface mtu.

    We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
    introduced for multicast forwarding. But as it does not conflict with
    our usage in unicast code path it is perfect for reuse.

    I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
    along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
    dependencies because of IPSKB_FORWARDED.

    Because someone might have written a software which does probe
    destinations manually and expects the kernel to honour those path mtus
    I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
    can disable this new behaviour. We also still use mtus which are locked on a
    route for forwarding.

    The reason for this change is, that path mtus information can be injected
    into the kernel via e.g. icmp_err protocol handler without verification
    of local sockets. As such, this could cause the IPv4 forwarding path to
    wrongfully emit fragmentation needed notifications or start to fragment
    packets along a path.

    Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
    won't be set and further fragmentation logic will use the path mtu to
    determine the fragmentation size. They also recheck packet size with
    help of path mtu discovery and report appropriate errors.

    Cc: Eric Dumazet
    Cc: David Miller
    Cc: John Heffner
    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

09 Oct, 2012

2 commits

  • Add new flag to remember when route is via gateway.
    We will use it to allow rt_gateway to contain address of
    directly connected host for the cases when DST_NOCACHE is
    used or when the NH exception caches per-destination route
    without DST_NOCACHE flag, i.e. when routes are not used for
    other destinations. By this way we force the neighbour
    resolving to work with the routed destination but we
    can use different address in the packet, feature needed
    for IPVS-DR where original packet for virtual IP is routed
    via route to real IP.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • After the change "Adjust semantics of rt->rt_gateway"
    (commit f8126f1d51) rt_gateway can be 0 but ip_forward() compares
    it directly with nexthop. What we want here is to check if traffic
    is to directly connected nexthop and to fail if using gateway.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

08 Jun, 2012

1 commit

  • RFC 4293 defines ipIfStatsOutOctets (similar definition for
    ipSystemStatsOutOctets):

    The total number of octets in IP datagrams delivered to the lower
    layers for transmission. Octets from datagrams counted in
    ipIfStatsOutTransmits MUST be counted here.

    And ipIfStatsOutTransmits:

    The total number of IP datagrams that this entity supplied to the
    lower layers for transmission. This includes datagrams generated
    locally and those forwarded by this entity.

    Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
    IPSTATS_MIB_OUTFORWDATAGRAMS.

    IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
    include forwarded datagrams:

    The total number of IP datagrams that local IP user-protocols
    (including ICMP) supplied to IP in requests for transmission. Note
    that this counter does not include any datagrams counted in
    ipIfStatsOutForwDatagrams.

    Signed-off-by: Vincent Bernat
    Signed-off-by: David S. Miller

    Vincent Bernat
     

16 Apr, 2012

1 commit


24 Nov, 2011

1 commit

  • We can not update iph->daddr in ip_options_rcv_srr(), It is too early.
    When some exception ocurred later (eg. in ip_forward() when goto
    sr_failed) we need the ip header be identical to the original one as
    ICMP need it.

    Add a field 'nexthop' in struct ip_options to save nexthop of LSRR
    or SSRR option.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
     

13 May, 2011

2 commits


11 Jun, 2010

1 commit


20 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit


03 Jun, 2009

1 commit

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)
    This one should replace occurrences of :
    dst_release(skb->dst)
    skb->dst = NULL;

    Delete skb->dst field

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet